ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Ray gets stuck at the second training iteration #46795

Open yanzhaodong2024 opened 2 months ago

yanzhaodong2024 commented 2 months ago

What happened + What you expected to happen

I'm running online DPO code across multiple nodes with Megatron-LM. There are three nodes in total. Four GPUs are allocated to the actor model, with expert parallel size and pipeline parallel size both set to 2. The first training iteration completes, but in the second iteration the job hangs while the actor model is generating tokens. How can I find where the process is stuck, and does Ray support expert parallelism?
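On locating the hang: a minimal sketch, assuming the generation loop runs as ordinary Python inside each actor process. It uses the standard-library `faulthandler` module to dump every thread's stack when a block runs longer than a timeout. `hang_watchdog` and `generate_step` are hypothetical names for illustration and are not part of Ray or Megatron-LM.

```python
import faulthandler
import sys
import time
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s, file=None):
    """Dump all Python thread stacks if the wrapped block exceeds timeout_s."""
    target = file if file is not None else sys.stderr
    # repeat=True re-dumps every timeout_s seconds for as long as the block is stuck.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)
    try:
        yield
    finally:
        # Disarm the timer once the block returns (or raises).
        faulthandler.cancel_dump_traceback_later()


def generate_step(delay_s):
    # Hypothetical stand-in for one token-generation step.
    time.sleep(delay_s)


if __name__ == "__main__":
    # Finishes well inside the timeout, so nothing is dumped.
    with hang_watchdog(timeout_s=1.0):
        generate_step(0.1)
```

Wrapping the body of the generation loop on every rank in such a watchdog should make the stuck rank print the file and line it is blocked on. Inside a Ray actor the dump would normally end up in that worker's stderr log. Attaching externally with `ray stack` on the node, or `py-spy dump --pid <pid>`, is an alternative that needs no code changes.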

Versions / Dependencies

Versions: Ray=2.22.0 Python=3.10

Reproduction script

for context_length in range(start, max_sequence_length):
    xxx

# broadcast among pp ranks

# Only ep0 gets here; ep1 gets stuck somewhere in the loop
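One pattern that produces this symptom, offered as an assumption and not confirmed for this issue, is ranks disagreeing on how many loop iterations to run, so that one rank waits in a collective its peer never enters. The sketch below imitates that with plain threads and a `threading.Barrier` standing in for the collective. `run_rank` and `simulate` are hypothetical helpers, and no Ray, NCCL, or Megatron-LM calls are involved.

```python
import threading


def run_rank(rank, steps_per_rank, barrier, results):
    """Run this rank's loop, entering one 'collective' per step."""
    try:
        for _ in range(steps_per_rank[rank]):
            # Stand-in for a broadcast or all-to-all: every rank has to
            # arrive here the same number of times.
            barrier.wait(timeout=0.5)
        results[rank] = "finished"
    except threading.BrokenBarrierError:
        # The barrier timed out because a peer never arrived. A real
        # collective would typically keep blocking instead of raising.
        results[rank] = "stuck"


def simulate(steps_per_rank):
    """Start one thread per rank and report how each one ended."""
    n = len(steps_per_rank)
    barrier = threading.Barrier(n)
    results = [None] * n
    threads = [
        threading.Thread(target=run_rank, args=(r, steps_per_rank, barrier, results))
        for r in range(n)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(simulate([3, 3]))  # both ranks agree on the step count
    print(simulate([2, 3]))  # rank 0 leaves the loop one step early
```

If the real loop has a per-rank exit condition (for example, stopping once the local samples reach end-of-sequence), making all ranks agree on that condition before leaving the loop would rule this pattern out.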

Issue Severity

None

yanzhaodong2024 commented 2 months ago

logs: (MegatronActor pid=5991, ip=33.207.59.232) self._dummy_overflow_buf = torch.cuda.IntTensor([0]) (MegatronActor pid=12878, ip=33.207.59.232) /usr/local/lib/python3.10/dist-packages/apex/optimizers/fused_adam.py:112: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:83.) (MegatronActor pid=12878, ip=33.207.59.232) self._dummy_overflow_buf = torch.cuda.IntTensor([0]) (MegatronActor pid=12879, ip=33.207.59.232) /usr/local/lib/python3.10/dist-packages/apex/optimizers/fused_adam.py:112: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:83.) (MegatronActor pid=12879, ip=33.207.59.232) self._dummy_overflow_buf = torch.cuda.IntTensor([0]) (MegatronActor pid=5991, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.mlp.experts.local_experts.0.linear_fc1.weight (MegatronActor pid=5991, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.mlp.experts.local_experts.1.linear_fc1.weight (MegatronActor pid=5991, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer:Params for bucket 11 (40894464 elements): (MegatronActor pid=5991, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.experts.local_experts.3.linear_fc2.weight (MegatronActor pid=5991, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.experts.local_experts.2.linear_fc2.weight (MegatronActor pid=5991, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: 
module.decoder.layers.2.mlp.experts.local_experts.2.linear_fc1.weight (MegatronActor pid=5991, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.experts.local_experts.3.linear_fc1.weight (MegatronActor pid=5991, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer:Params for bucket 12 (40894464 elements): (MegatronActor pid=5991, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.experts.local_experts.1.linear_fc2.weight (MegatronActor pid=5991, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.experts.local_experts.0.linear_fc1.weight (MegatronActor pid=5991, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.experts.local_experts.1.linear_fc1.weight (MegatronActor pid=5991, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.experts.local_experts.0.linear_fc2.weight (MegatronActor pid=5991, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer:Params for bucket 13 (40894464 elements): (MegatronActor pid=5991, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.experts.local_experts.3.linear_fc2.weight (MegatronActor pid=5991, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.experts.local_experts.2.linear_fc2.weight (MegatronActor pid=5991, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.experts.local_experts.2.linear_fc1.weight (MegatronActor pid=5991, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.experts.local_experts.3.linear_fc1.weight (MegatronActor pid=5991, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer:Params for bucket 14 (40894464 elements): 
(MegatronActor pid=5991, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.experts.local_experts.1.linear_fc2.weight (MegatronActor pid=5991, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.experts.local_experts.0.linear_fc2.weight (MegatronActor pid=5991, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.experts.local_experts.0.linear_fc1.weight (MegatronActor pid=5991, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.experts.local_experts.1.linear_fc1.weight (MegatronActor pid=5991, ip=33.207.59.232) INFO:megatron.core.optimizer:Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=5e-06, min_lr=1.5e-06, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-08, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_grad_reduce=True, overlap_param_gather=True, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=True, timers=<megatron.core.timers.Timers object at 0x7ecdbe29bfd0>) (MegatronActor pid=5991, ip=33.207.59.232) > learning rate decay style: cosine (MegatronActor pid=12878, ip=33.207.59.232) in model_provider after init model (MegatronActor pid=12878, ip=33.207.59.232) ] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13584 [0] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13584 [0] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13584 [0] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via 
P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13584 [0] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13584 [0] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13584 [0] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13584 [0] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13584 [0] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13584 [0] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13584 [0] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13584 [0] NCCL INFO Connected all trees (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13584 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13584 [0] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13584 [0] NCCL INFO comm 0x56172e5f8f90 rank 1 nranks 4 cudaDev 0 nvmlDev 1 busId 13000 commId 0xb6a49f7b21c00a5b - Init COMPLETE (MegatronActor pid=12879, ip=33.207.59.232) in model_provider after init model (MegatronActor pid=12879, ip=33.207.59.232) > number of parameters on (tensor, pipeline) model parallel rank (0, 1): 1646596096 (MegatronActor pid=12879, ip=33.207.59.232) ] -> 1[1] via P2P/IPC/read (MegatronActor 
pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13582 [0] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13582 [0] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13582 [0] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13582 [0] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13582 [0] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13582 [0] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13582 [0] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13582 [0] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13582 [0] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13582 [0] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13582 [0] NCCL INFO Connected all trees (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13582 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13582 [0] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13582 [0] 
NCCL INFO comm 0x55a87ae26a10 rank 2 nranks 4 cudaDev 0 nvmlDev 2 busId 29000 commId 0xb6a49f7b21c00a5b - Init COMPLETE (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1 (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer:Params for bucket 1 (992284672 elements): (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.mlp.router.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.pre_mlp_layernorm.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.mlp.router.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.pre_mlp_layernorm.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.mlp.shared_mlp.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.self_attention.linear_proj.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.self_attention.linear_qkv.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.pre_mlp_layernorm.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.self_attention.linear_qkv.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.mlp.shared_mlp.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) 
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.mlp.router.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.pre_mlp_layernorm.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.self_attention.linear_qkv.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.self_attention.linear_qkv.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.mlp.shared_mlp.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.pre_mlp_layernorm.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.self_attention.linear_qkv.layer_norm_weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.mlp.shared_mlp.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.mlp.router.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.mlp.shared_mlp.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.router.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.self_attention.linear_qkv.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.output_layer.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: 
module.decoder.layers.5.self_attention.linear_qkv.layer_norm_weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.self_attention.linear_proj.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.mlp.router.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.self_attention.linear_qkv.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.self_attention.linear_qkv.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.self_attention.linear_proj.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.self_attention.linear_qkv.layer_norm_weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.self_attention.linear_proj.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.router.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.pre_mlp_layernorm.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.mlp.shared_mlp.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.self_attention.linear_qkv.weight (MegatronActor pid=12879, ip=33.207.59.232) 
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.shared_mlp.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.self_attention.linear_proj.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.mlp.shared_mlp.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.self_attention.linear_qkv.layer_norm_weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.mlp.router.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.mlp.shared_mlp.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.mlp.shared_mlp.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.pre_mlp_layernorm.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.mlp.shared_mlp.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.self_attention.linear_qkv.layer_norm_weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.self_attention.linear_proj.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.shared_mlp.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.final_layernorm.weight (MegatronActor pid=12879, ip=33.207.59.232) 
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.self_attention.linear_proj.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.mlp.shared_mlp.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.self_attention.linear_proj.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.mlp.shared_mlp.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.self_attention.linear_qkv.layer_norm_weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.shared_mlp.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.self_attention.linear_qkv.layer_norm_weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.shared_mlp.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.pre_mlp_layernorm.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1 (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer:Params for bucket 1 (654311424 elements): (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.mlp.experts.local_experts.2.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: 
module.decoder.layers.6.mlp.experts.local_experts.1.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.mlp.experts.local_experts.0.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.mlp.experts.local_experts.3.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.experts.local_experts.0.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.mlp.experts.local_experts.0.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.mlp.experts.local_experts.0.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.mlp.experts.local_experts.3.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.mlp.experts.local_experts.3.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.mlp.experts.local_experts.0.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.mlp.experts.local_experts.0.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.experts.local_experts.0.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.experts.local_experts.3.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) 
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.mlp.experts.local_experts.0.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.mlp.experts.local_experts.2.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.experts.local_experts.1.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.mlp.experts.local_experts.3.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.mlp.experts.local_experts.0.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.mlp.experts.local_experts.1.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.mlp.experts.local_experts.3.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.mlp.experts.local_experts.1.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.mlp.experts.local_experts.1.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.experts.local_experts.3.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.mlp.experts.local_experts.1.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.mlp.experts.local_experts.2.linear_fc2.weight (MegatronActor pid=12879, 
ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.experts.local_experts.2.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.mlp.experts.local_experts.2.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.mlp.experts.local_experts.2.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.mlp.experts.local_experts.3.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.mlp.experts.local_experts.0.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.mlp.experts.local_experts.0.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.mlp.experts.local_experts.3.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.mlp.experts.local_experts.2.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.experts.local_experts.2.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.experts.local_experts.2.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.mlp.experts.local_experts.1.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.mlp.experts.local_experts.2.linear_fc1.weight 
(MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.mlp.experts.local_experts.0.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.experts.local_experts.3.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.mlp.experts.local_experts.3.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.mlp.experts.local_experts.2.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.mlp.experts.local_experts.1.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.mlp.experts.local_experts.3.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.experts.local_experts.3.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.experts.local_experts.0.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.mlp.experts.local_experts.2.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.mlp.experts.local_experts.0.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.experts.local_experts.1.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: 
module.decoder.layers.7.mlp.experts.local_experts.3.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.mlp.experts.local_experts.1.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.mlp.experts.local_experts.2.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.mlp.experts.local_experts.1.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.experts.local_experts.1.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.mlp.experts.local_experts.2.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.mlp.experts.local_experts.1.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.experts.local_experts.2.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.mlp.experts.local_experts.0.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.mlp.experts.local_experts.3.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.mlp.experts.local_experts.1.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.experts.local_experts.1.linear_fc2.weight (MegatronActor pid=12879, ip=33.207.59.232) 
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.mlp.experts.local_experts.3.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.mlp.experts.local_experts.2.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.experts.local_experts.0.linear_fc1.weight (MegatronActor pid=12879, ip=33.207.59.232) INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.mlp.experts.local_experts.1.linear_fc2.weight (MegatronActor pid=12880, ip=33.207.59.232) in model_provider after init model (MegatronActor pid=12880, ip=33.207.59.232) annel 05/0 : 3[3] -> 2[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13583 [0] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13583 [0] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13583 [0] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13583 [0] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13583 [0] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13583 [0] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13583 [0] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) /usr/local/lib/python3.10/dist-packages/apex/optimizers/fused_adam.py:112: UserWarning: The torch.cuda.DtypeTensor constructors are 
no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:83.) (MegatronActor pid=12880, ip=33.207.59.232) self._dummy_overflow_buf = torch.cuda.IntTensor([0]) (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13583 [0] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13583 [0] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13583 [0] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13583 [0] NCCL INFO Connected all trees (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13583 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13583 [0] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13583 [0] NCCL INFO comm 0x55b7f3c0c580 rank 3 nranks 4 cudaDev 0 nvmlDev 3 busId 2d000 commId 0xb6a49f7b21c00a5b - Init COMPLETE debug, sync actor_init_job_ref_list done Loading checkpoint shards: 50%|█████ | 2/4 [00:10<00:10, 5.15s/it] Loading checkpoint shards: 75%|███████▌ | 3/4 [00:15<00:05, 5.14s/it] Loading checkpoint shards: 100%|██████████| 4/4 [00:18<00:00, 4.58s/it] (Ref pid=12953) length of tokenizer is 64000 (Ref pid=12953) resize_token_embeddings is 64000 length of actor_handle_list is 4 and length of ref_ref_list is 1 remote handler registered (MegatronActor pid=5991, ip=33.207.59.232) > datasets target sizes (minimum size): (MegatronActor pid=5991, ip=33.207.59.232) train: 2484 (MegatronActor 
pid=5991, ip=33.207.59.232) validation: 0 (MegatronActor pid=5991, ip=33.207.59.232) test: 0 (MegatronActor pid=5991, ip=33.207.59.232) > building train, validation, and test datasets for GPT ... (MegatronActor pid=5991, ip=33.207.59.232) data_prefix is ['/ML-A100/team/infra/jiangcheng/dataset/rm-static'] (MegatronActor pid=5991, ip=33.207.59.232) Single data path provided for train, valid & test (MegatronActor pid=5991, ip=33.207.59.232) > dataset split: (MegatronActor pid=5991, ip=33.207.59.232) loading dataset (MegatronActor pid=12878, ip=33.207.59.232) data_prefix is ['/ML-A100/team/infra/jiangcheng/dataset/rm-static'] (MegatronActor pid=12878, ip=33.207.59.232) loading dataset (MegatronActor pid=12879, ip=33.207.59.232) data_prefix is ['/ML-A100/team/infra/jiangcheng/dataset/rm-static'] (MegatronActor pid=12879, ip=33.207.59.232) loading dataset (MegatronActor pid=12880, ip=33.207.59.232) data_prefix is ['/ML-A100/team/infra/jiangcheng/dataset/rm-static'] (MegatronActor pid=12880, ip=33.207.59.232) loading dataset (MegatronActor pid=5991, ip=33.207.59.232) dataset loaded (MegatronActor pid=5991, ip=33.207.59.232) > finished creating GPT datasets ... 
(MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Using network IBext (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO comm 0x562a1bd1f1c0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId d000 commId 0xe2560cf5213aefd7 - Init START (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 00/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 01/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 02/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 03/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 04/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 05/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 06/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 07/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 08/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 09/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 10/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 11/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO 
Channel 12/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 13/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 14/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 15/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 16/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 17/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 18/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 19/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 20/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 21/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 22/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 23/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 24/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 25/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 26/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 27/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 28/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] 
NCCL INFO Channel 29/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 30/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Channel 31/32 : 0 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO P2P Chunksize set to 131072 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Connected all rings (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO Connected all trees (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO 32 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:13778 [0] NCCL INFO comm 0x562a1bd1f1c0 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId d000 commId 0xe2560cf5213aefd7 - Init COMPLETE (MegatronActor pid=5991, ip=33.207.59.232) > building train, validation, and test datasets ... (MegatronActor pid=5991, ip=33.207.59.232) done with setup ... 
(MegatronActor pid=12878, ip=33.207.59.232) dataset loaded (MegatronActor pid=12878, ip=33.207.59.232) NCCL version 2.18.3+cuda12.1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Using network IBext (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO comm 0x56173a4a4240 rank 0 nranks 1 cudaDev 0 nvmlDev 1 busId 13000 commId 0x6b7d71b64c4a1d70 - Init START (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 00/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 01/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 02/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 03/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 04/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 05/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 06/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 07/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 08/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 09/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 10/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) 
t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 11/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 12/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 13/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 14/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 15/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 16/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 17/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 18/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 19/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 20/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 21/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 22/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 23/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 24/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 25/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 26/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 27/32 : 0 
(MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 28/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 29/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 30/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Channel 31/32 : 0 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO P2P Chunksize set to 131072 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Connected all rings (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO Connected all trees (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO 32 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:13782 [0] NCCL INFO comm 0x56173a4a4240 rank 0 nranks 1 cudaDev 0 nvmlDev 1 busId 13000 commId 0x6b7d71b64c4a1d70 
- Init COMPLETE (MegatronActor pid=12879, ip=33.207.59.232) dataset loaded (MegatronActor pid=12879, ip=33.207.59.232) NCCL version 2.18.3+cuda12.1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Using network IBext (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO comm 0x55a886cca200 rank 0 nranks 1 cudaDev 0 nvmlDev 2 busId 29000 commId 0x9e0815345fc88f1b - Init START (MegatronActor pid=12880, ip=33.207.59.232) dataset loaded (MegatronActor pid=12880, ip=33.207.59.232) NCCL version 2.18.3+cuda12.1 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Using network IBext (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO comm 0x55b7ffabf1c0 rank 0 nranks 1 cudaDev 0 nvmlDev 3 busId 2d000 commId 0xf0602ab21a3d90c7 - Init START (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 00/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 01/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 02/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 03/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 04/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 05/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 06/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) 
t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 07/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 08/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 09/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 10/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 11/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 12/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 13/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 14/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 15/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 16/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 17/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 18/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 19/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 20/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 21/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 22/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 23/32 : 0 
(MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 24/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 25/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 26/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 27/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 28/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 29/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 30/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Channel 31/32 : 0 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO P2P Chunksize set to 131072 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Connected all rings (MegatronActor 
pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO Connected all trees (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO 32 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer (MegatronActor pid=5991, ip=33.207.59.232) entering train() (MegatronActor pid=5991, ip=33.207.59.232) [before the start of training step] datetime: 2024-07-25 16:20:08 (MegatronActor pid=12878, ip=33.207.59.232) entering train() (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 00/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 01/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 02/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 03/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 04/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 05/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 06/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 07/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 08/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 09/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 10/32 : 0 
(MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 11/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 12/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 13/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 14/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 15/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 16/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 17/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 18/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 19/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) [W ProcessGroupNCCL.cpp:1658] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) (MegatronActor pid=12880, ip=33.207.59.232) [W ProcessGroupNCCL.cpp:1658] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) (MegatronActor pid=5991, ip=33.207.59.232) /ML-A100/team/infra/zhaodong/code/bob/test/Megatron-LM/megatron/core/transformer/transformer_layer.py:237: UserWarning: operator() profile_node %21 : int = prim::profile_ivalue(%16) (MegatronActor pid=5991, ip=33.207.59.232) does not have profile information (Triggered internally at /opt/pytorch/pytorch/third_party/nvfuser/csrc/graph_fuser.cpp:104.) 
(MegatronActor pid=5991, ip=33.207.59.232) hidden_states = self.mlp_bda(self.training, self.config.bias_dropout_fusion)( (MegatronActor pid=12878, ip=33.207.59.232) /ML-A100/team/infra/zhaodong/code/bob/test/Megatron-LM/megatron/core/transformer/transformer_layer.py:237: UserWarning: operator() profile_node %21 : int = prim::profile_ivalue(%16) (MegatronActor pid=12878, ip=33.207.59.232) does not have profile information (Triggered internally at /opt/pytorch/pytorch/third_party/nvfuser/csrc/graph_fuser.cpp:104.) (MegatronActor pid=12878, ip=33.207.59.232) hidden_states = self.mlp_bda(self.training, self.config.bias_dropout_fusion)( (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 20/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 21/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 22/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 23/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 24/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 25/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 26/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 27/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 28/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 29/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 30/32 : 0 (MegatronActor pid=12879, 
ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Channel 31/32 : 0 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO P2P Chunksize set to 131072 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Connected all rings (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO Connected all trees (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO 32 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:13788 [0] NCCL INFO comm 0x55a886cca200 rank 0 nranks 1 cudaDev 0 nvmlDev 2 busId 29000 commId 0x9e0815345fc88f1b - Init COMPLETE (MegatronActor pid=12879, ip=33.207.59.232) entering train() (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:13786 [0] NCCL INFO comm 0x55b7ffabf1c0 rank 0 nranks 1 cudaDev 0 nvmlDev 3 busId 2d000 commId 0xf0602ab21a3d90c7 - Init COMPLETE (MegatronActor pid=12880, ip=33.207.59.232) (min, max) time across ranks (ms): (MegatronActor 
pid=12880, ip=33.207.59.232) model-and-optimizer-setup ......................: (108.81, 110.90) (MegatronActor pid=12880, ip=33.207.59.232) entering train() (MegatronActor pid=5991, ip=33.207.59.232) in generate_tokens_probs_and_return_on_first_stage start is 33 (MegatronActor pid=5991, ip=33.207.59.232) WARNING:megatron.core.models.common.embeddings.rotary_pos_embedding:Setting apply_rope_fusion to false because its implementation is not included in Apex. Try upgrading to the latest version (MegatronActor pid=12878, ip=33.207.59.232) WARNING:megatron.core.models.common.embeddings.rotary_pos_embedding:Setting apply_rope_fusion to false because its implementation is not included in Apex. Try upgrading to the latest version (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Using network IBext (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO comm 0x562a375aef10 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId d000 commId 0x7a02352266bcc377 - Init START (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 00/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 01/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 02/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 03/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 04/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 05/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) 
t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 06/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 07/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 08/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 09/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 10/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 11/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 12/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 13/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 14/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 15/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] -1/-1/-1->0->1 [5] -1/-1/-1->0->1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] -1/-1/-1->0->1 [13] -1/-1/-1->0->1 [14] -1/-1/-1->0->1 [15] -1/-1/-1->0->1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO P2P Chunksize set to 524288 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read 
(MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Channel 15/0 : 
0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO Using network IBext (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO comm 0x561755f6c0d0 rank 1 nranks 2 cudaDev 0 nvmlDev 1 busId 13000 commId 0x7a02352266bcc377 - Init START (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] 0/-1/-1->1->-1 [5] 0/-1/-1->1->-1 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] -1/-1/-1->1->0 [9] -1/-1/-1->1->0 [10] -1/-1/-1->1->0 [11] -1/-1/-1->1->0 [12] 0/-1/-1->1->-1 [13] 0/-1/-1->1->-1 [14] 0/-1/-1->1->-1 [15] 0/-1/-1->1->-1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO P2P Chunksize set to 524288 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) 
t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Connected all rings (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO Connected all trees (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer 
(MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26737 [0] NCCL INFO comm 0x562a375aef10 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId d000 commId 0x7a02352266bcc377 - Init COMPLETE (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO Connected all rings (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO Connected all trees (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26738 [0] NCCL INFO comm 0x561755f6c0d0 rank 1 nranks 2 cudaDev 0 nvmlDev 1 busId 13000 commId 0x7a02352266bcc377 - Init COMPLETE (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Using network IBext (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO comm 0x562a375c5c60 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId d000 commId 0x1020cef89e7d415d - Init START (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 00/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 01/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 02/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 03/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) 
t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 04/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 05/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 06/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 07/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 08/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 09/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 10/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 11/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 12/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 13/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 14/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 15/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] -1/-1/-1->0->1 [5] -1/-1/-1->0->1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] -1/-1/-1->0->1 [13] -1/-1/-1->0->1 [14] -1/-1/-1->0->1 [15] -1/-1/-1->0->1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO P2P Chunksize set to 524288 (MegatronActor pid=5991, ip=33.207.59.232) 
t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/IPC/read 
(MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Connected all rings (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO Connected all trees (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26744 [0] NCCL INFO comm 0x562a375c5c60 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId d000 commId 0x1020cef89e7d415d - Init COMPLETE (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO Using network IBext (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO comm 0x561755f82c60 rank 1 nranks 2 cudaDev 0 nvmlDev 1 busId 13000 commId 0x1020cef89e7d415d - Init START (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] 0/-1/-1->1->-1 [5] 0/-1/-1->1->-1 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] -1/-1/-1->1->0 [9] -1/-1/-1->1->0 [10] -1/-1/-1->1->0 [11] -1/-1/-1->1->0 [12] 0/-1/-1->1->-1 [13] 0/-1/-1->1->-1 [14] 0/-1/-1->1->-1 [15] 0/-1/-1->1->-1 (MegatronActor pid=12878, 
ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO P2P Chunksize set to 524288 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) /usr/local/lib/python3.10/dist-packages/grouped_gemm/ops.py:92: UserWarning: The data type of the input indices of permute_topK op is torch.int64! The recommended type is torch.int32. (MegatronActor pid=5991, ip=33.207.59.232) warnings.warn(f"The data type of the input indices of permute_topK op is {indices.dtype}! 
" (MegatronActor pid=12878, ip=33.207.59.232) /usr/local/lib/python3.10/dist-packages/grouped_gemm/ops.py:92: UserWarning: The data type of the input indices of permute_topK op is torch.int64! The recommended type is torch.int32. (MegatronActor pid=12878, ip=33.207.59.232) warnings.warn(f"The data type of the input indices of permute_topK op is {indices.dtype}! " (MegatronActor pid=5991, ip=33.207.59.232) /usr/local/lib/python3.10/dist-packages/grouped_gemm/ops.py:196: UserWarning: The data type of the input probs of unpermute_topK op is torch.bfloat16! The recommended type is torch.float32. (MegatronActor pid=5991, ip=33.207.59.232) warnings.warn(f"The data type of the input probs of unpermute_topK op is {probs.dtype}! " (MegatronActor pid=12878, ip=33.207.59.232) /usr/local/lib/python3.10/dist-packages/grouped_gemm/ops.py:196: UserWarning: The data type of the input probs of unpermute_topK op is torch.bfloat16! The recommended type is torch.float32. (MegatronActor pid=12878, ip=33.207.59.232) warnings.warn(f"The data type of the input probs of unpermute_topK op is {probs.dtype}! 
" (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO Connected all rings (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO Connected all trees (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26745 [0] NCCL INFO comm 0x561755f82c60 rank 1 nranks 2 cudaDev 0 nvmlDev 1 busId 13000 commId 0x1020cef89e7d415d - Init COMPLETE (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26750 [0] NCCL INFO Channel 00/1 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26750 [0] NCCL INFO Channel 01/1 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) 
t-20240725161813-n7hxb-worker-1:5991:26750 [0] NCCL INFO Channel 02/1 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26750 [0] NCCL INFO Channel 03/1 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26750 [0] NCCL INFO Channel 04/1 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26750 [0] NCCL INFO Channel 05/1 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26750 [0] NCCL INFO Channel 06/1 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26750 [0] NCCL INFO Channel 07/1 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26750 [0] NCCL INFO Channel 08/1 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26750 [0] NCCL INFO Channel 09/1 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26750 [0] NCCL INFO Channel 10/1 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26750 [0] NCCL INFO Channel 11/1 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26750 [0] NCCL INFO Channel 12/1 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26750 [0] NCCL INFO Channel 13/1 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26750 [0] NCCL INFO Channel 14/1 : 0[0] -> 1[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26750 [0] NCCL INFO Channel 15/1 : 0[0] -> 1[1] via P2P/IPC/read 
(MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26751 [0] NCCL INFO Channel 00/1 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26751 [0] NCCL INFO Channel 01/1 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26751 [0] NCCL INFO Channel 02/1 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26751 [0] NCCL INFO Channel 03/1 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26751 [0] NCCL INFO Channel 04/1 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26751 [0] NCCL INFO Channel 05/1 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26751 [0] NCCL INFO Channel 06/1 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26751 [0] NCCL INFO Channel 07/1 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26751 [0] NCCL INFO Channel 08/1 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26751 [0] NCCL INFO Channel 09/1 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26751 [0] NCCL INFO Channel 10/1 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26751 [0] NCCL INFO Channel 11/1 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26751 [0] NCCL INFO Channel 12/1 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26751 
[0] NCCL INFO Channel 13/1 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26751 [0] NCCL INFO Channel 14/1 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26751 [0] NCCL INFO Channel 15/1 : 1[1] -> 0[0] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26753 [0] NCCL INFO Channel 00/1 : 0[0] -> 2[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26753 [0] NCCL INFO Channel 01/1 : 0[0] -> 2[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26753 [0] NCCL INFO Channel 02/1 : 0[0] -> 2[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26753 [0] NCCL INFO Channel 03/1 : 0[0] -> 2[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26753 [0] NCCL INFO Channel 04/1 : 0[0] -> 2[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26753 [0] NCCL INFO Channel 05/1 : 0[0] -> 2[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26753 [0] NCCL INFO Channel 06/1 : 0[0] -> 2[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26753 [0] NCCL INFO Channel 07/1 : 0[0] -> 2[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26753 [0] NCCL INFO Channel 08/1 : 0[0] -> 2[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) [W ProcessGroupNCCL.cpp:1658] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator()) (MegatronActor pid=12878, ip=33.207.59.232) [W ProcessGroupNCCL.cpp:1658] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. 
(function operator()) (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26753 [0] NCCL INFO Channel 09/1 : 0[0] -> 2[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26753 [0] NCCL INFO Channel 10/1 : 0[0] -> 2[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26753 [0] NCCL INFO Channel 11/1 : 0[0] -> 2[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26753 [0] NCCL INFO Channel 12/1 : 0[0] -> 2[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26753 [0] NCCL INFO Channel 13/1 : 0[0] -> 2[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26753 [0] NCCL INFO Channel 14/1 : 0[0] -> 2[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26753 [0] NCCL INFO Channel 15/1 : 0[0] -> 2[2] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26752 [0] NCCL INFO Channel 00/1 : 1[1] -> 3[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26752 [0] NCCL INFO Channel 01/1 : 1[1] -> 3[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26752 [0] NCCL INFO Channel 02/1 : 1[1] -> 3[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26752 [0] NCCL INFO Channel 03/1 : 1[1] -> 3[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26752 [0] NCCL INFO Channel 04/1 : 1[1] -> 3[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26752 [0] NCCL INFO Channel 05/1 : 1[1] -> 3[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) 
t-20240725161813-n7hxb-worker-1:12878:26752 [0] NCCL INFO Channel 06/1 : 1[1] -> 3[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26752 [0] NCCL INFO Channel 07/1 : 1[1] -> 3[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26752 [0] NCCL INFO Channel 08/1 : 1[1] -> 3[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26752 [0] NCCL INFO Channel 09/1 : 1[1] -> 3[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26752 [0] NCCL INFO Channel 10/1 : 1[1] -> 3[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26752 [0] NCCL INFO Channel 11/1 : 1[1] -> 3[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26752 [0] NCCL INFO Channel 12/1 : 1[1] -> 3[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26752 [0] NCCL INFO Channel 13/1 : 1[1] -> 3[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26752 [0] NCCL INFO Channel 14/1 : 1[1] -> 3[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26752 [0] NCCL INFO Channel 15/1 : 1[1] -> 3[3] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Using network IBext (MegatronActor pid=12879, ip=33.207.59.232) WARNING:megatron.core.models.common.embeddings.rotary_pos_embedding:Setting apply_rope_fusion to false because its implementation is not included in Apex. Try upgrading to the latest version (MegatronActor pid=12880, ip=33.207.59.232) WARNING:megatron.core.models.common.embeddings.rotary_pos_embedding:Setting apply_rope_fusion to false because its implementation is not included in Apex. 
Try upgrading to the latest version (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Using network IBext (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Using network IBext (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO comm 0x55a8a2764520 rank 0 nranks 2 cudaDev 0 nvmlDev 2 busId 29000 commId 0xf1fb37da16d51e60 - Init START (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 00/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 01/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 02/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 03/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 04/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 05/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 06/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 07/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 08/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 09/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 10/16 : 0 1 
(MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 11/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 12/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 13/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 14/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 15/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] -1/-1/-1->0->1 [5] -1/-1/-1->0->1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] -1/-1/-1->0->1 [13] -1/-1/-1->0->1 [14] -1/-1/-1->0->1 [15] -1/-1/-1->0->1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO P2P Chunksize set to 524288 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 00/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 01/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 02/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 03/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 04/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO 
Channel 05/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 06/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 07/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 08/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 09/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 10/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 11/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 12/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 13/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 14/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Channel 15/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Connected all rings (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO Connected all trees (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] 
NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26759 [0] NCCL INFO comm 0x55a8a2764520 rank 0 nranks 2 cudaDev 0 nvmlDev 2 busId 29000 commId 0xf1fb37da16d51e60 - Init COMPLETE (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO Using network IBext (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO comm 0x55b81b271450 rank 1 nranks 2 cudaDev 0 nvmlDev 3 busId 2d000 commId 0xf1fb37da16d51e60 - Init START (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] 0/-1/-1->1->-1 [5] 0/-1/-1->1->-1 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] -1/-1/-1->1->0 [9] -1/-1/-1->1->0 [10] -1/-1/-1->1->0 [11] -1/-1/-1->1->0 [12] 0/-1/-1->1->-1 [13] 0/-1/-1->1->-1 [14] 0/-1/-1->1->-1 [15] 0/-1/-1->1->-1 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO P2P Chunksize set to 524288 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO Channel 00/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO Channel 01/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO Channel 02/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO Channel 03/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) 
t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO Channel 04/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO Channel 05/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO Channel 06/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO Channel 07/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO Channel 08/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO Channel 09/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO Channel 10/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO Channel 11/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO Channel 12/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO Channel 13/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO Channel 14/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO Channel 15/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO Connected all rings (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO Connected all trees (MegatronActor pid=12880, 
ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26760 [0] NCCL INFO comm 0x55b81b271450 rank 1 nranks 2 cudaDev 0 nvmlDev 3 busId 2d000 commId 0xf1fb37da16d51e60 - Init COMPLETE (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO Using network IBext (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO comm 0x55b81b2870d0 rank 1 nranks 2 cudaDev 0 nvmlDev 3 busId 2d000 commId 0xff1c7a4ab47bed4c - Init START (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Using network IBext (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO comm 0x55a8a27790a0 rank 0 nranks 2 cudaDev 0 nvmlDev 2 busId 29000 commId 0xff1c7a4ab47bed4c - Init START (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 00/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 01/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 02/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 03/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 04/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) 
t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 05/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 06/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 07/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 08/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 09/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 10/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 11/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 12/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 13/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 14/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 15/16 : 0 1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] -1/-1/-1->0->1 [5] -1/-1/-1->0->1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] -1/-1/-1->0->1 [13] -1/-1/-1->0->1 [14] -1/-1/-1->0->1 [15] -1/-1/-1->0->1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO P2P Chunksize set to 524288 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 00/0 : 0[2] -> 1[3] via P2P/IPC/read 
(MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 01/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 02/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 03/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 04/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 05/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 06/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 07/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 08/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 09/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 10/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 11/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 12/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 13/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 
[0] NCCL INFO Channel 14/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Channel 15/0 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Connected all rings (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO Connected all trees (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] 0/-1/-1->1->-1 [5] 0/-1/-1->1->-1 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] -1/-1/-1->1->0 [9] -1/-1/-1->1->0 [10] -1/-1/-1->1->0 [11] -1/-1/-1->1->0 [12] 0/-1/-1->1->-1 [13] 0/-1/-1->1->-1 [14] 0/-1/-1->1->-1 [15] 0/-1/-1->1->-1 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO P2P Chunksize set to 524288 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO Channel 00/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO Channel 01/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO Channel 02/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) 
t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO Channel 03/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO Channel 04/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO Channel 05/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO Channel 06/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO Channel 07/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO Channel 08/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO Channel 09/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) /usr/local/lib/python3.10/dist-packages/grouped_gemm/ops.py:92: UserWarning: The data type of the input indices of permute_topK op is torch.int64! The recommended type is torch.int32. (MegatronActor pid=12879, ip=33.207.59.232) warnings.warn(f"The data type of the input indices of permute_topK op is {indices.dtype}! " (MegatronActor pid=12880, ip=33.207.59.232) /usr/local/lib/python3.10/dist-packages/grouped_gemm/ops.py:92: UserWarning: The data type of the input indices of permute_topK op is torch.int64! The recommended type is torch.int32. (MegatronActor pid=12880, ip=33.207.59.232) warnings.warn(f"The data type of the input indices of permute_topK op is {indices.dtype}! " (MegatronActor pid=12879, ip=33.207.59.232) /usr/local/lib/python3.10/dist-packages/grouped_gemm/ops.py:196: UserWarning: The data type of the input probs of unpermute_topK op is torch.bfloat16! The recommended type is torch.float32. 
(MegatronActor pid=12879, ip=33.207.59.232) warnings.warn(f"The data type of the input probs of unpermute_topK op is {probs.dtype}! " (MegatronActor pid=12880, ip=33.207.59.232) /usr/local/lib/python3.10/dist-packages/grouped_gemm/ops.py:196: UserWarning: The data type of the input probs of unpermute_topK op is torch.bfloat16! The recommended type is torch.float32. (MegatronActor pid=12880, ip=33.207.59.232) warnings.warn(f"The data type of the input probs of unpermute_topK op is {probs.dtype}! " (MegatronActor pid=12879, ip=33.207.59.232) /ML-A100/team/infra/zhaodong/code/bob/test/Megatron-LM/megatron/core/transformer/transformer_layer.py:237: UserWarning: operator() profile_node %21 : int = prim::profile_ivalue(%16) (MegatronActor pid=12879, ip=33.207.59.232) does not have profile information (Triggered internally at /opt/pytorch/pytorch/third_party/nvfuser/csrc/graph_fuser.cpp:104.) (MegatronActor pid=12879, ip=33.207.59.232) hidden_states = self.mlp_bda(self.training, self.config.bias_dropout_fusion)( (MegatronActor pid=12880, ip=33.207.59.232) /ML-A100/team/infra/zhaodong/code/bob/test/Megatron-LM/megatron/core/transformer/transformer_layer.py:237: UserWarning: operator() profile_node %21 : int = prim::profile_ivalue(%16) (MegatronActor pid=12880, ip=33.207.59.232) does not have profile information (Triggered internally at /opt/pytorch/pytorch/third_party/nvfuser/csrc/graph_fuser.cpp:104.) 
(MegatronActor pid=12880, ip=33.207.59.232) hidden_states = self.mlp_bda(self.training, self.config.bias_dropout_fusion)( (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO Channel 10/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO Channel 11/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO Channel 12/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO Channel 13/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO Channel 14/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO Channel 15/0 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO Connected all rings (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO Connected all trees (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26766 [0] NCCL INFO comm 0x55a8a27790a0 rank 0 nranks 2 cudaDev 0 nvmlDev 2 busId 29000 commId 0xff1c7a4ab47bed4c - Init COMPLETE (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26772 [0] NCCL INFO Channel 00/1 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) 
t-20240725161813-n7hxb-worker-1:12879:26772 [0] NCCL INFO Channel 01/1 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26772 [0] NCCL INFO Channel 02/1 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26772 [0] NCCL INFO Channel 03/1 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26772 [0] NCCL INFO Channel 04/1 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26772 [0] NCCL INFO Channel 05/1 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26772 [0] NCCL INFO Channel 06/1 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26772 [0] NCCL INFO Channel 07/1 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26772 [0] NCCL INFO Channel 08/1 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26772 [0] NCCL INFO Channel 09/1 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26772 [0] NCCL INFO Channel 10/1 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26772 [0] NCCL INFO Channel 11/1 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26772 [0] NCCL INFO Channel 12/1 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26772 [0] NCCL INFO Channel 13/1 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26772 [0] NCCL INFO Channel 14/1 : 0[2] -> 1[3] 
via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26772 [0] NCCL INFO Channel 15/1 : 0[2] -> 1[3] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26767 [0] NCCL INFO comm 0x55b81b2870d0 rank 1 nranks 2 cudaDev 0 nvmlDev 3 busId 2d000 commId 0xff1c7a4ab47bed4c - Init COMPLETE (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26773 [0] NCCL INFO Channel 00/1 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26773 [0] NCCL INFO Channel 01/1 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26773 [0] NCCL INFO Channel 02/1 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26773 [0] NCCL INFO Channel 03/1 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26773 [0] NCCL INFO Channel 04/1 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26773 [0] NCCL INFO Channel 05/1 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26773 [0] NCCL INFO Channel 06/1 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26773 [0] NCCL INFO Channel 07/1 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26773 [0] NCCL INFO Channel 08/1 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26773 [0] NCCL INFO Channel 09/1 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26773 [0] NCCL INFO Channel 10/1 : 1[3] -> 0[2] via P2P/IPC/read 
(MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26773 [0] NCCL INFO Channel 11/1 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26773 [0] NCCL INFO Channel 12/1 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26773 [0] NCCL INFO Channel 13/1 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26773 [0] NCCL INFO Channel 14/1 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26773 [0] NCCL INFO Channel 15/1 : 1[3] -> 0[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO comm 0x562a3b83d400 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId d000 commId 0xa45ec5e128be0288 - Init START (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO comm 0x56175a63fc20 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 13000 commId 0x725f57147a25777d - Init START (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO Using network IBext (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO comm 0x55a8a5a28380 rank 1 nranks 2 cudaDev 0 nvmlDev 2 busId 29000 commId 0xa45ec5e128be0288 - Init START (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO Using network IBext (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO comm 0x55b81fc464b0 rank 1 nranks 2 cudaDev 0 nvmlDev 3 busId 2d000 commId 0x725f57147a25777d - Init START (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff (MegatronActor pid=5991, 
ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 00/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 01/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 02/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 03/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 04/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 05/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 06/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 07/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 08/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 09/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 10/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 11/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 12/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 13/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 14/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 15/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO 
Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] -1/-1/-1->0->1 [5] -1/-1/-1->0->1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] -1/-1/-1->0->1 [13] -1/-1/-1->0->1 [14] -1/-1/-1->0->1 [15] -1/-1/-1->0->1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO P2P Chunksize set to 524288 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO 
Channel 10/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 00/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 01/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 02/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 03/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 04/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 05/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 06/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 07/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] 
NCCL INFO Channel 08/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 09/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 10/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 11/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 12/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 13/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 14/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 15/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] -1/-1/-1->0->1 [5] -1/-1/-1->0->1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] -1/-1/-1->0->1 [13] -1/-1/-1->0->1 [14] -1/-1/-1->0->1 [15] -1/-1/-1->0->1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO P2P Chunksize set to 524288 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 00/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 01/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 02/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 03/0 : 0[1] -> 
1[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 04/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 05/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 06/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 07/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 08/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 09/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 10/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 11/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 12/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 13/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 14/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Channel 15/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff (MegatronActor pid=12879, ip=33.207.59.232) 
t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] 0/-1/-1->1->-1 [5] 0/-1/-1->1->-1 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] -1/-1/-1->1->0 [9] -1/-1/-1->1->0 [10] -1/-1/-1->1->0 [11] -1/-1/-1->1->0 [12] 0/-1/-1->1->-1 [13] 0/-1/-1->1->-1 [14] 0/-1/-1->1->-1 [15] 0/-1/-1->1->-1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO P2P Chunksize set to 524288 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO Channel 00/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO Channel 01/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO Channel 02/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO Channel 03/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO Channel 04/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO Channel 05/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO Channel 06/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO Channel 07/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO Channel 08/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO Channel 09/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12879, 
ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO Channel 10/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO Channel 11/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO Channel 12/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO Channel 13/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO Channel 14/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO Channel 15/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] 0/-1/-1->1->-1 [5] 0/-1/-1->1->-1 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] -1/-1/-1->1->0 [9] -1/-1/-1->1->0 [10] -1/-1/-1->1->0 [11] -1/-1/-1->1->0 [12] 0/-1/-1->1->-1 [13] 0/-1/-1->1->-1 [14] 0/-1/-1->1->-1 [15] 0/-1/-1->1->-1 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO P2P Chunksize set to 524288 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO Channel 00/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO Channel 01/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO Channel 02/0 : 1[3] -> 0[1] via 
P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO Channel 03/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO Channel 04/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO Channel 05/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO Channel 06/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO Channel 07/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO Channel 08/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO Channel 09/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO Channel 10/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO Channel 11/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO Channel 12/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO Channel 13/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO Channel 14/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO Channel 15/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) 
t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Connected all rings (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO Connected all trees (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO Connected all rings (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO Connected all trees (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26757 [0] NCCL INFO comm 0x562a3b83d400 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId d000 commId 0xa45ec5e128be0288 - Init COMPLETE (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Using network IBext (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO comm 0x562a3a942530 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId d000 commId 0x528285140b14295e - Init START (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Connected all rings (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO Connected all trees (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO threadThresholds 8/8/64 | 
16/8/64 | 512 | 512 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26755 [0] NCCL INFO comm 0x56175a63fc20 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 13000 commId 0x725f57147a25777d - Init COMPLETE (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Using network IBext (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO comm 0x56175a643210 rank 0 nranks 2 cudaDev 0 nvmlDev 1 busId 13000 commId 0x6f85f68e0e8d2faf - Init START (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26775 [0] NCCL INFO comm 0x55a8a5a28380 rank 1 nranks 2 cudaDev 0 nvmlDev 2 busId 29000 commId 0xa45ec5e128be0288 - Init COMPLETE (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26786 [0] NCCL INFO Using network IBext (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26786 [0] NCCL INFO comm 0x55a8a6b21980 rank 1 nranks 2 cudaDev 0 nvmlDev 2 busId 29000 commId 0x528285140b14295e - Init START (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO Connected all rings (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO Connected all trees (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26774 [0] NCCL INFO comm 0x55b81fc464b0 rank 1 nranks 2 
cudaDev 0 nvmlDev 3 busId 2d000 commId 0x725f57147a25777d - Init COMPLETE (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26789 [0] NCCL INFO Using network IBext (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26789 [0] NCCL INFO comm 0x55b81fdb94b0 rank 1 nranks 2 cudaDev 0 nvmlDev 3 busId 2d000 commId 0x6f85f68e0e8d2faf - Init START (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 00/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 01/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 02/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 03/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 04/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 05/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 06/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 07/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 08/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 09/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 10/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 11/16 : 0 
1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 12/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 13/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 14/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 15/16 : 0 1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] -1/-1/-1->0->1 [5] -1/-1/-1->0->1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] -1/-1/-1->0->1 [13] -1/-1/-1->0->1 [14] -1/-1/-1->0->1 [15] -1/-1/-1->0->1 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO P2P Chunksize set to 524288 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 00/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 01/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 02/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 03/16 : 0 1 (MegatronActor pid=12878, 
ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 04/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 05/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 06/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 07/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 08/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 09/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 10/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 11/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 12/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 13/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 14/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 15/16 : 0 1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] -1/-1/-1->0->1 [5] -1/-1/-1->0->1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] -1/-1/-1->0->1 [13] -1/-1/-1->0->1 [14] -1/-1/-1->0->1 [15] -1/-1/-1->0->1 (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO P2P Chunksize set to 524288 
(MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26786 [0] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26786 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] 0/-1/-1->1->-1 [5] 0/-1/-1->1->-1 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] -1/-1/-1->1->0 [9] -1/-1/-1->1->0 [10] -1/-1/-1->1->0 [11] -1/-1/-1->1->0 [12] 0/-1/-1->1->-1 [13] 0/-1/-1->1->-1 [14] 0/-1/-1->1->-1 [15] 0/-1/-1->1->-1 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26786 [0] NCCL INFO P2P Chunksize set to 524288 (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26786 [0] NCCL INFO Channel 00/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26786 [0] NCCL INFO Channel 01/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26789 [0] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26789 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] 0/-1/-1->1->-1 [5] 0/-1/-1->1->-1 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] -1/-1/-1->1->0 [9] -1/-1/-1->1->0 [10] -1/-1/-1->1->0 [11] -1/-1/-1->1->0 [12] 0/-1/-1->1->-1 [13] 0/-1/-1->1->-1 [14] 0/-1/-1->1->-1 [15] 0/-1/-1->1->-1 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26789 [0] NCCL INFO P2P Chunksize set to 524288 (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26789 [0] NCCL INFO Channel 00/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[2] via P2P/IPC/read 
(MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 00/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 01/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 02/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26786 [0] NCCL INFO Channel 02/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26786 [0] NCCL INFO Channel 03/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26789 [0] NCCL INFO Channel 01/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26789 [0] NCCL INFO Channel 02/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 03/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26786 [0] NCCL INFO Channel 04/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26786 [0] NCCL INFO Channel 05/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26789 [0] 
NCCL INFO Channel 03/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26789 [0] NCCL INFO Channel 04/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 04/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 05/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26786 [0] NCCL INFO Channel 06/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26786 [0] NCCL INFO Channel 07/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26789 [0] NCCL INFO Channel 05/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26789 [0] NCCL INFO Channel 06/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 06/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 07/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, 
ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26786 [0] NCCL INFO Channel 08/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26786 [0] NCCL INFO Channel 09/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26789 [0] NCCL INFO Channel 07/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26789 [0] NCCL INFO Channel 08/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 08/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 09/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26786 [0] NCCL INFO Channel 10/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26786 [0] NCCL INFO Channel 11/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26789 [0] NCCL INFO Channel 09/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 10/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 11/0 : 
0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26786 [0] NCCL INFO Channel 12/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26786 [0] NCCL INFO Channel 13/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26789 [0] NCCL INFO Channel 10/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26789 [0] NCCL INFO Channel 11/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 12/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 13/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26786 [0] NCCL INFO Channel 14/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12879, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12879:26786 [0] NCCL INFO Channel 15/0 : 1[2] -> 0[0] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26789 [0] NCCL INFO Channel 12/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26789 [0] NCCL INFO Channel 13/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[2] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) 
t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 14/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Channel 15/0 : 0[1] -> 1[3] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26789 [0] NCCL INFO Channel 14/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=12880, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12880:26789 [0] NCCL INFO Channel 15/0 : 1[3] -> 0[1] via P2P/IPC/read (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Connected all rings (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO Connected all trees (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer (MegatronActor pid=5991, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:5991:26785 [0] NCCL INFO comm 0x562a3a942530 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId d000 commId 0x528285140b14295e - Init COMPLETE (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Connected all rings (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO Connected all trees (MegatronActor pid=12878, ip=33.207.59.232) t-20240725161813-n7hxb-worker-1:12878:26788 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512