microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

DGX-1 / V100-SXM2-32GB - NVLink is not activated (0 activity) #1444

Open - GPTDGX opened this issue 3 years ago

GPTDGX commented 3 years ago

Hi, what are the requirements for NVLink to function? I have two machines. One is a regular PCIe box with 2 x 3090 cards in NVLink - it works well and NVLink shows activity via: nvidia-smi nvlink -gt r. The other is a DGX-1 server where NVLink is not activated by DeepSpeed - it shows activity as N/A, although nvidia-smi topo -m / nvidia-smi nvlink -s show all NVLinks present and ready to go. The DGX-1 server runs bare Ubuntu 20 (not the NVIDIA image); pytorch, cuda, the driver, nvcc, and nccl are all installed, and DeepSpeed is compiled with the 7.0 gpu feature ...

Would appreciate any help - a terminal screen is available if needed.

I am testing GPT-J-6B fine-tuning, using code from this repo:

https://github.com/mallorbc/Finetune_GPTNEO_GPTJ6B

....... nvidia-smi nvlink -gt r :
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-ed46f244-d7b4-5053-89bc-f68119fa49e9)
  Link 0: Raw Tx: N/A
  Link 0: Raw Rx: N/A
  ......
  Link 5: Raw Tx: N/A
  Link 5: Raw Rx: N/A
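For reference, a minimal way to watch those counters around a short run is to snapshot them before and after the job. The sketch below is a hypothetical helper (not from the repo above); it only shells out to the same nvidia-smi nvlink -gt r command shown here:

    # Hypothetical helper: snapshot NVLink raw throughput counters before and after
    # a workload. Counters that stay at N/A or 0 while an NCCL job is running in
    # another shell would suggest NVLink is not being exercised.
    import subprocess, time

    def nvlink_counters():
        out = subprocess.run(["nvidia-smi", "nvlink", "-gt", "r"],
                             capture_output=True, text=True)
        return out.stdout

    before = nvlink_counters()
    time.sleep(30)   # start the DeepSpeed job in another shell during this window
    after = nvlink_counters()
    print(before)
    print(after)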

(gpt) user@user-X9DRG-HF:~/transformers/Finetune_GPTNEO_GPTJ6B/finetuning_repo$ TRANSFORMERS_OFFLINE=1 deepspeed --num_gpus=8 run_clm.py --deepspeed ds_config.json --model_name_or_path gpt2-xl --train_file train.csv --validation_file validation.csv --do_train --do_eval --overwrite_cache --evaluation_strategy="steps" --output_dir finetuned --num_train_epochs 1 --eval_steps 15 --gradient_accumulation_steps 2 --per_device_train_batch_size 1 --use_fast_tokenizer False --learning_rate 5e-06 --warmup_steps 10 --overwrite_output_dir --fp16 [2021-10-10 00:04:17,843] [WARNING] [runner.py:122:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. [2021-10-10 00:04:18,286] [INFO] [runner.py:360:main] cmd = /home/user/gpt/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 run_clm.py --deepspeed ds_config.json --model_name_or_path gpt2-xl --train_file train.csv --validation_file validation.csv --do_train --do_eval --overwrite_cache --evaluation_strategy=steps --output_dir finetuned --num_train_epochs 1 --eval_steps 15 --gradient_accumulation_steps 2 --per_device_train_batch_size 1 --use_fast_tokenizer False --learning_rate 5e-06 --warmup_steps 10 --overwrite_output_dir --fp16 [2021-10-10 00:04:19,198] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]} [2021-10-10 00:04:19,198] [INFO] [launch.py:86:main] nnodes=1, num_local_procs=8, node_rank=0 [2021-10-10 00:04:19,198] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}) [2021-10-10 00:04:19,198] [INFO] [launch.py:102:main] dist_world_size=8 [2021-10-10 00:04:19,198] [INFO] [launch.py:104:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 [2021-10-10 00:04:20,924] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl [2021-10-10 00:04:20,956] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl [2021-10-10 00:04:21,018] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl [2021-10-10 00:04:21,019] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl [2021-10-10 00:04:21,027] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl [2021-10-10 00:04:21,051] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl [2021-10-10 00:04:21,064] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl [2021-10-10 00:04:21,162] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl 10/10/2021 00:04:22 - WARNING - main - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: True 10/10/2021 00:04:22 - WARNING - main - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: True 10/10/2021 00:04:22 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True 10/10/2021 00:04:22 - WARNING - main - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: True 10/10/2021 00:04:22 - WARNING - main - Process rank: 7, device: cuda:7, n_gpu: 1distributed training: True, 16-bits training: True 10/10/2021 00:04:22 - WARNING - main - Process rank: 5, device: cuda:5, n_gpu: 1distributed 
training: True, 16-bits training: True 10/10/2021 00:04:22 - INFO - main - Training/evaluation parameters TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_find_unused_parameters=None, debug=[], deepspeed=ds_config.json, disable_tqdm=False, do_eval=True, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_steps=15, evaluation_strategy=IntervalStrategy.STEPS, fp16=True, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, gradient_accumulation_steps=2, gradient_checkpointing=False, greater_is_better=None, group_by_length=False, hub_model_id=None, hub_strategy=HubStrategy.EVERY_SAVE, hub_token=, ignore_data_skip=False, label_names=None, label_smoothing_factor=0.0, learning_rate=5e-06, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=-1, log_level_replica=-1, log_on_each_node=True, logging_dir=finetuned/runs/Oct10_00-04-20_user-X9DRG-HF, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=500, logging_strategy=IntervalStrategy.STEPS, lr_scheduler_type=SchedulerType.LINEAR, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, no_cuda=False, num_train_epochs=1.0, output_dir=finetuned, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=1, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, remove_unused_columns=True, report_to=['tensorboard'], resume_from_checkpoint=None, run_name=finetuned, save_on_each_node=False, save_steps=500, save_strategy=IntervalStrategy.STEPS, save_total_limit=None, seed=42, sharded_ddp=[], skip_memory_metrics=True, tpu_metrics_debug=False, tpu_num_cores=None, use_legacy_prediction_loop=False, warmup_ratio=0.0, warmup_steps=10, weight_decay=0.0, xpu_backend=None, ) 10/10/2021 00:04:22 - WARNING - main - Process rank: 6, device: cuda:6, n_gpu: 1distributed training: True, 16-bits training: True 10/10/2021 00:04:22 - WARNING - main - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: True 10/10/2021 00:04:22 - WARNING - datasets.builder - Using custom data configuration default-9dc66b5cf9d5f9f5 10/10/2021 00:04:22 - WARNING - datasets.builder - Using custom data configuration default-9dc66b5cf9d5f9f5 10/10/2021 00:04:22 - WARNING - datasets.builder - Reusing dataset csv (/home/user/.cache/huggingface/datasets/csv/default-9dc66b5cf9d5f9f5/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff) 0%| | 0/2 [00:00<?, ?it/s]10/10/2021 00:04:22 - WARNING - datasets.builder - Using custom data configuration default-9dc66b5cf9d5f9f5 10/10/2021 00:04:22 - WARNING - datasets.builder - Using custom data configuration default-9dc66b5cf9d5f9f5 10/10/2021 00:04:22 - WARNING - datasets.builder - Using custom data configuration default-9dc66b5cf9d5f9f5 10/10/2021 00:04:22 - WARNING - datasets.builder - Reusing dataset csv (/home/user/.cache/huggingface/datasets/csv/default-9dc66b5cf9d5f9f5/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff) 10/10/2021 00:04:22 - WARNING - datasets.builder - Using custom data configuration default-9dc66b5cf9d5f9f5 10/10/2021 00:04:22 - WARNING - datasets.builder - Using custom data configuration default-9dc66b5cf9d5f9f5 0%| | 0/2 [00:00<?, ?it/s]10/10/2021 00:04:22 - WARNING - datasets.builder - Using custom data configuration 
default-9dc66b5cf9d5f9f5 10/10/2021 00:04:22 - WARNING - datasets.builder - Reusing dataset csv (/home/user/.cache/huggingface/datasets/csv/default-9dc66b5cf9d5f9f5/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff) 100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 255.47it/s] 100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 434.98it/s] 100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 356.05it/s] 10/10/2021 00:04:23 - WARNING - datasets.builder - Reusing dataset csv (/home/user/.cache/huggingface/datasets/csv/default-9dc66b5cf9d5f9f5/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff) 0%| | 0/2 [00:00<?, ?it/s]10/10/2021 00:04:23 - WARNING - datasets.builder - Reusing dataset csv (/home/user/.cache/huggingface/datasets/csv/default-9dc66b5cf9d5f9f5/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff) 10/10/2021 00:04:23 - WARNING - datasets.builder - Reusing dataset csv (/home/user/.cache/huggingface/datasets/csv/default-9dc66b5cf9d5f9f5/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff) 0%| | 0/2 [00:00<?, ?it/s]10/10/2021 00:04:23 - WARNING - datasets.builder - Reusing dataset csv (/home/user/.cache/huggingface/datasets/csv/default-9dc66b5cf9d5f9f5/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff) 100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 287.18it/s] 100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 299.65it/s] [INFO|configuration_utils.py:531] 2021-10-10 00:04:23,014 >> Offline mode: forcing local_files_only=True [INFO|configuration_utils.py:584] 2021-10-10 00:04:23,015 >> loading configuration file https://huggingface.co/gpt2-xl/resolve/main/config.json from cache at /home/user/.cache/huggingface/transformers/d2de8fec009fa9b9196047559bcac6c1f02a9c500718b4346bc516354965b1ca.d684cb2afa3f8c44c73bd67537d9aa5ff6044658793e077d7306ef2e37dd79bd [INFO|configuration_utils.py:621] 2021-10-10 00:04:23,017 >> Model config GPT2Config { "activation_function": "gelu_new", "architectures": [ "GPT2LMHeadModel" ], "attn_pdrop": 0.1, "bos_token_id": 50256, "embd_pdrop": 0.1, "eos_token_id": 50256, "initializer_range": 0.02, "layer_norm_epsilon": 1e-05, "model_type": "gpt2", "n_ctx": 1024, "n_embd": 1600, "n_head": 25, "n_inner": null, "n_layer": 48, "n_positions": 1024, "output_past": true, "reorder_and_upcast_attn": false, "resid_pdrop": 0.1, "scale_attn_by_inverse_layer_idx": false, "scale_attn_weights": true, "summary_activation": null, "summary_first_dropout": 0.1, "summary_proj_to_labels": true, "summary_type": "cls_index", "summary_use_proj": true, "task_specific_params": { "text-generation": { "do_sample": true, "max_length": 50 } }, "transformers_version": "4.12.0.dev0", "use_cache": true, "vocab_size": 50257 }

[INFO|tokenization_auto.py:310] 2021-10-10 00:04:23,017 >> Offline mode: forcing local_files_only=True [INFO|tokenization_auto.py:334] 2021-10-10 00:04:23,017 >> Could not locate the tokenizer configuration file, will try to use the model config instead. [INFO|configuration_utils.py:531] 2021-10-10 00:04:23,017 >> Offline mode: forcing local_files_only=True [INFO|configuration_utils.py:584] 2021-10-10 00:04:23,018 >> loading configuration file https://huggingface.co/gpt2-xl/resolve/main/config.json from cache at /home/user/.cache/huggingface/transformers/d2de8fec009fa9b9196047559bcac6c1f02a9c500718b4346bc516354965b1ca.d684cb2afa3f8c44c73bd67537d9aa5ff6044658793e077d7306ef2e37dd79bd [INFO|configuration_utils.py:621] 2021-10-10 00:04:23,019 >> Model config GPT2Config { "activation_function": "gelu_new", "architectures": [ "GPT2LMHeadModel" ], "attn_pdrop": 0.1, "bos_token_id": 50256, "embd_pdrop": 0.1, "eos_token_id": 50256, "initializer_range": 0.02, "layer_norm_epsilon": 1e-05, "model_type": "gpt2", "n_ctx": 1024, "n_embd": 1600, "n_head": 25, "n_inner": null, "n_layer": 48, "n_positions": 1024, "output_past": true, "reorder_and_upcast_attn": false, "resid_pdrop": 0.1, "scale_attn_by_inverse_layer_idx": false, "scale_attn_weights": true, "summary_activation": null, "summary_first_dropout": 0.1, "summary_proj_to_labels": true, "summary_type": "cls_index", "summary_use_proj": true, "task_specific_params": { "text-generation": { "do_sample": true, "max_length": 50 } }, "transformers_version": "4.12.0.dev0", "use_cache": true, "vocab_size": 50257 }

100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 183.22it/s] [INFO|tokenization_utils_base.py:1629] 2021-10-10 00:04:23,020 >> Offline mode: forcing local_files_only=True 100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 187.21it/s] [INFO|tokenization_utils_base.py:1717] 2021-10-10 00:04:23,023 >> Can't load following files from cache: ['added_tokens_file', 'special_tokens_map_file', 'tokenizer_config_file'] and cannot check if these files are necessary for the tokenizer to operate. [INFO|tokenization_utils_base.py:1742] 2021-10-10 00:04:23,023 >> loading file https://huggingface.co/gpt2-xl/resolve/main/vocab.json from cache at /home/user/.cache/huggingface/transformers/8560a2df03f812b276794ae6935255d0590522553a4c8103155472b07591a21b.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f [INFO|tokenization_utils_base.py:1742] 2021-10-10 00:04:23,023 >> loading file https://huggingface.co/gpt2-xl/resolve/main/merges.txt from cache at /home/user/.cache/huggingface/transformers/18fe27e0b70062b3e45fc4e827d5449d9fe85875937594da927e48cb657366d1.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b [INFO|tokenization_utils_base.py:1742] 2021-10-10 00:04:23,023 >> loading file https://huggingface.co/gpt2-xl/resolve/main/tokenizer.json from cache at /home/user/.cache/huggingface/transformers/aabb8839163cd911f810ab23f5ae8c966b9b9ea60622c429020611caa389b04b.cf2d0ecb83b6df91b3dbb53f1d1e4c311578bfd3aa0e04934215a49bf9898df0 [INFO|configuration_utils.py:531] 2021-10-10 00:04:23,023 >> Offline mode: forcing local_files_only=True [INFO|configuration_utils.py:584] 2021-10-10 00:04:23,024 >> loading configuration file https://huggingface.co/gpt2-xl/resolve/main/config.json from cache at /home/user/.cache/huggingface/transformers/d2de8fec009fa9b9196047559bcac6c1f02a9c500718b4346bc516354965b1ca.d684cb2afa3f8c44c73bd67537d9aa5ff6044658793e077d7306ef2e37dd79bd [INFO|configuration_utils.py:621] 2021-10-10 00:04:23,025 >> Model config GPT2Config { "activation_function": "gelu_new", "architectures": [ "GPT2LMHeadModel" ], "attn_pdrop": 0.1, "bos_token_id": 50256, "embd_pdrop": 0.1, "eos_token_id": 50256, "initializer_range": 0.02, "layer_norm_epsilon": 1e-05, "model_type": "gpt2", "n_ctx": 1024, "n_embd": 1600, "n_head": 25, "n_inner": null, "n_layer": 48, "n_positions": 1024, "output_past": true, "reorder_and_upcast_attn": false, "resid_pdrop": 0.1, "scale_attn_by_inverse_layer_idx": false, "scale_attn_weights": true, "summary_activation": null, "summary_first_dropout": 0.1, "summary_proj_to_labels": true, "summary_type": "cls_index", "summary_use_proj": true, "task_specific_params": { "text-generation": { "do_sample": true, "max_length": 50 } }, "transformers_version": "4.12.0.dev0", "use_cache": true, "vocab_size": 50257 }

10/10/2021 00:04:23 - WARNING - datasets.builder - Reusing dataset csv (/home/user/.cache/huggingface/datasets/csv/default-9dc66b5cf9d5f9f5/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff) 100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 351.00it/s] [INFO|modeling_utils.py:1225] 2021-10-10 00:04:23,138 >> Offline mode: forcing local_files_only=True [INFO|modeling_utils.py:1324] 2021-10-10 00:04:23,139 >> loading weights file https://huggingface.co/gpt2-xl/resolve/main/pytorch_model.bin from cache at /home/user/.cache/huggingface/transformers/96569b907e56747ce3e593c6a13d8475b8c733a64aab8af8f602b90d94c4af71.8fbbcdf404c82c5967934d411f1462fa0574d639f2aa398aa3754fced1bb26c0 [INFO|modeling_utils.py:1589] 2021-10-10 00:04:45,762 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.

[INFO|modeling_utils.py:1597] 2021-10-10 00:04:45,763 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2-xl. If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training. 100%|███████████████████████████████████████████| 32/32 [00:08<00:00, 3.98ba/s] 100%|███████████████████████████████████████████| 32/32 [00:08<00:00, 3.96ba/s] 100%|███████████████████████████████████████████| 32/32 [00:08<00:00, 3.96ba/s] 100%|███████████████████████████████████████████| 32/32 [00:08<00:00, 3.93ba/s] 100%|███████████████████████████████████████████| 32/32 [00:08<00:00, 3.81ba/s] 100%|███████████████████████████████████████████| 32/32 [00:08<00:00, 3.98ba/s] 100%|███████████████████████████████████████████| 32/32 [00:08<00:00, 3.84ba/s] 100%|███████████████████████████████████████████| 32/32 [00:08<00:00, 3.88ba/s] 100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 4.31ba/s] 100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 4.25ba/s] 100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 4.27ba/s] 100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 4.22ba/s] 100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 4.32ba/s] 100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 4.05ba/s] 100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 4.19ba/s] 100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 4.14ba/s] 100%|███████████████████████████████████████████| 32/32 [00:05<00:00, 6.27ba/s] 100%|███████████████████████████████████████████| 32/32 [00:05<00:00, 6.07ba/s] 100%|███████████████████████████████████████████| 32/32 [00:05<00:00, 6.19ba/s] 100%|███████████████████████████████████████████| 32/32 [00:05<00:00, 6.30ba/s] 100%|███████████████████████████████████████████| 32/32 [00:05<00:00, 6.30ba/s] 100%|███████████████████████████████████████████| 32/32 [00:05<00:00, 6.20ba/s] 100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 6.23ba/s] 100%|███████████████████████████████████████████| 32/32 [00:05<00:00, 6.14ba/s] 100%|███████████████████████████████████████████| 32/32 [00:05<00:00, 5.78ba/s] 100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 5.62ba/s] 100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 5.58ba/s] 100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 4.72ba/s] [INFO|trainer.py:434] 2021-10-10 00:05:05,020 >> Using amp fp16 backend [2021-10-10 00:05:05,026] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.5.5+cd7967d, git-hash=cd7967d, git-branch=master 100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 5.02ba/s] 100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 4.51ba/s] 100%|█████████████████████████████████████████████| 8/8 [00:01<00:00, 4.08ba/s] 100%|█████████████████████████████████████████████| 8/8 [00:02<00:00, 3.88ba/s] [2021-10-10 00:05:14,451] [INFO] [logging.py:68:log_dist] [Rank 0] initializing deepspeed groups [2021-10-10 00:05:14,451] [INFO] [logging.py:68:log_dist] [Rank 0] initializing deepspeed model parallel group with size 1 [2021-10-10 00:05:15,992] [INFO] [logging.py:68:log_dist] [Rank 0] initializing deepspeed expert parallel group with size 1 [2021-10-10 00:05:16,003] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert data parallel process group with ranks: [0, 1, 
2, 3, 4, 5, 6, 7] [2021-10-10 00:05:16,014] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert parallel process group with ranks: [0] [2021-10-10 00:05:16,024] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert parallel process group with ranks: [1] [2021-10-10 00:05:16,025] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert parallel process group with ranks: [2] [2021-10-10 00:05:16,035] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert parallel process group with ranks: [3] [2021-10-10 00:05:16,046] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert parallel process group with ranks: [4] [2021-10-10 00:05:16,057] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert parallel process group with ranks: [5] [2021-10-10 00:05:16,057] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert parallel process group with ranks: [6] [2021-10-10 00:05:16,068] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert parallel process group with ranks: [7] [2021-10-10 00:05:18,134] [INFO] [engine.py:204:init] DeepSpeed Flops Profiler Enabled: False Using /home/user/.cache/torch_extensions as PyTorch extensions root... Using /home/user/.cache/torch_extensions as PyTorch extensions root... Using /home/user/.cache/torch_extensions as PyTorch extensions root... Using /home/user/.cache/torch_extensions as PyTorch extensions root... Using /home/user/.cache/torch_extensions as PyTorch extensions root... Using /home/user/.cache/torch_extensions as PyTorch extensions root... Using /home/user/.cache/torch_extensions as PyTorch extensions root... Using /home/user/.cache/torch_extensions as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /home/user/.cache/torch_extensions/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_adam... Time to load fused_adam op: 0.937584400177002 seconds [2021-10-10 00:05:20,163] [INFO] [engine.py:862:_configure_optimizer] Using DeepSpeed Optimizer param name adamw as basic optimizer Loading extension module fused_adam... Time to load fused_adam op: 0.904491662979126 seconds [2021-10-10 00:05:20,215] [INFO] [engine.py:870:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam [2021-10-10 00:05:20,215] [INFO] [utils.py:43:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'> [2021-10-10 00:05:20,215] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer [2021-10-10 00:05:20,215] [INFO] [stage2.py:111:init] Reduce bucket size 500000000.0 [2021-10-10 00:05:20,215] [INFO] [stage2.py:112:init] Allgather bucket size 500000000.0 [2021-10-10 00:05:20,215] [INFO] [stage2.py:113:init] CPU Offload: False [2021-10-10 00:05:20,215] [INFO] [stage2.py:114:init] Round robin gradient partitioning: False Using /home/user/.cache/torch_extensions as PyTorch extensions root... Loading extension module fused_adam... Time to load fused_adam op: 0.9040207862854004 seconds Using /home/user/.cache/torch_extensions as PyTorch extensions root... Loading extension module fused_adam... Time to load fused_adam op: 0.904207706451416 seconds Loading extension module fused_adam... Loading extension module fused_adam... 
Time to load fused_adam op: 1.0050318241119385 seconds Time to load fused_adam op: 0.9040296077728271 seconds Loading extension module fused_adam... Time to load fused_adam op: 1.0055632591247559 seconds Loading extension module fused_adam... Time to load fused_adam op: 0.9035804271697998 seconds Using /home/user/.cache/torch_extensions as PyTorch extensions root... Using /home/user/.cache/torch_extensions as PyTorch extensions root... Using /home/user/.cache/torch_extensions as PyTorch extensions root... Using /home/user/.cache/torch_extensions as PyTorch extensions root... Using /home/user/.cache/torch_extensions as PyTorch extensions root... Using /home/user/.cache/torch_extensions as PyTorch extensions root... Emitting ninja build file /home/user/.cache/torch_extensions/utils/build.ninja... Building extension module utils... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module utils... Time to load utils op: 0.9010043144226074 seconds Loading extension module utils... Loading extension module utils... Time to load utils op: 0.904000997543335 seconds Time to load utils op: 0.8038058280944824 seconds Loading extension module utils... Loading extension module utils... Time to load utils op: 0.9041271209716797 seconds Time to load utils op: 0.9041614532470703 seconds Loading extension module utils... Time to load utils op: 0.9038200378417969 seconds Loading extension module utils... Loading extension module utils... Time to load utils op: 0.9039947986602783 seconds Time to load utils op: 0.9037423133850098 seconds Rank: 4 partition count [8] and sizes[(194701400, False)] [W ProcessGroupNCCL.cpp:1569] Rank 4 using best-guess GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. Rank: 5 partition count [8] and sizes[(194701400, False)] [W ProcessGroupNCCL.cpp:1569] Rank 5 using best-guess GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. Rank: 0 partition count [8] and sizes[(194701400, False)] [W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. Rank: 3 partition count [8] and sizes[(194701400, False)] [W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. Rank: 6 partition count [8] and sizes[(194701400, False)] [W ProcessGroupNCCL.cpp:1569] Rank 6 using best-guess GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. Rank: 2 partition count [8] and sizes[(194701400, False)] [W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. 
This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. Rank: 1 partition count [8] and sizes[(194701400, False)] [W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. Rank: 7 partition count [8] and sizes[(194701400, False)] [W ProcessGroupNCCL.cpp:1569] Rank 7 using best-guess GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device. Using /home/user/.cache/torch_extensions as PyTorch extensions root... No modifications detected for re-loaded extension module utils, skipping build step... Loading extension module utils... Time to load utils op: 0.0008037090301513672 seconds Using /home/user/.cache/torch_extensions as PyTorch extensions root... No modifications detected for re-loaded extension module utils, skipping build step... Loading extension module utils... Using /home/user/.cache/torch_extensions as PyTorch extensions root... Time to load utils op: 0.0009813308715820312 seconds Using /home/user/.cache/torch_extensions as PyTorch extensions root... No modifications detected for re-loaded extension module utils, skipping build step... Loading extension module utils... Using /home/user/.cache/torch_extensions as PyTorch extensions root... No modifications detected for re-loaded extension module utils, skipping build step... Loading extension module utils... Using /home/user/.cache/torch_extensions as PyTorch extensions root... Time to load utils op: 0.0010094642639160156 seconds No modifications detected for re-loaded extension module utils, skipping build step... Loading extension module utils... Time to load utils op: 0.0010919570922851562 seconds Using /home/user/.cache/torch_extensions as PyTorch extensions root... No modifications detected for re-loaded extension module utils, skipping build step... Loading extension module utils... Time to load utils op: 0.0010976791381835938 seconds Time to load utils op: 0.001089334487915039 secondsNo modifications detected for re-loaded extension module utils, skipping build step...

Loading extension module utils... Time to load utils op: 0.0011730194091796875 seconds [2021-10-10 00:05:29,913] [INFO] [utils.py:806:see_memory_usage] Before initializing optimizer states [2021-10-10 00:05:29,913] [INFO] [utils.py:807:see_memory_usage] MA 3.67 GB Max_MA 4.04 GB CA 7.03 GB Max_CA 7 GB [2021-10-10 00:05:29,914] [INFO] [utils.py:815:see_memory_usage] CPU Virtual Memory: used = 67.33 GB, percent = 13.4% [2021-10-10 00:05:29,958] [INFO] [utils.py:806:see_memory_usage] After initializing optimizer states [2021-10-10 00:05:29,959] [INFO] [utils.py:807:see_memory_usage] MA 5.12 GB Max_MA 5.85 GB CA 9.21 GB Max_CA 9 GB [2021-10-10 00:05:29,959] [INFO] [utils.py:815:see_memory_usage] CPU Virtual Memory: used = 67.33 GB, percent = 13.4% [2021-10-10 00:05:29,959] [INFO] [stage2.py:474:init] optimizer state initialized [2021-10-10 00:05:29,992] [INFO] [utils.py:806:see_memory_usage] After initializing ZeRO optimizer [2021-10-10 00:05:29,992] [INFO] [utils.py:807:see_memory_usage] MA 5.12 GB Max_MA 5.12 GB CA 9.21 GB Max_CA 9 GB [2021-10-10 00:05:29,993] [INFO] [utils.py:815:see_memory_usage] CPU Virtual Memory: used = 67.33 GB, percent = 13.4% [2021-10-10 00:05:29,993] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw [2021-10-10 00:05:29,993] [INFO] [engine.py:586:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = WarmupLR [2021-10-10 00:05:29,993] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7fed99e7b520> [2021-10-10 00:05:29,993] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-06], mom=[[0.9, 0.999]] [2021-10-10 00:05:29,993] [INFO] [config.py:940:print] DeepSpeedEngine configuration: [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] allreduce_always_fp32 ........ False [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] amp_enabled .................. False [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] amp_params ................... False [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] checkpoint_tag_validation_enabled True [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] checkpoint_tag_validation_fail False [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] curriculum_enabled ........... False [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] curriculum_params ............ False [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] dataloader_drop_last ......... False [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] disable_allgather ............ False [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] dump_state ................... False [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1} [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] eigenvalue_enabled ........... 
False [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] eigenvalue_gas_boundary_resolution 1 [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] eigenvalue_layer_name ........ bert.encoder.layer [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] eigenvalue_layer_num ......... 0 [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] eigenvalue_max_iter .......... 100 [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] eigenvalue_stability ......... 1e-06 [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] eigenvalue_tol ............... 0.01 [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] eigenvalue_verbose ........... False [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] elasticity_enabled ........... False [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] flops_profiler_config ........ { "enabled": false, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] fp16_enabled ................. True [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] fp16_master_weights_and_gradients False [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] fp16_mixed_quantize .......... False [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] global_rank .................. 0 [2021-10-10 00:05:29,994] [INFO] [config.py:944:print] gradient_accumulation_steps .. 2 [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] gradient_clipping ............ 1.0 [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] gradient_predivide_factor .... 1.0 [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] initial_dynamic_scale ........ 65536 [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] loss_scale ................... 0 [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] memory_breakdown ............. False [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] optimizer_legacy_fusion ...... False [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] optimizer_name ............... adamw [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] optimizer_params ............. {'lr': 5e-06, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.0} [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] pld_enabled .................. False [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] pld_params ................... False [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] prescale_gradients ........... False [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] quantize_change_rate ......... 0.001 [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] quantize_groups .............. 1 [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] quantize_offset .............. 1000 [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] quantize_period .............. 1000 [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] quantize_rounding ............ 0 [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] quantize_start_bits .......... 16 [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] quantize_target_bits ......... 8 [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] quantize_training_enabled .... False [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] quantize_type ................ 0 [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] quantize_verbose ............. 
False [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] scheduler_name ............... WarmupLR [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 5e-06, 'warmup_num_steps': 10} [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] sparse_attention ............. None [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] sparse_gradients_enabled ..... False [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] steps_per_print .............. 2000 [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] tensorboard_enabled .......... False [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] tensorboard_job_name ......... DeepSpeedJobName [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] tensorboard_output_path ...... [2021-10-10 00:05:29,995] [INFO] [config.py:944:print] train_batch_size ............. 16 [2021-10-10 00:05:29,996] [INFO] [config.py:944:print] train_micro_batch_size_per_gpu 1 [2021-10-10 00:05:29,996] [INFO] [config.py:944:print] use_quantizer_kernel ......... False [2021-10-10 00:05:29,996] [INFO] [config.py:944:print] wall_clock_breakdown ......... False [2021-10-10 00:05:29,996] [INFO] [config.py:944:print] world_size ................... 8 [2021-10-10 00:05:29,996] [INFO] [config.py:944:print] zero_allow_untested_optimizer False [2021-10-10 00:05:29,996] [INFO] [config.py:944:print] zero_config .................. { "stage": 2, "contiguous_gradients": true, "reduce_scatter": true, "reduce_bucket_size": 5.000000e+08, "allgather_partitions": true, "allgather_bucket_size": 5.000000e+08, "overlap_comm": true, "load_from_fp32_weights": true, "elastic_checkpoint": true, "offload_param": null, "offload_optimizer": null, "sub_group_size": 1.000000e+09, "prefetch_bucket_size": 5.000000e+07, "param_persistence_threshold": 1.000000e+05, "max_live_parameters": 1.000000e+09, "max_reuse_distance": 1.000000e+09, "gather_fp16_weights_on_model_save": false, "ignore_unused_parameters": true, "round_robin_gradients": false, "legacy_stage1": false } [2021-10-10 00:05:29,996] [INFO] [config.py:944:print] zero_enabled ................. True [2021-10-10 00:05:29,996] [INFO] [config.py:944:print] zero_optimization_stage ...... 2 [2021-10-10 00:05:29,996] [INFO] [config.py:946:print] json = { "fp16": { "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "optimizer": { "type": "AdamW", "params": { "lr": 5e-06, "betas": [0.9, 0.999], "eps": 1e-08, "weight_decay": 0.0 } }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": 0, "warmup_max_lr": 5e-06, "warmup_num_steps": 10 } }, "zero_optimization": { "stage": 2, "allgather_partitions": true, "allgather_bucket_size": 5.000000e+08, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 5.000000e+08, "contiguous_gradients": true, "cpu_offload": false }, "gradient_accumulation_steps": 2, "gradient_clipping": 1.0, "steps_per_print": 2.000000e+03, "train_batch_size": 16, "train_micro_batch_size_per_gpu": 1, "wall_clock_breakdown": false } Using /home/user/.cache/torch_extensions as PyTorch extensions root... No modifications detected for re-loaded extension module utils, skipping build step... Loading extension module utils... 
Time to load utils op: 0.0005140304565429688 seconds [INFO|trainer.py:1196] 2021-10-10 00:05:29,997 >> Running training [INFO|trainer.py:1197] 2021-10-10 00:05:29,997 >> Num examples = 1081 [INFO|trainer.py:1198] 2021-10-10 00:05:29,997 >> Num Epochs = 1 [INFO|trainer.py:1199] 2021-10-10 00:05:29,997 >> Instantaneous batch size per device = 1 [INFO|trainer.py:1200] 2021-10-10 00:05:29,997 >> Total train batch size (w. parallel, distributed & accumulation) = 16 [INFO|trainer.py:1201] 2021-10-10 00:05:29,997 >> Gradient Accumulation steps = 2 [INFO|trainer.py:1202] 2021-10-10 00:05:29,997 >> Total optimization steps = 68 22%|█████████▍ | 15/68 [00:15<00:53, 1.00s/it][INFO|trainer.py:2243] 2021-10-10 00:05:45,547 >> Running Evaluation [INFO|trainer.py:2245] 2021-10-10 00:05:45,547 >> Num examples = 274 [INFO|trainer.py:2248] 2021-10-10 00:05:45,547 >> Batch size = 8 {'eval_loss': 3.603515625, 'eval_runtime': 4.5688, 'eval_samples_per_second': 59.972, 'eval_steps_per_second': 1.094, 'epoch': 0.22}
44%|██████████████████▉ | 30/68 [00:35<00:38, 1.02s/it][INFO|trainer.py:2243] 2021-10-10 00:06:05,226 >> Running Evaluation [INFO|trainer.py:2245] 2021-10-10 00:06:05,226 >> Num examples = 274 [INFO|trainer.py:2248] 2021-10-10 00:06:05,226 >> Batch size = 8 {'eval_loss': 3.2109375, 'eval_runtime': 4.586, 'eval_samples_per_second': 59.747, 'eval_steps_per_second': 1.09, 'epoch': 0.44}
66%|████████████████████████████▍ | 45/68 [00:54<00:23, 1.02s/it][INFO|trainer.py:2243] 2021-10-10 00:06:24,942 >> Running Evaluation [INFO|trainer.py:2245] 2021-10-10 00:06:24,942 >> Num examples = 274 [INFO|trainer.py:2248] 2021-10-10 00:06:24,942 >> Batch size = 8 {'eval_loss': 3.046875, 'eval_runtime': 4.6037, 'eval_samples_per_second': 59.517, 'eval_steps_per_second': 1.086, 'epoch': 0.66}
88%|█████████████████████████████████████▉ | 60/68 [01:14<00:08, 1.02s/it][INFO|trainer.py:2243] 2021-10-10 00:06:44,684 >> Running Evaluation [INFO|trainer.py:2245] 2021-10-10 00:06:44,684 >> Num examples = 274 [INFO|trainer.py:2248] 2021-10-10 00:06:44,684 >> Batch size = 8 {'eval_loss': 2.966796875, 'eval_runtime': 4.6098, 'eval_samples_per_second': 59.438, 'eval_steps_per_second': 1.085, 'epoch': 0.88}
99%|██████████████████████████████████████████▎| 67/68 [01:26<00:01, 1.17s/it] ... Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 87.3723, 'train_samples_per_second': 12.372, 'train_steps_per_second': 0.778, 'train_loss': 3.9335650275735294, 'epoch': 1.0} 100%|███████████████████████████████████████████| 68/68 [01:27<00:00, 1.28s/it] [INFO|trainer.py:1995] 2021-10-10 00:06:57,382 >> Saving model checkpoint to finetuned [INFO|configuration_utils.py:413] 2021-10-10 00:06:57,383 >> Configuration saved in finetuned/config.json [INFO|modeling_utils.py:1041] 2021-10-10 00:07:12,168 >> Model weights saved in finetuned/pytorch_model.bin [INFO|tokenization_utils_base.py:2034] 2021-10-10 00:07:12,169 >> tokenizer config file saved in finetuned/tokenizer_config.json [INFO|tokenization_utils_base.py:2040] 2021-10-10 00:07:12,170 >> Special tokens file saved in finetuned/special_tokens_map.json train metrics epoch = 1.0 train_loss = 3.9336 train_runtime = 0:01:27.37 train_samples = 1081 train_samples_per_second = 12.372 train_steps_per_second = 0.778 10/10/2021 00:07:12 - INFO - main - Evaluate [INFO|trainer.py:2243] 2021-10-10 00:07:12,291 >> Running Evaluation [INFO|trainer.py:2245] 2021-10-10 00:07:12,291 >> Num examples = 274 [INFO|trainer.py:2248] 2021-10-10 00:07:12,291 >> Batch size = 8 100%|█████████████████████████████████████████████| 5/5 [00:04<00:00, 1.18it/s] eval metrics epoch = 1.0 eval_loss = 2.9434 eval_runtime = 0:00:04.59 eval_samples = 274 eval_samples_per_second = 59.676 eval_steps_per_second = 1.089 perplexity = 18.9795 (gpt) user@user-X9DRG-HF:~/transformers/Finetune_GPTNEO_GPTJ6B/finetuning_repo$

pip install deepspeed
Requirement already satisfied: deepspeed in /home/user/gpt/lib/python3.8/site-packages (0.5.5+cd7967d)
Requirement already satisfied: triton in /home/user/gpt/lib/python3.8/site-packages (from deepspeed) (1.1.0)
Requirement already satisfied: psutil in /home/user/gpt/lib/python3.8/site-packages (from deepspeed) (5.8.0)
Requirement already satisfied: numpy in /home/user/gpt/lib/python3.8/site-packages (from deepspeed) (1.21.2)
Requirement already satisfied: tensorboardX==1.8 in /home/user/gpt/lib/python3.8/site-packages (from deepspeed) (1.8)
Requirement already satisfied: ninja in /home/user/gpt/lib/python3.8/site-packages (from deepspeed) (1.10.2.2)
Requirement already satisfied: tqdm in /home/user/gpt/lib/python3.8/site-packages (from deepspeed) (4.62.3)
Requirement already satisfied: packaging in /home/user/gpt/lib/python3.8/site-packages (from deepspeed) (21.0)
Requirement already satisfied: torch in /home/user/gpt/lib/python3.8/site-packages (from deepspeed) (1.9.1+cu111)
Requirement already satisfied: filelock in /home/user/gpt/lib/python3.8/site-packages (from triton->deepspeed) (3.3.0)
Requirement already satisfied: protobuf>=3.2.0 in /home/user/gpt/lib/python3.8/site-packages (from tensorboardX==1.8->deepspeed) (3.18.1)
Requirement already satisfied: six in /home/user/gpt/lib/python3.8/site-packages (from tensorboardX==1.8->deepspeed) (1.16.0)
Requirement already satisfied: pyparsing>=2.0.2 in /home/user/gpt/lib/python3.8/site-packages (from packaging->deepspeed) (2.4.7)
Requirement already satisfied: typing-extensions in /home/user/gpt/lib/python3.8/site-packages (from torch->deepspeed) (3.10.0.2)

jeffra commented 3 years ago

Hi @GPTDGX, can you try two things and share the output?

1) nvidia-smi topo -m
2) Set NCCL_DEBUG=info and download and run this all-reduce benchmark across your GPUs.
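The benchmark script itself is not reproduced in this thread. As a rough sketch (an assumption of what all_red_bench_v2.py does, not the actual script), such a test boils down to timing torch.distributed.all_reduce on a large tensor under the deepspeed launcher, which passes --local_rank to the script:

    # Rough all-reduce timing sketch (hypothetical bench.py, not the real all_red_bench_v2.py).
    # Launch with: NCCL_DEBUG=info deepspeed bench.py
    import argparse, time
    import torch
    import torch.distributed as dist
    import deepspeed

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)   # injected by the launcher
    args = parser.parse_args()

    deepspeed.init_distributed()           # sets up torch.distributed with the NCCL backend
    torch.cuda.set_device(args.local_rank)

    numel = 256 * 1024 * 1024              # 256M fp32 elements = 1 GiB per rank
    x = torch.ones(numel, device="cuda")

    dist.barrier()
    iters = 10
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.time() - start) / iters

    world = dist.get_world_size()
    algbw = numel * 4 / elapsed / 1e9            # GB/s per all-reduce
    busbw = algbw * 2 * (world - 1) / world      # nccl-tests bus-bandwidth convention
    if dist.get_rank() == 0:
        print(f"size 1 GiB  algbw {algbw:.2f} GB/s  busbw {busbw:.2f} GB/s")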

GPTDGX commented 3 years ago

Hi Jeff, thanks for the help. Here is the info:

(gpt) user@user-X9DRG-HF:~/Downloads$ python all_red_bench_v2.py
[2021-10-21 13:13:02,586] [INFO] [distributed.py:36:init_distributed] Not using the DeepSpeed or torch.distributed launchers, attempting to detect MPI environment...

WARNING: There is at least non-excluded one OpenFabrics device found, but there are no active ports detected (or Open MPI was unable to use them). This is most certainly not what you wanted. Check your cables, subnet manager configuration, etc. The openib BTL will be ignored for this job.

Local host: user-X9DRG-HF

[2021-10-21 13:13:03,486] [INFO] [distributed.py:83:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=192.168.1.128, master_port=29500
[2021-10-21 13:13:03,486] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
0 data size: 1.0 GB tput_avg (Gbps): 285126.90625 busbw_avg (Gbps): 0.0

(gpt) user@user-X9DRG-HF:~/nccl-tests$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  mlx5_3  CPU Affinity    NUMA Affinity
GPU0     X      NV1     NV1     NV2     NV2     SYS     SYS     SYS     PIX     PHB     SYS     SYS     0-19,40-59      0
GPU1    NV1      X      NV2     NV1     SYS     NV2     SYS     SYS     PIX     PHB     SYS     SYS     0-19,40-59      0
GPU2    NV1     NV2      X      NV2     SYS     SYS     NV1     SYS     PHB     PIX     SYS     SYS     0-19,40-59      0
GPU3    NV2     NV1     NV2      X      SYS     SYS     SYS     NV1     PHB     PIX     SYS     SYS     0-19,40-59      0
GPU4    NV2     SYS     SYS     SYS      X      NV1     NV1     NV2     SYS     SYS     PIX     PHB     20-39,60-79     1
GPU5    SYS     NV2     SYS     SYS     NV1      X      NV2     NV1     SYS     SYS     PIX     PHB     20-39,60-79     1
GPU6    SYS     SYS     NV1     SYS     NV1     NV2      X      NV2     SYS     SYS     PHB     PIX     20-39,60-79     1
GPU7    SYS     SYS     SYS     NV1     NV2     NV1     NV2      X      SYS     SYS     PHB     PIX     20-39,60-79     1
mlx5_0  PIX     PIX     PHB     PHB     SYS     SYS     SYS     SYS      X      PHB     SYS     SYS
mlx5_1  PHB     PHB     PIX     PIX     SYS     SYS     SYS     SYS     PHB      X      SYS     SYS
mlx5_2  SYS     SYS     SYS     SYS     PIX     PIX     PHB     PHB     SYS     SYS      X      PHB
mlx5_3  SYS     SYS     SYS     SYS     PHB     PHB     PIX     PIX     SYS     SYS     PHB      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

(gpt) user@user-X9DRG-HF:~/nccl-tests$

GPTDGX commented 3 years ago

... also some nccl-tests output:

(gpt) user@user-X9DRG-HF:~/nccl-tests$ ./build/all_reduce_perf -b 8 -e 1128M -f 2 -g 8

# nThread 1 nGpus 8 minBytes 8 maxBytes 1182793728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid   4550 on user-X9DRG-HF device  0 [0x06] Tesla V100-SXM2-32GB
#   Rank  1 Pid   4550 on user-X9DRG-HF device  1 [0x07] Tesla V100-SXM2-32GB
#   Rank  2 Pid   4550 on user-X9DRG-HF device  2 [0x0a] Tesla V100-SXM2-32GB
#   Rank  3 Pid   4550 on user-X9DRG-HF device  3 [0x0b] Tesla V100-SXM2-32GB
#   Rank  4 Pid   4550 on user-X9DRG-HF device  4 [0x85] Tesla V100-SXM2-32GB
#   Rank  5 Pid   4550 on user-X9DRG-HF device  5 [0x86] Tesla V100-SXM2-32GB
#   Rank  6 Pid   4550 on user-X9DRG-HF device  6 [0x89] Tesla V100-SXM2-32GB
#   Rank  7 Pid   4550 on user-X9DRG-HF device  7 [0x8a] Tesla V100-SXM2-32GB
#
#                                                    out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)

       8             2     float     sum    35.11    0.00    0.00  2e-07    32.69    0.00    0.00  1e-07
      16             4     float     sum    34.98    0.00    0.00  6e-08    33.77    0.00    0.00  6e-08
      32             8     float     sum    33.61    0.00    0.00  6e-08    33.80    0.00    0.00  6e-08
      64            16     float     sum    41.73    0.00    0.00  6e-08    32.58    0.00    0.00  6e-08
     128            32     float     sum    33.72    0.00    0.01  6e-08    32.82    0.00    0.01  6e-08
     256            64     float     sum    33.31    0.01    0.01  6e-08    32.69    0.01    0.01  6e-08
     512           128     float     sum    39.16    0.01    0.02  6e-08    32.68    0.02    0.03  6e-08
    1024           256     float     sum    36.79    0.03    0.05  2e-07    32.94    0.03    0.05  2e-07
    2048           512     float     sum    38.81    0.05    0.09  2e-07    34.48    0.06    0.10  2e-07
    4096          1024     float     sum    34.08    0.12    0.21  2e-07    35.58    0.12    0.20  2e-07
    8192          2048     float     sum    34.01    0.24    0.42  2e-07    33.00    0.25    0.43  2e-07
   16384          4096     float     sum    36.05    0.45    0.80  2e-07    33.22    0.49    0.86  2e-07
   32768          8192     float     sum    35.33    0.93    1.62  2e-07    34.09    0.96    1.68  2e-07
   65536         16384     float     sum    40.29    1.63    2.85  2e-07    36.69    1.79    3.13  2e-07
  131072         32768     float     sum    45.35    2.89    5.06  2e-07    43.89    2.99    5.23  2e-07
  262144         65536     float     sum    54.75    4.79    8.38  5e-07    60.25    4.35    7.61  5e-07
  524288        131072     float     sum    67.39    7.78   13.62  5e-07    69.77    7.51   13.15  5e-07
 1048576        262144     float     sum    84.98   12.34   21.59  5e-07    83.63   12.54   21.94  5e-07
 2097152        524288     float     sum    108.0   19.42   33.99  5e-07    170.9   12.27   21.47  5e-07
 4194304       1048576     float     sum    167.6   25.03   43.81  5e-07    166.7   25.15   44.02  5e-07
 8388608       2097152     float     sum    216.4   38.77   67.84  5e-07    218.8   38.33   67.08  5e-07
16777216       4194304     float     sum    303.1   55.35   96.86  5e-07    308.9   54.32   95.05  5e-07
33554432       8388608     float     sum    504.9   66.45  116.29  5e-07    502.1   66.82  116.94  5e-07
67108864      16777216     float     sum    926.8   72.41  126.72  5e-07    927.8   72.34  126.59  5e-07

   134217728      33554432     float     sum   1793.0   74.86  131.00  5e-07   1793.7   74.83  130.95  5e-07
   268435456      67108864     float     sum   3558.0   75.45  132.03  5e-07   3553.5   75.54  132.20  5e-07
   536870912     134217728     float     sum   7033.7   76.33  133.57  5e-07   7042.9   76.23  133.40  5e-07
  1073741824     268435456     float     sum    14025   76.56  133.98  5e-07    14038   76.49  133.85  5e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 37.979
#

(gpt) user@user-X9DRG-HF:~/nccl-tests$
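As a sanity check on those numbers: for all-reduce, nccl-tests derives the reported bus bandwidth from the algorithm bandwidth with the factor 2*(n-1)/n, so with 8 GPUs the factor is 1.75 and the large-message rows above line up:

    # nccl-tests bus-bandwidth convention for all-reduce: busbw = algbw * 2*(n-1)/n
    n_gpus = 8
    algbw = 76.56                              # GB/s, out-of-place row at 1 GiB above
    busbw = algbw * 2 * (n_gpus - 1) / n_gpus  # factor 1.75 for 8 GPUs
    print(round(busbw, 2))                     # 133.98, matching the reported busbw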

GPTDGX commented 3 years ago

Sorry for the bold/big font - not sure why it comes out that way - I'm pasting directly from Ubuntu...

GPTDGX commented 3 years ago

(gpt) user@user-X9DRG-HF:~/Downloads$ python all_red_bench_v2.py
[2021-10-21 15:39:32,310] [INFO] [distributed.py:36:init_distributed] Not using the DeepSpeed or torch.distributed launchers, attempting to detect MPI environment...

WARNING: There is at least non-excluded one OpenFabrics device found, but there are no active ports detected (or Open MPI was unable to use them). This is most certainly not what you wanted. Check your cables, subnet manager configuration, etc. The openib BTL will be ignored for this job.

Local host: user-X9DRG-HF

[2021-10-21 15:39:33,408] [INFO] [distributed.py:83:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=192.168.1.128, master_port=29500
[2021-10-21 15:39:33,409] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
0 data size: 1.0 GB tput_avg (Gbps): 340758.9375 busbw_avg (Gbps): 0.0

jeffra commented 3 years ago

Gotcha, thanks for this info. It seems you're on something like a DGX-1. nvidia-smi is definitely showing several NVLink paths, so that's good.

Can you run NCCL_DEBUG=info deepspeed all_red_bench_v2.py? We want to launch it across multiple GPUs and see the NCCL logs to find out what devices it's seeing at runtime. The deepspeed launcher will by default run this across all 8 GPUs on your machine. Alternatively, you can use torch.distributed.launch to launch this test across multiple GPUs, but it's a bit more involved than the deepspeed launcher.
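One quick way to summarize what NCCL picked is to count the per-channel "via ..." transport lines in the captured NCCL_DEBUG=info output (a small ad-hoc script, assuming the log was redirected to a file). "P2P/IPC" means direct GPU peer-to-peer, which goes over NVLink where the topology allows it on SXM2 boards, while "SHM" or "NET/Socket" would indicate a fallback path:

    # Count the transport reported for each NCCL channel in a saved NCCL_DEBUG=info log.
    import re, sys
    from collections import Counter

    counts = Counter()
    with open(sys.argv[1]) as f:            # e.g. python count_transports.py nccl.log
        for line in f:
            if "NCCL INFO Channel" in line:
                m = re.search(r"via (\S+)", line)
                if m:
                    counts[m.group(1)] += 1
    print(counts)                           # entries like {'P2P/IPC': ...} indicate peer-to-peer is in use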

GPTDGX commented 3 years ago

...yes, it is a DGX-1 - here is the info...

(gpt) user@user-X9DRG-HF:~$ cd Downloads/ (gpt) user@user-X9DRG-HF:~/Downloads$ NCCL_DEBUG=info deepspeed all_red_bench_v2.py [2021-10-21 18:52:18,014] [WARNING] [runner.py:122:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. [2021-10-21 18:52:18,438] [INFO] [runner.py:360:main] cmd = /home/user/gpt/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 all_red_bench_v2.py [2021-10-21 18:52:19,289] [INFO] [launch.py:73:main] 0 NCCL_DEBUG info [2021-10-21 18:52:19,289] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]} [2021-10-21 18:52:19,290] [INFO] [launch.py:86:main] nnodes=1, num_local_procs=8, node_rank=0 [2021-10-21 18:52:19,290] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}) [2021-10-21 18:52:19,290] [INFO] [launch.py:102:main] dist_world_size=8 [2021-10-21 18:52:19,290] [INFO] [launch.py:104:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 [2021-10-21 18:52:20,448] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl [2021-10-21 18:52:20,460] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl [2021-10-21 18:52:20,460] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl [2021-10-21 18:52:20,460] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl [2021-10-21 18:52:20,500] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl [2021-10-21 18:52:20,500] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl [2021-10-21 18:52:20,504] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl [2021-10-21 18:52:20,504] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl 0 data size: 1.0 GB user-X9DRG-HF:3349:3349 [0] NCCL INFO Bootstrap : Using [0]enp1s0f1:192.168.1.128<0> user-X9DRG-HF:3349:3349 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation user-X9DRG-HF:3349:3349 [0] NCCL INFO NET/IB : No device found. 
user-X9DRG-HF:3349:3349 [0] NCCL INFO NET/Socket : Using [0]enp1s0f1:192.168.1.128<0> user-X9DRG-HF:3349:3349 [0] NCCL INFO Using network Socket NCCL version 2.7.8+cuda11.1 user-X9DRG-HF:3350:3350 [1] NCCL INFO Bootstrap : Using [0]enp1s0f1:192.168.1.128<0> user-X9DRG-HF:3350:3350 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation user-X9DRG-HF:3353:3353 [4] NCCL INFO Bootstrap : Using [0]enp1s0f1:192.168.1.128<0> user-X9DRG-HF:3352:3352 [3] NCCL INFO Bootstrap : Using [0]enp1s0f1:192.168.1.128<0> user-X9DRG-HF:3351:3351 [2] NCCL INFO Bootstrap : Using [0]enp1s0f1:192.168.1.128<0> user-X9DRG-HF:3353:3353 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation user-X9DRG-HF:3351:3351 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation user-X9DRG-HF:3352:3352 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation user-X9DRG-HF:3356:3356 [6] NCCL INFO Bootstrap : Using [0]enp1s0f1:192.168.1.128<0> user-X9DRG-HF:3355:3355 [5] NCCL INFO Bootstrap : Using [0]enp1s0f1:192.168.1.128<0> user-X9DRG-HF:3356:3356 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation user-X9DRG-HF:3355:3355 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation user-X9DRG-HF:3359:3359 [7] NCCL INFO Bootstrap : Using [0]enp1s0f1:192.168.1.128<0> user-X9DRG-HF:3359:3359 [7] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation user-X9DRG-HF:3353:3353 [4] NCCL INFO NET/IB : No device found. user-X9DRG-HF:3353:3353 [4] NCCL INFO NET/Socket : Using [0]enp1s0f1:192.168.1.128<0> user-X9DRG-HF:3353:3353 [4] NCCL INFO Using network Socket user-X9DRG-HF:3352:3352 [3] NCCL INFO NET/IB : No device found. user-X9DRG-HF:3352:3352 [3] NCCL INFO NET/Socket : Using [0]enp1s0f1:192.168.1.128<0> user-X9DRG-HF:3352:3352 [3] NCCL INFO Using network Socket user-X9DRG-HF:3359:3359 [7] NCCL INFO NET/IB : No device found. user-X9DRG-HF:3359:3359 [7] NCCL INFO NET/Socket : Using [0]enp1s0f1:192.168.1.128<0> user-X9DRG-HF:3359:3359 [7] NCCL INFO Using network Socket user-X9DRG-HF:3351:3351 [2] NCCL INFO NET/IB : No device found. user-X9DRG-HF:3351:3351 [2] NCCL INFO NET/Socket : Using [0]enp1s0f1:192.168.1.128<0> user-X9DRG-HF:3351:3351 [2] NCCL INFO Using network Socket user-X9DRG-HF:3350:3350 [1] NCCL INFO NET/IB : No device found. user-X9DRG-HF:3350:3350 [1] NCCL INFO NET/Socket : Using [0]enp1s0f1:192.168.1.128<0> user-X9DRG-HF:3350:3350 [1] NCCL INFO Using network Socket user-X9DRG-HF:3355:3355 [5] NCCL INFO NET/IB : No device found. user-X9DRG-HF:3355:3355 [5] NCCL INFO NET/Socket : Using [0]enp1s0f1:192.168.1.128<0> user-X9DRG-HF:3355:3355 [5] NCCL INFO Using network Socket user-X9DRG-HF:3356:3356 [6] NCCL INFO NET/IB : No device found. 
user-X9DRG-HF:3356:3356 [6] NCCL INFO NET/Socket : Using [0]enp1s0f1:192.168.1.128<0> user-X9DRG-HF:3356:3356 [6] NCCL INFO Using network Socket user-X9DRG-HF:3355:3910 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64 user-X9DRG-HF:3353:3905 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64 user-X9DRG-HF:3355:3910 [5] NCCL INFO Trees [0] 6/-1/-1->5->1|1->5->6/-1/-1 [1] 6/-1/-1->5->1|1->5->6/-1/-1 [2] 1/-1/-1->5->6|6->5->1/-1/-1 [3] 1/-1/-1->5->6|6->5->1/-1/-1 [4] 4/-1/-1->5->7|7->5->4/-1/-1 [5] 7/-1/-1->5->4|4->5->7/-1/-1 [6] 6/-1/-1->5->1|1->5->6/-1/-1 [7] 6/-1/-1->5->1|1->5->6/-1/-1 [8] 1/-1/-1->5->6|6->5->1/-1/-1 [9] 1/-1/-1->5->6|6->5->1/-1/-1 [10] 4/-1/-1->5->7|7->5->4/-1/-1 [11] 7/-1/-1->5->4|4->5->7/-1/-1 user-X9DRG-HF:3355:3910 [5] NCCL INFO Setting affinity for GPU 5 to ffff,f00000ff,fff00000 user-X9DRG-HF:3353:3905 [4] NCCL INFO Trees [0] -1/-1/-1->4->7|7->4->-1/-1/-1 [1] -1/-1/-1->4->7|7->4->-1/-1/-1 [2] 7/-1/-1->4->0|0->4->7/-1/-1 [3] 7/-1/-1->4->0|0->4->7/-1/-1 [4] 6/-1/-1->4->5|5->4->6/-1/-1 [5] 5/-1/-1->4->6|6->4->5/-1/-1 [6] -1/-1/-1->4->7|7->4->-1/-1/-1 [7] -1/-1/-1->4->7|7->4->-1/-1/-1 [8] 7/-1/-1->4->0|0->4->7/-1/-1 [9] 7/-1/-1->4->0|0->4->7/-1/-1 [10] 6/-1/-1->4->5|5->4->6/-1/-1 [11] 5/-1/-1->4->6|6->4->5/-1/-1 user-X9DRG-HF:3353:3905 [4] NCCL INFO Setting affinity for GPU 4 to ffff,f00000ff,fff00000 user-X9DRG-HF:3359:3907 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64 user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 00/12 : 0 3 2 1 5 6 7 4 user-X9DRG-HF:3359:3907 [7] NCCL INFO Trees [0] 4/-1/-1->7->6|6->7->4/-1/-1 [1] 4/-1/-1->7->6|6->7->4/-1/-1 [2] 6/-1/-1->7->4|4->7->6/-1/-1 [3] 6/-1/-1->7->4|4->7->6/-1/-1 [4] 5/-1/-1->7->3|3->7->5/-1/-1 [5] 3/-1/-1->7->5|5->7->3/-1/-1 [6] 4/-1/-1->7->6|6->7->4/-1/-1 [7] 4/-1/-1->7->6|6->7->4/-1/-1 [8] 6/-1/-1->7->4|4->7->6/-1/-1 [9] 6/-1/-1->7->4|4->7->6/-1/-1 [10] 5/-1/-1->7->3|3->7->5/-1/-1 [11] 3/-1/-1->7->5|5->7->3/-1/-1 user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 01/12 : 0 3 2 1 5 6 7 4 user-X9DRG-HF:3359:3907 [7] NCCL INFO Setting affinity for GPU 7 to ffff,f00000ff,fff00000 user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 02/12 : 0 4 7 6 5 1 2 3 user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 03/12 : 0 4 7 6 5 1 2 3 user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 04/12 : 0 1 3 7 5 4 6 2 user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 05/12 : 0 2 6 4 5 7 3 1 user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 06/12 : 0 3 2 1 5 6 7 4 user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 07/12 : 0 3 2 1 5 6 7 4 user-X9DRG-HF:3356:3911 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64 user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 08/12 : 0 4 7 6 5 1 2 3 user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 09/12 : 0 4 7 6 5 1 2 3 user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 10/12 : 0 1 3 7 5 4 6 2 user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 11/12 : 0 2 6 4 5 7 3 1 user-X9DRG-HF:3356:3911 [6] NCCL INFO Trees [0] 7/-1/-1->6->5|5->6->7/-1/-1 [1] 7/-1/-1->6->5|5->6->7/-1/-1 [2] 5/-1/-1->6->7|7->6->5/-1/-1 [3] 5/-1/-1->6->7|7->6->5/-1/-1 [4] 2/-1/-1->6->4|4->6->2/-1/-1 [5] 4/-1/-1->6->2|2->6->4/-1/-1 [6] 7/-1/-1->6->5|5->6->7/-1/-1 [7] 7/-1/-1->6->5|5->6->7/-1/-1 [8] 5/-1/-1->6->7|7->6->5/-1/-1 [9] 5/-1/-1->6->7|7->6->5/-1/-1 [10] 2/-1/-1->6->4|4->6->2/-1/-1 [11] 4/-1/-1->6->2|2->6->4/-1/-1 user-X9DRG-HF:3350:3909 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64 user-X9DRG-HF:3352:3906 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64 user-X9DRG-HF:3356:3911 [6] NCCL INFO Setting affinity for GPU 6 to 
ffff,f00000ff,fff00000 user-X9DRG-HF:3350:3909 [1] NCCL INFO Trees [0] 5/-1/-1->1->2|2->1->5/-1/-1 [1] 5/-1/-1->1->2|2->1->5/-1/-1 [2] 2/-1/-1->1->5|5->1->2/-1/-1 [3] 2/-1/-1->1->5|5->1->2/-1/-1 [4] 3/-1/-1->1->0|0->1->3/-1/-1 [5] -1/-1/-1->1->3|3->1->-1/-1/-1 [6] 5/-1/-1->1->2|2->1->5/-1/-1 [7] 5/-1/-1->1->2|2->1->5/-1/-1 [8] 2/-1/-1->1->5|5->1->2/-1/-1 [9] 2/-1/-1->1->5|5->1->2/-1/-1 [10] 3/-1/-1->1->0|0->1->3/-1/-1 [11] -1/-1/-1->1->3|3->1->-1/-1/-1 user-X9DRG-HF:3349:3904 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64 user-X9DRG-HF:3352:3906 [3] NCCL INFO Trees [0] 2/-1/-1->3->0|0->3->2/-1/-1 [1] 2/-1/-1->3->0|0->3->2/-1/-1 [2] -1/-1/-1->3->2|2->3->-1/-1/-1 [3] -1/-1/-1->3->2|2->3->-1/-1/-1 [4] 7/-1/-1->3->1|1->3->7/-1/-1 [5] 1/-1/-1->3->7|7->3->1/-1/-1 [6] 2/-1/-1->3->0|0->3->2/-1/-1 [7] 2/-1/-1->3->0|0->3->2/-1/-1 [8] -1/-1/-1->3->2|2->3->-1/-1/-1 [9] -1/-1/-1->3->2|2->3->-1/-1/-1 [10] 7/-1/-1->3->1|1->3->7/-1/-1 [11] 1/-1/-1->3->7|7->3->1/-1/-1 user-X9DRG-HF:3351:3908 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64 user-X9DRG-HF:3352:3906 [3] NCCL INFO Setting affinity for GPU 3 to 0fffff00,000fffff user-X9DRG-HF:3350:3909 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff user-X9DRG-HF:3349:3904 [0] NCCL INFO Trees [0] 3/-1/-1->0->-1|-1->0->3/-1/-1 [1] 3/-1/-1->0->-1|-1->0->3/-1/-1 [2] 4/-1/-1->0->-1|-1->0->4/-1/-1 [3] 4/-1/-1->0->-1|-1->0->4/-1/-1 [4] 1/-1/-1->0->-1|-1->0->1/-1/-1 [5] 2/-1/-1->0->-1|-1->0->2/-1/-1 [6] 3/-1/-1->0->-1|-1->0->3/-1/-1 [7] 3/-1/-1->0->-1|-1->0->3/-1/-1 [8] 4/-1/-1->0->-1|-1->0->4/-1/-1 [9] 4/-1/-1->0->-1|-1->0->4/-1/-1 [10] 1/-1/-1->0->-1|-1->0->1/-1/-1 [11] 2/-1/-1->0->-1|-1->0->2/-1/-1 user-X9DRG-HF:3349:3904 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff user-X9DRG-HF:3351:3908 [2] NCCL INFO Trees [0] 1/-1/-1->2->3|3->2->1/-1/-1 [1] 1/-1/-1->2->3|3->2->1/-1/-1 [2] 3/-1/-1->2->1|1->2->3/-1/-1 [3] 3/-1/-1->2->1|1->2->3/-1/-1 [4] -1/-1/-1->2->6|6->2->-1/-1/-1 [5] 6/-1/-1->2->0|0->2->6/-1/-1 [6] 1/-1/-1->2->3|3->2->1/-1/-1 [7] 1/-1/-1->2->3|3->2->1/-1/-1 [8] 3/-1/-1->2->1|1->2->3/-1/-1 [9] 3/-1/-1->2->1|1->2->3/-1/-1 [10] -1/-1/-1->2->6|6->2->-1/-1/-1 [11] 6/-1/-1->2->0|0->2->6/-1/-1 user-X9DRG-HF:3351:3908 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff00,000fffff user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 00 : 5[86000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 00 : 4[85000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 00 : 7[8a000] -> 4[85000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 00 : 6[89000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 00 : 1[7000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] NCCL INFO Channel 00 : 3[b000] -> 2[a000] via P2P/IPC user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 00 : 0[6000] -> 3[b000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 00 : 2[a000] -> 1[7000] via P2P/IPC user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 00 : 4[85000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 00 : 5[86000] -> 1[7000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 00 : 7[8a000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 00 : 6[89000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 00 : 1[7000] -> 2[a000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] NCCL INFO Channel 00 : 3[b000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 00 : 2[a000] -> 3[b000] via P2P/IPC 
user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 01 : 4[85000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 01 : 5[86000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 01 : 0[6000] -> 3[b000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 01 : 7[8a000] -> 4[85000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 01 : 6[89000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 01 : 1[7000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] NCCL INFO Channel 01 : 3[b000] -> 2[a000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 01 : 2[a000] -> 1[7000] via P2P/IPC user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 01 : 4[85000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 01 : 5[86000] -> 1[7000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 01 : 7[8a000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 01 : 6[89000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 01 : 2[a000] -> 3[b000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 01 : 1[7000] -> 2[a000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] NCCL INFO Channel 01 : 3[b000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 02 : 4[85000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 02 : 5[86000] -> 1[7000] via P2P/IPC user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 02 : 0[6000] -> 4[85000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 02 : 7[8a000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 02 : 6[89000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 02 : 2[a000] -> 3[b000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 02 : 1[7000] -> 2[a000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] NCCL INFO Channel 02 : 3[b000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 02 : 4[85000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] NCCL INFO Channel 02 : 3[b000] -> 2[a000] via P2P/IPC user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 02 : 5[86000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 02 : 7[8a000] -> 4[85000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 02 : 6[89000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 02 : 2[a000] -> 1[7000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 02 : 1[7000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 03 : 0[6000] -> 4[85000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] NCCL INFO Channel 03 : 3[b000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 03 : 4[85000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 03 : 5[86000] -> 1[7000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 03 : 6[89000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 03 : 7[8a000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 03 : 1[7000] -> 2[a000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 03 : 2[a000] -> 3[b000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] NCCL INFO Channel 03 : 3[b000] -> 2[a000] via P2P/IPC user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 03 : 4[85000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 03 : 5[86000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 03 : 6[89000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 03 : 7[8a000] -> 
4[85000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 03 : 1[7000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 03 : 2[a000] -> 1[7000] via P2P/IPC user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 04 : 0[6000] -> 1[7000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] NCCL INFO Channel 04 : 3[b000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 04 : 4[85000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 04 : 6[89000] -> 2[a000] via P2P/IPC user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 04 : 5[86000] -> 4[85000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 04 : 7[8a000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 04 : 1[7000] -> 3[b000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 04 : 2[a000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 04 : 4[85000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 04 : 2[a000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] NCCL INFO Channel 04 : 3[b000] -> 1[7000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 04 : 6[89000] -> 4[85000] via P2P/IPC user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 04 : 5[86000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 04 : 7[8a000] -> 3[b000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 04 : 1[7000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 05 : 2[a000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 05 : 4[85000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] NCCL INFO Channel 05 : 3[b000] -> 1[7000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 05 : 6[89000] -> 4[85000] via P2P/IPC user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 05 : 0[6000] -> 2[a000] via P2P/IPC user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 05 : 5[86000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 05 : 7[8a000] -> 3[b000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 05 : 1[7000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 05 : 2[a000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 05 : 4[85000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 05 : 1[7000] -> 3[b000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] NCCL INFO Channel 05 : 3[b000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 05 : 6[89000] -> 2[a000] via P2P/IPC user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 05 : 5[86000] -> 4[85000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 05 : 7[8a000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 06 : 0[6000] -> 3[b000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 06 : 1[7000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 06 : 2[a000] -> 1[7000] via P2P/IPC user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 06 : 4[85000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] NCCL INFO Channel 06 : 3[b000] -> 2[a000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 06 : 6[89000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 06 : 5[86000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 06 : 7[8a000] -> 4[85000] via P2P/IPC user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 06 : 4[85000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 06 : 1[7000] -> 2[a000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 
06 : 2[a000] -> 3[b000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] NCCL INFO Channel 06 : 3[b000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 06 : 6[89000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 06 : 5[86000] -> 1[7000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 06 : 7[8a000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 07 : 4[85000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 07 : 0[6000] -> 3[b000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 07 : 2[a000] -> 1[7000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 07 : 1[7000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] NCCL INFO Channel 07 : 3[b000] -> 2[a000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 07 : 6[89000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 07 : 5[86000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 07 : 7[8a000] -> 4[85000] via P2P/IPC user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 07 : 4[85000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 07 : 2[a000] -> 3[b000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 07 : 1[7000] -> 2[a000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] NCCL INFO Channel 07 : 3[b000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 07 : 6[89000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 07 : 5[86000] -> 1[7000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 07 : 7[8a000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 08 : 4[85000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 08 : 0[6000] -> 4[85000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 08 : 2[a000] -> 3[b000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 08 : 1[7000] -> 2[a000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] NCCL INFO Channel 08 : 3[b000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 08 : 6[89000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 08 : 5[86000] -> 1[7000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 08 : 7[8a000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] NCCL INFO Channel 08 : 3[b000] -> 2[a000] via P2P/IPC user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 08 : 4[85000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 08 : 2[a000] -> 1[7000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 08 : 6[89000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 08 : 5[86000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 08 : 7[8a000] -> 4[85000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 08 : 1[7000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] NCCL INFO Channel 09 : 3[b000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 09 : 0[6000] -> 4[85000] via P2P/IPC user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 09 : 4[85000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 09 : 2[a000] -> 3[b000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 09 : 6[89000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 09 : 5[86000] -> 1[7000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 09 : 7[8a000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 09 : 1[7000] -> 2[a000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] 
NCCL INFO Channel 09 : 3[b000] -> 2[a000] via P2P/IPC user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 09 : 4[85000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 09 : 2[a000] -> 1[7000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 09 : 6[89000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 09 : 5[86000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 09 : 7[8a000] -> 4[85000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 09 : 1[7000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] NCCL INFO Channel 10 : 3[b000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 10 : 0[6000] -> 1[7000] via P2P/IPC user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 10 : 4[85000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 10 : 2[a000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 10 : 6[89000] -> 2[a000] via P2P/IPC user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 10 : 5[86000] -> 4[85000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 10 : 7[8a000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 10 : 1[7000] -> 3[b000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 10 : 2[a000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 10 : 4[85000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 10 : 6[89000] -> 4[85000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] NCCL INFO Channel 10 : 3[b000] -> 1[7000] via P2P/IPC user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 10 : 5[86000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 10 : 7[8a000] -> 3[b000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 10 : 1[7000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 11 : 2[a000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 11 : 4[85000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3349:3904 [0] NCCL INFO Channel 11 : 0[6000] -> 2[a000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 11 : 6[89000] -> 4[85000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] NCCL INFO Channel 11 : 3[b000] -> 1[7000] via P2P/IPC user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 11 : 5[86000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 11 : 7[8a000] -> 3[b000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 11 : 1[7000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3351:3908 [2] NCCL INFO Channel 11 : 2[a000] -> 0[6000] via P2P/IPC user-X9DRG-HF:3350:3909 [1] NCCL INFO Channel 11 : 1[7000] -> 3[b000] via P2P/IPC user-X9DRG-HF:3353:3905 [4] NCCL INFO Channel 11 : 4[85000] -> 6[89000] via P2P/IPC user-X9DRG-HF:3356:3911 [6] NCCL INFO Channel 11 : 6[89000] -> 2[a000] via P2P/IPC user-X9DRG-HF:3352:3906 [3] NCCL INFO Channel 11 : 3[b000] -> 7[8a000] via P2P/IPC user-X9DRG-HF:3355:3910 [5] NCCL INFO Channel 11 : 5[86000] -> 4[85000] via P2P/IPC user-X9DRG-HF:3359:3907 [7] NCCL INFO Channel 11 : 7[8a000] -> 5[86000] via P2P/IPC user-X9DRG-HF:3349:3904 [0] NCCL INFO 12 coll channels, 16 p2p channels, 2 p2p channels per peer user-X9DRG-HF:3349:3904 [0] NCCL INFO comm 0x7f0484002e10 rank 0 nranks 8 cudaDev 0 busId 6000 - Init COMPLETE user-X9DRG-HF:3350:3909 [1] NCCL INFO 12 coll channels, 16 p2p channels, 2 p2p channels per peer user-X9DRG-HF:3349:3349 [0] NCCL INFO Launch mode Parallel user-X9DRG-HF:3350:3909 [1] NCCL INFO comm 0x7f7410002e10 rank 1 nranks 8 cudaDev 1 busId 7000 - Init COMPLETE user-X9DRG-HF:3351:3908 [2] NCCL 
INFO 12 coll channels, 16 p2p channels, 2 p2p channels per peer
user-X9DRG-HF:3353:3905 [4] NCCL INFO 12 coll channels, 16 p2p channels, 2 p2p channels per peer
user-X9DRG-HF:3353:3905 [4] NCCL INFO comm 0x7fe010002e10 rank 4 nranks 8 cudaDev 4 busId 85000 - Init COMPLETE
user-X9DRG-HF:3351:3908 [2] NCCL INFO comm 0x7f8208002e10 rank 2 nranks 8 cudaDev 2 busId a000 - Init COMPLETE
user-X9DRG-HF:3356:3911 [6] NCCL INFO 12 coll channels, 16 p2p channels, 2 p2p channels per peer
user-X9DRG-HF:3356:3911 [6] NCCL INFO comm 0x7facb0002e10 rank 6 nranks 8 cudaDev 6 busId 89000 - Init COMPLETE
user-X9DRG-HF:3352:3906 [3] NCCL INFO 12 coll channels, 16 p2p channels, 2 p2p channels per peer
user-X9DRG-HF:3352:3906 [3] NCCL INFO comm 0x7fcc90002e10 rank 3 nranks 8 cudaDev 3 busId b000 - Init COMPLETE
user-X9DRG-HF:3355:3910 [5] NCCL INFO 12 coll channels, 16 p2p channels, 2 p2p channels per peer
user-X9DRG-HF:3359:3907 [7] NCCL INFO 12 coll channels, 16 p2p channels, 2 p2p channels per peer
user-X9DRG-HF:3355:3910 [5] NCCL INFO comm 0x7f6008002e10 rank 5 nranks 8 cudaDev 5 busId 86000 - Init COMPLETE
user-X9DRG-HF:3359:3907 [7] NCCL INFO comm 0x7fa2b8002e10 rank 7 nranks 8 cudaDev 7 busId 8a000 - Init COMPLETE
tput_avg (Gbps): 1150.963623046875 busbw_avg (Gbps): 1007.0931396484375
(gpt) user@user-X9DRG-HF:~/Downloads$
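As a quick sanity check on those numbers (a sketch only; all_red_bench_v2.py itself is not shown in this thread, so the exact definitions of tput_avg and busbw_avg are assumptions here), the two figures appear to differ by roughly the (N-1)/N factor an all-reduce benchmark would apply for N = 8 ranks:

# hedged sanity check of the benchmark output above (Python)
N = 8                                   # GPUs in the all-reduce
tput_avg_gbps = 1150.963623046875       # reported throughput
busbw_avg_gbps = 1007.0931396484375     # reported bus bandwidth

print(tput_avg_gbps * (N - 1) / N)      # ~1007.09, close to busbw_avg
print(busbw_avg_gbps / 8)               # ~125.9 GB/s implied by the busbw figure

If that bus-bandwidth figure is per GPU (as it is in nccl-tests), ~126 GB/s is far more than a PCIe 3.0 x16 path (~16 GB/s) could deliver, which suggests the "via P2P/IPC" channels in the NCCL log above are in fact riding NVLink even though nvidia-smi nvlink -gt r reports N/A counters.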

jeffra commented 3 years ago

@GPTDGX for the above all-reduce test, did it use NVLink? Can you check via nvidia-smi nvlink -gt r while it runs? If it did not, can you try running the same all-reduce test with the torch launcher (torch.distributed.launch) instead? I am wondering whether there is a strange environment issue when the sub-processes are created.
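For example, something along these lines should be equivalent (assuming all_red_bench_v2.py reads its rank from the --local_rank argument that both launchers pass; the script itself is not shown here):

NCCL_DEBUG=info python -m torch.distributed.launch --nproc_per_node=8 all_red_bench_v2.py

That keeps NCCL, the GPUs, and the script identical and only swaps out the process launcher.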

GPTDGX commented 3 years ago

I did run nvidia-smi nvlink -gt r during the test and it shows zero activity (I have done it before on the other system, so I know how it is supposed to look when there is activity). I will try running with torch.distributed.launch and report back later.
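One easy way to keep an eye on the counters while the test is actually running (a simple suggestion using the standard watch utility, nothing specific to this repo):

# in a second terminal, refresh the raw NVLink counters every second
watch -n 1 nvidia-smi nvlink -gt r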

jeffra commented 3 years ago

I had actually not used this option in nvidia-smi before; it's pretty cool though. I tried it out on one of our machines with NVLink, checking the counters before and after an all-reduce test launched with DeepSpeed across 8 GPUs. It reported roughly the number of bytes transferred that I expected from the test.
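As a rough worked example of that expectation (assuming a ring all-reduce, where each rank sends and receives about 2*(N-1)/N times the buffer size per operation; the iteration count below is an assumption, since the benchmark script is not shown):

# hypothetical back-of-envelope for the counter deltas (Python)
N = 8                               # ranks
size_gb = 1.0                       # all-reduce buffer size from the log above
iters = 10                          # assumed number of timed iterations
per_gpu_gb = 2 * (N - 1) / N * size_gb * iters
print(per_gpu_gb)                   # ~17.5 GB sent (and received) per GPU

Summing the Raw Tx deltas across a GPU's links before and after the run should land in that ballpark, plus some protocol overhead.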

Do let us know if you happen to run with torch.distributed.launch and what the outcome is.
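For anyone trying to reproduce this, a minimal all-reduce timing script in the spirit of the one used above (a sketch only, not the actual all_red_bench_v2.py; the busbw definition follows the usual nccl-tests convention, which may differ from that script's) looks roughly like:

# minimal_all_reduce_bench.py -- launch with either
#   deepspeed minimal_all_reduce_bench.py
#   python -m torch.distributed.launch --nproc_per_node=8 minimal_all_reduce_bench.py
import argparse
import os
import time
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=int(os.getenv("LOCAL_RANK", "0")))
args = parser.parse_args()

# both launchers set MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE in the environment
dist.init_process_group(backend="nccl")
torch.cuda.set_device(args.local_rank)

n = dist.get_world_size()
size_bytes = 1 << 30                                  # 1.0 GB buffer, matching the run above
x = torch.ones(size_bytes // 4, dtype=torch.float32, device="cuda")

for _ in range(5):                                    # warm-up iterations
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 10
start = time.time()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = time.time() - start

if dist.get_rank() == 0:
    algbw_gbps = size_bytes * iters / elapsed * 8 / 1e9
    busbw_gbps = algbw_gbps * 2 * (n - 1) / n         # nccl-tests bus-bandwidth convention
    print(f"algbw (Gbps): {algbw_gbps:.2f}  busbw (Gbps): {busbw_gbps:.2f}")

Checking nvidia-smi nvlink -gt r before and after a run of this (or of the original script) should make it obvious whether the raw counters are moving.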