wangfakang opened this issue 6 months ago
friendly ping @nathanw-mlc @nv-rborkar
We used CUDA_VISIBLE_DEVICES=0,4,2,6,1,5,3,7 to work around a bug in NCCL that caused a NIC port usage conflict at a specific tensor-parallel size. We have fixed the bug, and using the ascending-order mapping should now yield the same performance.
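For reference, a minimal sketch of the two mappings as they would appear in a launch or config script (the permuted order is the workaround described above; the ascending order is what should perform the same once the fix is in place):

```bash
# Workaround ordering used in the submission to avoid the NCCL NIC-port conflict
export CUDA_VISIBLE_DEVICES="0,4,2,6,1,5,3,7"

# Plain ascending ordering, expected to yield the same performance after the NCCL fix
# export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
```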
> We used CUDA_VISIBLE_DEVICES=0,4,2,6,1,5,3,7 to work around a bug in NCCL that caused a NIC port usage conflict at a specific tensor-parallel size. We have fixed the bug, and using the ascending-order mapping should now yield the same performance.
@erhoo82 Thank you for your reply. Can you explain why using CUDA_VISIBLE_DEVICES=0,4,2,6,1,5,3,7 works around it? Or is there a PR for the NCCL fix? Thank you.
@pgmpablo157321 @hiwotadese @nv-rborkar @erhoo82 Are there any updates? And a second question: why do we need to disable the NVLS and CUMEM features?
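(For context, a hedged sketch of what disabling these features typically looks like, assuming the config does it through NCCL's standard environment variables rather than some other mechanism:)

```bash
# Assumption: the config turns the features off via NCCL environment variables.
# NVLS = NVLink SHARP collectives; CUMEM = cuMem-based memory allocation.
export NCCL_NVLS_ENABLE=0
export NCCL_CUMEM_ENABLE=0
```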
Why is the value of CUDA_VISIBLE_DEVICES not configured in ascending order? For example, wouldn't CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 be better suited for PXN?
https://github.com/mlcommons/training_results_v3.1/blob/5b62935b6baecd018180cb3100e65fa90ef7ac98/Azure%2BNVIDIA/benchmarks/gpt3/implementations/pytorch/config_common.sh#L1
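One way to compare the two orderings on your own system (a sketch assuming a standard NCCL setup and that nccl-tests is built; the exact commands and paths are illustrative):

```bash
# Show which GPUs share PCIe switches / NICs, which is what PXN routing cares about
nvidia-smi topo -m

# Run the same collective with both orderings and compare NCCL's reported setup and bandwidth
export NCCL_DEBUG=INFO
CUDA_VISIBLE_DEVICES=0,4,2,6,1,5,3,7 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
```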