mlcommons / training_results_v3.1

This repository contains the results and code for the MLPerf™ Training v3.1 benchmark.
https://mlcommons.org/benchmarks/training
Apache License 2.0

some question about CUDA_VISIBLE_DEVICES #11

Open wangfakang opened 1 month ago

wangfakang commented 1 month ago

Why is the value of CUDA_VISIBLE_DEVICES not configured in ascending order? For example, wouldn't CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 be better suited for PXN?

https://github.com/mlcommons/training_results_v3.1/blob/5b62935b6baecd018180cb3100e65fa90ef7ac98/Azure%2BNVIDIA/benchmarks/gpt3/implementations/pytorch/config_common.sh#L1
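
For context, the linked line pins an interleaved device order (the value is quoted in the reply below); a minimal sketch of what such a line does:

```bash
# Sketch of the linked config_common.sh line: pin an interleaved GPU order.
export CUDA_VISIBLE_DEVICES=0,4,2,6,1,5,3,7
# CUDA renumbers devices in list order, so logical device 0 is physical
# GPU 0, logical device 1 is physical GPU 4, logical device 2 is physical
# GPU 2, and so on; PyTorch and NCCL only ever see the logical indices.
```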

wangfakang commented 1 month ago

friendly ping @nathanw-mlc @nv-rborkar

erhoo82 commented 1 month ago

We used CUDA_VISIBLE_DEVICES=0,4,2,6,1,5,3,7 to work around a bug in NCCL that caused a NIC port usage conflict at a specific tensor-parallel size. We have since fixed the bug, and the ascending-order mapping should now yield the same performance.
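
In other words, the interleaved order changes which physical GPU each local rank binds to, presumably shifting the GPU-to-NIC-port pairing that triggered the conflict. A minimal sketch of the two configurations, assuming one 8-GPU node:

```bash
# Workaround ordering used in the v3.1 submission:
export CUDA_VISIBLE_DEVICES=0,4,2,6,1,5,3,7
# local rank:    0 1 2 3 4 5 6 7
# physical GPU:  0 4 2 6 1 5 3 7

# With the NCCL bug fixed, the plain ascending mapping should perform
# the same (per the comment above):
# export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
```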

wangfakang commented 1 month ago

> We used CUDA_VISIBLE_DEVICES=0,4,2,6,1,5,3,7 to work around a bug in NCCL that caused a NIC port usage conflict at a specific tensor-parallel size. We have since fixed the bug, and the ascending-order mapping should now yield the same performance.

@erhoo82 Thank you for your reply. Can you explain why CUDA_VISIBLE_DEVICES=0,4,2,6,1,5,3,7 works around it? Or is there a PR for the NCCL fix? Thank you.

wangfakang commented 1 month ago

@pgmpablo157321 @hiwotadese @nv-rborkar @erhoo82 Any updates? And another question: why do we need to disable the NVLS and CUMEM features?

https://github.com/mlcommons/training_results_v3.1/blob/5b62935b6baecd018180cb3100e65fa90ef7ac98/Azure%2BNVIDIA/benchmarks/gpt3/implementations/pytorch/config_common.sh#L18-L19
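
For reference, NCCL exposes both switches as documented environment variables; a hedged sketch of what the linked lines likely set (the exact values are at the link above):

```bash
# Hedged sketch of the linked config_common.sh lines:
export NCCL_NVLS_ENABLE=0   # disable NVLink SHARP (NVLS) collectives
export NCCL_CUMEM_ENABLE=0  # disable cuMem*-based (CUDA VMM) allocations
```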