microsoft / mscclpp

MSCCL++: A GPU-driven communication stack for scalable AI applications

[Bug] Is there a known bug with `Driver Version: 535.129.03` which causes `MscclppAllReduce3` to hang? #260

Open saeedmaleki opened 8 months ago

saeedmaleki commented 8 months ago

Hi MSCCL++ team,

Do you know whether driver version 535.129.03 has a bug that causes AllReduce3 to time out?

Thanks, --Saeed

Binyang2014 commented 8 months ago

Hmm... we haven't tested with this version. The Azure HPC image uses driver 535.86.10 and doesn't have this issue: https://github.com/Azure/azhpc-images/blob/63e5eaa23de69ccc1c6e6a52dff29037c88e96d4/ubuntu/common/install_nvidiagpudriver.sh#L16-L19

saeedmaleki commented 8 months ago

Thanks @Binyang2014! I'm debugging this issue with NVIDIA.

chhwang commented 7 months ago

Hi @saeedmaleki, is this issue resolved on your end? 535.154.05 is working well in my environment.

saeedmaleki commented 6 months ago

It definitely still happens; I think this is a non-deterministic bug. NVIDIA couldn't reproduce it either, so maybe we could ignore it for now.

chhwang commented 6 months ago

Actually, I can occasionally reproduce this bug. @Binyang2014 @aashaka please be aware.