You use NCCL in the distributed training. My question is: do you use the NCCL in PyTorch, or do you install NCCL
separately? And how do you set your environment variables? I am quite confused about it. Thanks very much! I run into the following problems when I use two machines to run the code:
1. INFO NET/Plugin : No plugin found (libnccl-net.so)
2. NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:400, unhandled cuda error.
3. NCCL INFO NET/IB : No device found
Actually, we use a Docker container in a cloud environment. The image itself is a self-compiled PyTorch environment with NCCL already installed, so I am not sure how to install it manually. Maybe you could refer to the official NVIDIA documentation. Sorry for the inconvenience.
I have listed the environment variables used in the code in the README.md.
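In case it helps, here is a minimal sketch for checking whether your PyTorch build already ships an NCCL backend and for setting a few common NCCL variables before launching. It is not the setup used in this repo. The helper names `nccl_env` and `describe_nccl`, the interface name `eth0`, and the choice to disable InfiniBand are assumptions for illustration; `NCCL_DEBUG`, `NCCL_SOCKET_IFNAME` and `NCCL_IB_DISABLE` are standard NCCL variables, but the right values depend on your machines.

```python
import os


def nccl_env(ifname="eth0", disable_ib=True):
    """Hypothetical helper: build a dict of NCCL environment variables.

    ifname is an assumed network interface name; check yours with `ip addr`.
    """
    env = {
        "NCCL_DEBUG": "INFO",          # print NCCL's own diagnostics
        "NCCL_SOCKET_IFNAME": ifname,  # interface used for socket transport
    }
    if disable_ib:
        # Skip InfiniBand when the machines have no IB device.
        env["NCCL_IB_DISABLE"] = "1"
    return env


def describe_nccl():
    """Report whether the installed PyTorch build includes an NCCL backend."""
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        return "PyTorch is not installed"
    if not (dist.is_available() and dist.is_nccl_available()):
        return "this PyTorch build has no NCCL backend"
    return f"NCCL bundled with PyTorch: {torch.cuda.nccl.version()}"


if __name__ == "__main__":
    # Set the variables before the process group is initialized.
    os.environ.update(nccl_env())
    print(describe_nccl())
```

The "No plugin found" and "NET/IB : No device found" lines are INFO messages and are usually harmless on machines without InfiniBand. The "unhandled cuda error" is the actual failure, and running with `NCCL_DEBUG=INFO` on both machines should give more detail about where it comes from.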