meijieru / AtomNAS

[ICLR 2020]: 'AtomNAS: Fine-Grained End-to-End Neural Architecture Search'

Distributed training problem #2

Open julycetc opened 4 years ago

julycetc commented 4 years ago

You use NCCL in the distributed training. My question is: do you use the NCCL bundled with PyTorch, or do you install NCCL separately? And how do you set your environment variables? I am quite confused about it. Thanks very much! I hit the following problems when I run the code on two machines:

  1. `INFO NET/Plugin : No plugin found (libnccl-net.so)`
  2. `NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:400, unhandled cuda error`
  3. `NCCL INFO NET/IB : No device found`
meijieru commented 4 years ago
  1. Actually we use a Docker container in a cloud environment. The container is a self-compiled PyTorch environment with NCCL already installed, so I am not sure how to install it manually. Maybe you could refer to the official documentation from NVIDIA. Sorry for the inconvenience.
  2. I have listed the environment variables used in the code in the README.md.
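For readers hitting the same errors: a minimal sketch of NCCL environment settings that commonly address them. The variable names are real NCCL settings, but the values (and the interface name `eth0`) are illustrative assumptions, not the authors' configuration — adjust them for your cluster.

```python
import os

# Assumed example values; check your own network setup before using them.
os.environ["NCCL_DEBUG"] = "INFO"          # verbose NCCL logging to diagnose init failures
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # network interface for inter-node traffic (assumed name)
os.environ["NCCL_IB_DISABLE"] = "1"        # force TCP transport when "NET/IB : No device found"

# These must be set before initializing the process group, e.g.:
# torch.distributed.init_process_group(backend="nccl", init_method="env://")
```

Setting `NCCL_DEBUG=INFO` is usually the first step, since it makes NCCL print which transport and interface it actually selected.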