Hi,
when I try to train a Torch nngraph model on two Titan X GPUs using DataParallelTable, I get a warning saying
warning: could not load nccl, falling back to default communication.
I installed NCCL in my home directory and added its nccl/lib path to LD_LIBRARY_PATH, but I still get the same warning.
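For reference, this is roughly what I did to expose the library; the install prefix here is a hypothetical placeholder for wherever NCCL actually lives:

```shell
# Hypothetical install prefix -- adjust to wherever NCCL was actually built/installed.
NCCL_HOME="$HOME/nccl"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:${LD_LIBRARY_PATH:-}"

# Sanity-check that the lib directory is on the search path for
# processes (e.g. th) launched from this same shell:
case ":$LD_LIBRARY_PATH:" in
  *":$NCCL_HOME/lib:"*) echo "nccl lib dir on LD_LIBRARY_PATH" ;;
  *)                    echo "nccl lib dir missing" ;;
esac
```

Note that the exported variable only affects processes started from this shell session, so I make sure to launch training from the same terminal.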
Moreover, after starting, training proceeds for some time (the duration varies) before it simply freezes. I can see that GPU memory is still occupied (~10.x GB out of 12), CPU utilization is ~100% (it was ~170% while training was progressing), and the GPU temperature is low (~23–34 °C, versus ~74–85 °C while training was progressing).
Can anyone tell me why this could be happening?