torch / torch7

http://torch.ch

Training code freezes. GPU memory occupied. GPU is idle. #1075

Closed arthitag closed 7 years ago

arthitag commented 7 years ago

Hi, when I try to train a Torch nngraph model on two Titan X GPUs using DataParallelTable, I get the warning "warning: could not load nccl, falling back to default communication". I installed NCCL in my home directory and added the nccl/lib path to LD_LIBRARY_PATH, but I still get the same warning.

Moreover, training runs for a fairly long time (the duration varies) and then simply freezes. While it is frozen, GPU memory is still occupied (a little over 10 GB out of 12), CPU utilization is ~100% (it was ~170% while training was proceeding), and the GPU temperature is low (~23C to 34C, versus ~74C-85C while training was proceeding). Can anyone tell me why this could be happening?
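For anyone who wants to rule out a path problem first, here is a minimal sketch of how the LD_LIBRARY_PATH setup described above could be checked. The function name `find_nccl` is hypothetical, and the sketch assumes the library file is named `libnccl.so` (possibly with a version suffix). It only shows whether such a file exists in a directory on the search path. It does not prove that Torch will load it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torch or NCCL): print every libnccl.so*
# file found in a colon-separated search path. With no argument it scans
# LD_LIBRARY_PATH. Returns 0 if at least one file was found, 1 otherwise.
find_nccl() {
    search_path="${1:-${LD_LIBRARY_PATH:-}}"
    found=1
    old_ifs="$IFS"
    IFS=':'
    for dir in $search_path; do
        # Skip empty entries such as a leading or trailing ':'.
        [ -n "$dir" ] || continue
        for lib in "$dir"/libnccl.so*; do
            # An unmatched glob stays literal, so confirm the file exists.
            if [ -e "$lib" ]; then
                echo "$lib"
                found=0
            fi
        done
    done
    IFS="$old_ifs"
    return "$found"
}
```

Running `find_nccl || echo "no libnccl.so found on LD_LIBRARY_PATH"` in the same shell that launches training shows whether the exported path is actually visible to that process.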