svip-lab / LBYLNet

[CVPR2021] Look before you leap: learning landmark features for one-stage visual grounding.

Training getting stopped after some time #2

Closed abhinavkaul95 closed 3 years ago

abhinavkaul95 commented 3 years ago

Hi @piaozhx,

While trying to train the model, training stops after some time and I get the following error at the end:

/usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 32 leaked semaphores to clean up at shutdown
  len(cache))

There are some more errors in between as well, but this was the final one. The complete training logs are attached to the issue: logs_training.txt

hbb1 commented 3 years ago

@abhinavkaul95 I am the author @hbb1, not @piaozhx. Please don't bother him 🤣.

According to logs_training.txt, the error is caused by CUDA running out of memory:

    result = self.forward(*input, **kwargs)
  File "/media/disk/user/abhinav/LBYLNet/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/parallel/optimized_sync_batchnorm.py", line 85, in forward
    return SyncBatchnormFunction.apply(input, z, self.weight, self.bias, self.running_mean, self.running_var, self.eps, self.training or not self.track_running_stats, exponential_average_factor, self.process_group, channel_last, self.fuse_relu)
  File "/media/disk/user/abhinav/LBYLNet/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/parallel/optimized_sync_batchnorm_kernel.py", line 64, in forward
    out = syncbn.batchnorm_forward(input, mean, inv_std, weight, bias)
RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 7.80 GiB total capacity; 3.80 GiB already allocated; 17.50 MiB free; 3.82 GiB reserved in total by PyTorch) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:289)

Make sure you have sufficient GPU memory, or reduce the batch size. In my configuration, I use two TITAN X GPUs with a batch size of 64.
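
If you are unsure how much memory is actually available before launching training, you can query it from PyTorch. A minimal sketch (plain PyTorch, independent of this repo's code):

    import torch

    # Print per-GPU totals so you can pick a batch size that fits.
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        total = props.total_memory / 1024 ** 3
        reserved = torch.cuda.memory_reserved(i) / 1024 ** 3
        print(f"GPU {i} ({props.name}): {total:.1f} GiB total, "
              f"{reserved:.1f} GiB currently reserved by PyTorch")

Your log shows a 7.80 GiB card, so the batch size that works for two TITAN X cards will not fit as-is.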

abhinavkaul95 commented 3 years ago

Sure @hbb1, I think I was just blindly commenting on the issue instead of checking who was handling it. :rofl: :rofl: :rofl:

I have reduced the batch size, and that seems to have solved the issue. I will let you know if it recurs, though that is unlikely. Thanks. :smiley: :+1:
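
In case it helps anyone else on a small GPU: since the paper's configuration uses an effective batch size of 64, gradient accumulation is one common way to keep that effective size while fitting smaller micro-batches in memory. A toy sketch of the idea in plain PyTorch (the model, optimizer, and data below are stand-ins, not this repo's actual training loop):

    import torch
    import torch.nn as nn

    # Toy stand-ins for the real model, optimizer, and data loader.
    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    batches = [(torch.randn(16, 10), torch.randn(16, 1)) for _ in range(8)]

    accum_steps = 4  # micro-batch of 16 x 4 steps = effective batch of 64

    optimizer.zero_grad()
    for step, (x, y) in enumerate(batches):
        loss = loss_fn(model(x), y)
        (loss / accum_steps).backward()  # scale so accumulated grads average
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

One caveat: batch-norm statistics are still computed per micro-batch, so with (Sync)BatchNorm this is not exactly equivalent to a true batch of 64.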