Closed · yj373 closed this issue 1 year ago
I find that this error happens at loss.backward(), right after pretraining finishes (factor = 0.0 and deq_steps = 0). The backward_hook is called once, and the returned result['result']
is a tensor of shape [64, 30720, 1]. The model is trained on a machine with 2 GeForce RTX 2080 Ti GPUs. The configuration file I used is cls_mdeq_TINY.yaml.
Hi,
Thank you for your feedback! We will release a library and a model zoo for DEQs later (with systematically designed code and verified implementations). Hopefully, this will help resolve the training issues.
Until then, you might refer to DEQ-Flow's code to implement your model, or use the phantom grad code to train your MDEQ.
Please wait for our release!
Thanks!
Zhengyang
Hi @yj373 ,
What version of pytorch are you using?
It seems that the backward hook is causing the problem. If the issue persists, I suggest reverting to the custom backward pass approach for implicit differentiation. There is an example here: https://github.com/locuslab/mdeq/blob/master/lib/models/mdeq_forward_backward.py#L32
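In case it helps anyone reading this thread, here is a minimal sketch of that hook-based custom backward. It assumes func(z, x) is the DEQ cell and z_star is the equilibrium returned by the forward solver (both names are placeholders, not the repository's exact API); the real MDEQ code solves the linear system with Broyden's method rather than the naive loop shown here.

import torch
from torch import autograd

def attach_implicit_backward(func, z_star, x, solver_steps=30):
    # Re-enter the equilibrium module once with gradients enabled so we can
    # take Jacobian-vector products at the fixed point z*.
    z = z_star.detach().requires_grad_()
    new_z = func(z, x)

    def backward_hook(grad):
        # Solve g = J_f(z*)^T g + grad by plain fixed-point iteration;
        # MDEQ solves this with a Broyden solver and returns result['result'].
        g = grad
        for _ in range(solver_steps):
            g = autograd.grad(new_z, z, g, retain_graph=True)[0] + grad
        return g

    # Replacing the incoming gradient with g makes the rest of autograd
    # propagate the implicit gradient through one application of func.
    new_z.register_hook(backward_hook)
    return new_z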
Thank you for the reply! I am using torch 1.8.1+cu101. I followed the suggestion from @Gsunshine and trained the MDEQ model using the phantom grad code, and it works fine. Thanks again!
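For anyone who lands here later: the phantom gradient approach avoids the implicit backward solve entirely by unrolling a short damped tail after the equilibrium is found. A minimal sketch, assuming func(z, x) is the DEQ cell and solver is any fixed-point solver (hypothetical names, not the phantom grad repository's API):

import torch

def phantom_grad_step(func, z_init, x, solver, k=5, tau=0.8):
    # 1) Find the equilibrium z* = func(z*, x) without building a graph.
    with torch.no_grad():
        z_star = solver(lambda z: func(z, x), z_init)

    # 2) Unroll k damped iterations from z* with gradients enabled; calling
    #    backward() on the result differentiates only these k steps.
    z = z_star
    for _ in range(k):
        z = (1.0 - tau) * z + tau * func(z, x)
    return z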
Hello,
I am trying to train an MDEQ on an image classification task. Here is the command I used to train the classifier:
python tools/cls_train.py --cfg experiments/cifar/cls_mdeq_TINY.yaml
Everything works fine during the pretraining stage, but when the actual training starts, I get the warning
UserWarning: resource_tracker: There appear to be 14 leaked semaphore objects to clean up at shutdown
and the training terminates. I have tried decreasing BATCH_SIZE_PER_GPU to 16, but that did not solve the issue. Can anyone help me with this problem? Thanks!