Closed · yj373 closed this issue 1 year ago
I find that this error happens at loss.backward(), right after pretraining finishes (factor = 0.0 and deq_steps = 0). The backward_hook is called once, and the returned result['result']
is a tensor of shape [64, 30720, 1]. The model is trained on a machine with 2 GeForce RTX 2080 Ti GPUs. The configuration file I used is cls_mdeq_TINY.yaml.
Hi,
Thank you for your feedback! We will release a library and a model zoo for DEQs later (with systematically designed code and verified implementations). Hopefully, this will help resolve the training issues.
Until then, you might refer to DEQ-Flow's code to implement your model, or use the phantom grad code to train your MDEQ.
Please wait for our release!
Thanks!
Zhengyang
Hi @yj373 ,
What version of pytorch are you using?
It seems that the backward hook is causing the problem. If the issue persists, I suggest reverting to the custom backward pass approach for implicit differentiation. There is an example here: https://github.com/locuslab/mdeq/blob/master/lib/models/mdeq_forward_backward.py#L32
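In case it helps anyone reading this thread, here is a minimal sketch of that hook-based custom backward. It assumes func(z, x) is the DEQ cell and z_star is the equilibrium returned by the forward solver (both names are placeholders, not the repository's exact API); the real MDEQ code solves the linear system with Broyden's method rather than the naive loop shown here.

import torch
from torch import autograd

def attach_implicit_backward(func, z_star, x, solver_steps=30):
    # Re-enter the equilibrium module once with gradients enabled so we can
    # take Jacobian-vector products at the fixed point z*.
    z = z_star.detach().requires_grad_()
    new_z = func(z, x)

    def backward_hook(grad):
        # Solve g = J_f(z*)^T g + grad by plain fixed-point iteration;
        # MDEQ solves this with a Broyden solver and returns result['result'].
        g = grad
        for _ in range(solver_steps):
            g = autograd.grad(new_z, z, g, retain_graph=True)[0] + grad
        return g

    # Replacing the incoming gradient with g makes the rest of autograd
    # propagate the implicit gradient through one application of func.
    new_z.register_hook(backward_hook)
    return new_z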
Thank you for the reply! I am using torch 1.8.1+cu101. I followed the suggestion from @Gsunshine and trained the MDEQ model using the phantom grad code, and it works fine. Thanks again!
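For anyone who lands here later: the phantom gradient approach avoids the implicit backward solve entirely by unrolling a short damped tail after the equilibrium is found. A minimal sketch, assuming func(z, x) is the DEQ cell and solver is any fixed-point solver (hypothetical names, not the phantom grad repository's API):

import torch

def phantom_grad_step(func, z_init, x, solver, k=5, tau=0.8):
    # 1) Find the equilibrium z* = func(z*, x) without building a graph.
    with torch.no_grad():
        z_star = solver(lambda z: func(z, x), z_init)

    # 2) Unroll k damped iterations from z* with gradients enabled; calling
    #    backward() on the result differentiates only these k steps.
    z = z_star
    for _ in range(k):
        z = (1.0 - tau) * z + tau * func(z, x)
    return z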
Hello,
I am trying to train an MDEQ on an image classification task. Here is the command I used to train the classifier:
python tools/cls_train.py --cfg experiments/cifar/cls_mdeq_TINY.yaml
Everything works fine during the pretraining stage, but when the actual training starts, I get the warning
UserWarning: resource_tracker: There appear to be 14 leaked semaphore objects to clean up at shutdown
and the training terminates. I have tried decreasing BATCH_SIZE_PER_GPU to 16, but that did not solve the issue. Can anyone help me with this problem? Thanks!