Pytorch training error when using the MSG3D config

cannon281 commented 1 month ago

Hi,

I have been using the pyskl framework with the specified conda environment to train PoseC3D and STGCN++. Training and inference work fine. However, when I tried the MSG3D config (configs/msg3d/msg3d_pyskl_ntu60_xsub_hrnet), PyTorch throws an error about an in-place operation in the model as soon as training starts. I tried setting the ReLU activations in msg3d to inplace=False, without success. Any help is much appreciated.

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [32, 192, 25, 85]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 5051) of binary: /opt/conda/envs/pyskl/bin/python

HenryMantilla commented 1 month ago

Hi, I was facing the same problem today. What worked for me was changing these lines in the msg3d_utils.py file: line 139, line 232, and line 316. Basically, replace every "something1 += something2" with "something1 = something1 + something2".
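
For anyone wondering why that replacement helps, here is a minimal sketch that should reproduce the same class of error on CPU. It assumes the failing pattern is an in-place add applied to a ReLU output; the function names (forward_inplace, forward_out_of_place, backward_succeeds) are made up for illustration and are not taken from pyskl. ReLU keeps its output for the backward pass, so "+=" overwrites a tensor autograd still needs, while "a = a + b" writes into a new tensor.

```python
import torch


def forward_inplace(x, w):
    out = torch.relu(x * w)  # autograd saves this ReLU output for backward
    out += x                 # in-place add overwrites the saved tensor
    return torch.relu(out).sum()


def forward_out_of_place(x, w):
    out = torch.relu(x * w)
    out = out + x            # new tensor; the saved ReLU output is untouched
    return torch.relu(out).sum()


def backward_succeeds(forward_fn):
    """Run one forward/backward pass and report whether backward worked."""
    x = torch.randn(4, 3)
    w = torch.randn(4, 3, requires_grad=True)
    try:
        forward_fn(x, w).backward()
    except RuntimeError:
        return False
    return w.grad is not None


if __name__ == "__main__":
    print("in-place add:    ", backward_succeeds(forward_inplace))
    print("out-of-place add:", backward_succeeds(forward_out_of_place))
```

The out-of-place version costs one extra allocation per add, which is usually negligible next to the convolutions around it.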

cannon281 commented 1 month ago

Hi @HenryMantilla, thanks for the help. My fix was similar to what you mentioned: the change below made training work for me.

File: pyskl/models/gcns/utils/msg3d_utils.py

Starting at line 232, I changed

out += res
return self.act(out)

to this

out_res = out + res
return self.act(out_res)
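
To show the same pattern in context, here is a rough sketch of a residual unit written with the out-of-place add. ResidualUnit and relu_output_version are hypothetical names, and the layer layout (a 1x1 convolution followed by ReLU) is an assumption chosen to keep the example small; it is not the actual MSG3D block. The second function reads the tensor's version counter (the private _version attribute), which is the number the error message refers to with "is at version 1; expected version 0".

```python
import torch
import torch.nn as nn


class ResidualUnit(nn.Module):
    """Hypothetical stand-in for the patched block; not the pyskl code."""

    def __init__(self, channels):
        super().__init__()
        # The branch ends in ReLU, whose output autograd keeps for backward.
        self.branch = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.ReLU())
        self.act = nn.ReLU()

    def forward(self, x):
        res = x
        out = self.branch(x)
        out_res = out + res  # new tensor; `out` keeps its version
        return self.act(out_res)


def relu_output_version(inplace):
    """Return the version counter of a ReLU output after a residual add."""
    x = torch.randn(2, 3)
    out = torch.relu(x)
    if inplace:
        out += x             # modifies `out` itself, bumping its version
    else:
        _ = out + x          # result goes to a new tensor
    return out._version


if __name__ == "__main__":
    unit = ResidualUnit(channels=8)
    y = unit(torch.randn(2, 8, 5, 4))
    y.sum().backward()       # completes without the in-place error
    print(relu_output_version(inplace=True), relu_output_version(inplace=False))
```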

kennymckormick / pyskl

Pytorch training error when using the MSG3D config #242