primepake / wav2lip_288x288

MIT License
570 stars · 149 forks

CUDA error while training syncnet #13

Closed: NikitaKononov closed this issue 2 years ago

NikitaKononov commented 2 years ago

Hello! I didn't make any changes to the code, but I'm having trouble training syncnet. The filelists are available and so is the data. I get this error at the first checkpoint save:

```
Saved checkpoint: check/checkpoint_step000000001.pth
Traceback (most recent call last):
  File "color_syncnet_train.py", line 279, in <module>
    nepochs=hparams.nepochs)
  File "color_syncnet_train.py", line 161, in train
    loss = cosine_loss(a, v, y)
  File "color_syncnet_train.py", line 136, in cosine_loss
    loss = logloss(d.unsqueeze(1), y)
  File "/home/kadochnikova/.conda/envs/lipsync/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/kadochnikova/.conda/envs/lipsync/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 612, in forward
    return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
  File "/home/kadochnikova/.conda/envs/lipsync/lib/python3.6/site-packages/torch/nn/functional.py", line 2893, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: CUDA error: device-side assert triggered
srun: error: hpe: task 0: Exited with exit code 1
```

I can't find the cause of the error. Could you suggest what the problem might be? Thanks!
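(Not from the original thread, a general debugging note.) A CUDA device-side assert is reported asynchronously, so the stack trace can point away from the real failure; re-running with the environment variable `CUDA_LAUNCH_BLOCKING=1`, or reproducing the failing op on CPU, surfaces a readable message. For this particular assert, `binary_cross_entropy` on CPU states its constraint directly, as a minimal sketch with made-up values shows:

```python
import torch
import torch.nn.functional as F

# Hypothetical out-of-range input: BCE requires values in [0, 1].
inp = torch.tensor([-0.3, 0.5])   # -0.3 is an invalid "probability"
tgt = torch.tensor([0.0, 1.0])
try:
    F.binary_cross_entropy(inp, tgt)
except RuntimeError as e:
    # On CPU this raises a readable RuntimeError instead of a
    # device-side assert, explaining the [0, 1] input constraint.
    print("CPU error:", e)
```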

NikitaKononov commented 2 years ago

The error happens in this function:

```python
def cosine_loss(a, v, y):
    d = nn.functional.cosine_similarity(a, v)
    loss = logloss(d.unsqueeze(1), y)
    return loss
```

ghost commented 2 years ago

You should use BCEWithLogitsLoss to handle the negative scores.
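A minimal sketch of that suggestion (not the repo's actual code): replace the `nn.BCELoss`-based `logloss` with `nn.BCEWithLogitsLoss`, which applies a sigmoid internally and therefore accepts real-valued scores such as a cosine similarity in [-1, 1]:

```python
import torch
import torch.nn as nn

# Sketch: BCEWithLogitsLoss treats its input as a logit and applies
# sigmoid internally, so a negative cosine similarity is a valid input.
logloss = nn.BCEWithLogitsLoss()

def cosine_loss(a, v, y):
    d = nn.functional.cosine_similarity(a, v)  # values in [-1, 1]
    return logloss(d.unsqueeze(1), y)

a = torch.randn(8, 512)                  # dummy audio embeddings
v = torch.randn(8, 512)                  # dummy video embeddings
y = torch.randint(0, 2, (8, 1)).float()  # in-sync / out-of-sync labels
print(cosine_loss(a, v, y))              # finite scalar, no assert
```

One caveat of this swap: sigmoid maps [-1, 1] into roughly [0.27, 0.73], so the predicted probability can never approach 0 or 1 the way the original BCELoss-on-similarity formulation allowed; whether that affects training the expert discriminator is left open in this thread.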

hannarud commented 2 years ago

Hi @primepake @NikitaKononov! I ran into the same error while trying to train with color_syncnet_train.py. It happens because in models/conv2.py @primepake changed the ReLU activation to PReLU. ReLU never produced negative values (in the original wav2lip repo), so using BCELoss was fine. But PReLU does produce negative values, so we really need a different loss function.

I guess there are no suggestions so far? @NikitaKononov, which loss function did you use in the end? Were you able to train the expert discriminator?
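To illustrate the point with made-up values (not the repo's code): ReLU clamps negatives to zero, while PReLU (default negative slope 0.25) passes them through scaled, so scores fed to the loss can go negative, which `nn.BCELoss` rejects but `nn.BCEWithLogitsLoss` accepts:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([[-1.0, 0.5, 2.0]])
print(nn.ReLU()(x))    # negatives clamped to 0
print(nn.PReLU()(x))   # default slope 0.25, so -1.0 -> -0.25

# BCELoss requires inputs in [0, 1]; a negative score raises an error
# (a device-side assert on CUDA, a readable RuntimeError on CPU):
score = torch.tensor([-0.2])
target = torch.tensor([0.0])
try:
    F.binary_cross_entropy(score, target)
except RuntimeError as e:
    print("BCELoss rejects the negative score:", type(e).__name__)

# BCEWithLogitsLoss accepts any real-valued score:
print(F.binary_cross_entropy_with_logits(score, target))
```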