speechbrain / speechbrain

A PyTorch-based Speech Toolkit
http://speechbrain.github.io
Apache License 2.0

I don't know what happened during training #2186

Closed: kevin00616 closed this issue 3 months ago

kevin00616 commented 1 year ago

Describe the bug

I tried to resume training of ECAPA-TDNN from the pretrained speechbrain/spkrec-ecapa-voxceleb model. I trained on both a single GPU and multiple GPUs, but the computer crashes for unknown reasons and can only be recovered by rebooting. With a single GPU, training lasts 3 to 4 hours before crashing; with multiple GPUs, it crashes in under half an hour, sometimes in under 10 minutes. I have absolutely no clue what is happening. Can anyone tell me what is going on, or give me some advice?

Expected behaviour

Normal training.

To Reproduce

No response

Environment Details

GPUs: 2x Tesla V100
SpeechBrain: 0.5.14
Python: 3.7.16
PyTorch: 1.12.1
CUDA: 11.4

Relevant Log Output

No response

Additional Context

No response

wntg commented 1 year ago

Hello, did you use the VoxCeleb/SpeakerRec recipe? When I run that recipe, it takes about 5 hours on 8 GPUs. I think what you are describing is unusual.

TParcollet commented 1 year ago

@Adel-Moumen is fit_batch properly wrapped with no_sync in this recipe?
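(For readers unfamiliar with the question: `no_sync` is the context manager on `torch.nn.parallel.DistributedDataParallel` that skips the inter-GPU gradient all-reduce during intermediate gradient-accumulation steps. The sketch below illustrates the pattern being asked about, assuming a generic DDP training loop; it is not SpeechBrain's actual fit_batch implementation.)

```python
# Minimal sketch of gradient accumulation under DDP: no_sync() suppresses
# the gradient all-reduce on every micro-batch except the last one of the
# accumulation window, so gradients accumulate locally on each rank.
import contextlib

def accumulated_step(ddp_model, optimizer, loss_fn, micro_batches):
    optimizer.zero_grad()
    n = len(micro_batches)
    for i, (inputs, targets) in enumerate(micro_batches):
        last = (i + 1) == n
        # Only the final backward() synchronizes gradients across ranks.
        ctx = contextlib.nullcontext() if last else ddp_model.no_sync()
        with ctx:
            loss = loss_fn(ddp_model(inputs), targets) / n
            loss.backward()
    optimizer.step()
```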

Adel-Moumen commented 9 months ago

> @Adel-Moumen is fit_batch properly wrapped with no_sync in this recipe?

Actually, I don't think this issue is related to the fit_batch function, but I'd need the exact error message to get a better idea (maybe it is linked to our old ckpt issue?). Could you please share the error message, @kevin00616?

asumagic commented 3 months ago

Closing: this is not reproducible, no new details were provided, and it is likely a setup or hardware issue.