Closed kevin00616 closed 3 months ago
Hello, did you use the VoxCeleb/SpeakerRec recipe? When I run this recipe, 8 GPUs take 5 hours. I think this is unusual.
@Adel-Moumen is the fit batch properly wrapping with no_sync in this recipe?
Actually, I don't think this issue is related to the fit_batch function, but I'll need the exact error message to get a better idea (maybe it's linked to our old ckpt issue?). Could you please share the error message @kevin00616?
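For context on the no_sync question above: in PyTorch, gradient accumulation under DistributedDataParallel is usually done by skipping the gradient all-reduce on intermediate micro-batches with `no_sync()` and synchronizing only on the last one. This is a minimal generic sketch of that pattern (not SpeechBrain's actual `fit_batch`; it uses a single-process gloo group so it runs on CPU):

```python
import contextlib
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process "gloo" group so the sketch runs on CPU without GPUs.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29555")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(4, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4

micro_batches = [torch.randn(8, 4) for _ in range(accum_steps)]
for step, batch in enumerate(micro_batches):
    # Skip the all-reduce on intermediate micro-batches;
    # synchronize gradients only on the last one.
    is_last = (step + 1) % accum_steps == 0
    ctx = contextlib.nullcontext() if is_last else model.no_sync()
    with ctx:
        loss = model(batch).pow(2).mean() / accum_steps
        loss.backward()

opt.step()
opt.zero_grad()
dist.destroy_process_group()
```

Without `no_sync()`, every `backward()` triggers an all-reduce, which multiplies inter-GPU traffic by the accumulation factor and can explain unexpectedly slow multi-GPU epochs.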
Closing since not reproducible/no new details and likely a setup or hardware issue.
Describe the bug
I tried to resume training of ECAPA-TDNN with speechbrain/spkrec-ecapa-voxceleb. I trained with one GPU and with multiple GPUs. In both cases the computer crashes for unknown reasons and can only be recovered by restarting it. With a single GPU, training lasts 3 to 4 hours before crashing, but with multiple GPUs it might crash in less than half an hour, sometimes in under 10 minutes. I have absolutely no clue what happened. Can anyone tell me what is going on or give me some advice?
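A full-machine hard crash that appears faster under multi-GPU load often points to power or thermal limits rather than software. One way to check is to log GPU telemetry to disk during training, so the last readings before the crash survive a reboot. A sketch (the log path `gpu_watch.log` is just an example):

```shell
#!/bin/sh
LOG=gpu_watch.log  # hypothetical log path; pick any writable location
echo "gpu watch started $(date -u)" > "$LOG"

# Sample temperature, power draw, and utilization every 5 seconds.
# nohup keeps the logger alive if the training shell dies first.
if command -v nvidia-smi >/dev/null 2>&1; then
  nohup nvidia-smi \
    --query-gpu=timestamp,index,temperature.gpu,power.draw,utilization.gpu \
    --format=csv -l 5 >> "$LOG" 2>&1 &
fi
```

If the last logged temperatures or power draws are near the card's limits right before each crash, the issue is likely cooling or the power supply, not SpeechBrain.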
Expected behaviour
Normal training.
To Reproduce
No response
Environment Details
GPUs: Tesla V100 ×2
SpeechBrain: 0.5.14
Python: 3.7.16
PyTorch: 1.12.1
CUDA: 11.4
Relevant Log Output
No response
Additional Context
No response