agonzalezd opened this issue 2 years ago

Hello.

I am having an issue when using your code. If I try to resume a training from a checkpoint with more than one GPU (I am using Docker containers), I get the following error:

But when starting from scratch, or when using a single GPU, this error does not appear and the training runs flawlessly.

I should add that I have checked that the GPUs were completely free when launching the training.

Any advice on this issue?

Thanks in advance.

Hmm, I haven't run across that error before. Sorry, I don't think I'll be of much help here.

It somehow spawns multiple processes on a single GPU, but only on one of them. I am launching the training on 4 GPUs: three of them each spawn a single process, but one of them spawns 4 processes. If I change to 3 GPUs, the same happens: one of the GPUs spawns 3 processes. I cannot find anything in the code that forces this behaviour.
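The "N processes piling up on one GPU only when resuming" symptom described above is, assuming the training code uses PyTorch DDP (the thread never names the framework), often caused by loading the checkpoint without a `map_location`: `torch.load` restores CUDA tensors to the device they were saved from (typically `cuda:0`), so every rank opens an extra CUDA context on GPU 0. A minimal sketch of a device-aware load; the helper name and checkpoint path are hypothetical:

```python
import torch


def load_checkpoint(path, map_to):
    # Hypothetical helper, not from the repo in question.
    # map_to would be f"cuda:{local_rank}" per DDP rank, or "cpu" to load
    # on the host first and move tensors afterwards. Without map_location,
    # torch.load places CUDA tensors back on the device they were saved
    # from (usually cuda:0), which would make every rank touch GPU 0 --
    # matching the N-processes-on-one-GPU symptom when resuming.
    return torch.load(path, map_location=map_to)


# Example: save a small state dict and reload it mapped to CPU.
torch.save({"w": torch.ones(2)}, "/tmp/ckpt_demo.pt")
state = load_checkpoint("/tmp/ckpt_demo.pt", "cpu")
print(state["w"].device.type)
```

Starting fresh never hits this because there is no saved device metadata to restore, which would explain why only resuming fails on multi-GPU.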