Closed simonefrancia closed 4 years ago
Hi! That trick was to avoid CUDA errors when training in a multi-GPU, multi-process setting that uses `fork`. At least in our settings, the CUDA API must not be called before the time of forking (a call that happens if there are more than 2 visible GPUs); otherwise, subsequent CUDA usage in the subprocesses will fail. More context here.
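For reference, the trick boils down to something like the sketch below (hypothetical helper names, not the repo's exact code): query the device count in a fresh child interpreter, so the parent process never initializes CUDA before it forks its workers.

```python
import subprocess
import sys

def run_in_fresh_interpreter(code: str) -> str:
    """Run a snippet in a brand-new Python process and return its stdout.
    Any CUDA context the child creates dies with the child."""
    out = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

def cuda_device_count() -> int:
    # torch is imported only inside the child, so the parent process
    # stays CUDA-free and can still fork safely afterwards.
    return int(run_in_fresh_interpreter(
        "import torch; print(torch.cuda.device_count())"
    ))
```

Calling `torch.cuda.device_count()` directly in the parent would initialize the CUDA driver in that process, which is exactly what the trick avoids.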
This only applies when the subprocesses use the `fork` start method; it won't be a problem if `spawn` is used. I remember having an issue with locating the correct module when I used `spawn`, but it should be possible to fix the current script to use `spawn`, which is how `torch.multiprocessing.spawn` does the same job.
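The fork/spawn difference can be illustrated with the standard library's `multiprocessing` alone (a minimal sketch, with a plain dict standing in for process-wide state such as an initialized CUDA context):

```python
import multiprocessing as mp

# Stand-in for process-wide state, e.g. an initialized CUDA context.
STATE = {"cuda_touched": False}

def _report(q) -> None:
    # The child reports whether it inherited the parent's mutated state.
    q.put(STATE["cuda_touched"])

def child_sees_parent_state(method: str) -> bool:
    """Launch one child with the given start method ('fork' or 'spawn')
    and return what it observed. With 'fork' the child inherits the
    parent's mutated state; with 'spawn' a fresh interpreter re-imports
    this module and starts from the clean default."""
    STATE["cuda_touched"] = True  # parent "calls a CUDA API" before starting workers
    ctx = mp.get_context(method)
    q = ctx.Queue()
    p = ctx.Process(target=_report, args=(q,))
    p.start()
    seen = q.get()
    p.join()
    return seen
```

With a real CUDA context the forked child inherits driver state it cannot safely reuse, which is why the errors only show up under `fork`.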
Thank you! It's a little difficult for me to understand, so if you have any other links, I would be grateful to have them.
If you're unfamiliar with forking (as a process-management concept of Unix-like operating systems) and want to know more, I'd suggest taking a course on operating systems, like this. But it was just a trick, and you shouldn't worry too much about it if it works for you. I'll close this!
Hi, while reading your code I saw an interesting thing:
In this case you call a subprocess from the main process to check how many CUDA devices are available. My question is: what is the difference between your version and running this command directly, for example:
I think you did it that way because the two differ somehow, but I don't know in what way.
Thanks