Closed simonefrancia closed 4 years ago
Hi! That trick was to avoid CUDA errors when training in a multi-GPU, multi-process setting that uses `fork`. At least in our settings, the CUDA API must not be called before the time of forking (a call that happens if there are more than 2 visible GPUs); otherwise, subsequent CUDA usage in the subprocesses will fail. More context here.
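For reference, the trick boils down to something like the sketch below (hypothetical helper names, not the repo's exact code): query the device count in a fresh child interpreter, so the parent process never initializes CUDA before it forks its workers.

```python
import subprocess
import sys

def run_in_fresh_interpreter(code: str) -> str:
    """Run a snippet in a brand-new Python process and return its stdout.
    Any CUDA context the child creates dies with the child."""
    out = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

def cuda_device_count() -> int:
    # torch is imported only inside the child, so the parent process
    # stays CUDA-free and can still fork safely afterwards.
    return int(run_in_fresh_interpreter(
        "import torch; print(torch.cuda.device_count())"
    ))
```

Calling `torch.cuda.device_count()` directly in the parent would initialize the CUDA driver in that process, which is exactly what the trick avoids.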
This only applies when the subprocesses use the `fork` start method; it won't be a problem if `spawn` is used. I remember having an issue with locating the correct module when I used `spawn`, but it should be possible to fix the current script to use `spawn`, which is how `torch.multiprocessing.spawn` does the same job.
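The fork/spawn difference can be illustrated with the standard library's `multiprocessing` alone (a minimal sketch, with a plain dict standing in for process-wide state such as an initialized CUDA context):

```python
import multiprocessing as mp

# Stand-in for process-wide state, e.g. an initialized CUDA context.
STATE = {"cuda_touched": False}

def _report(q) -> None:
    # The child reports whether it inherited the parent's mutated state.
    q.put(STATE["cuda_touched"])

def child_sees_parent_state(method: str) -> bool:
    """Launch one child with the given start method ('fork' or 'spawn')
    and return what it observed. With 'fork' the child inherits the
    parent's mutated state; with 'spawn' a fresh interpreter re-imports
    this module and starts from the clean default."""
    STATE["cuda_touched"] = True  # parent "calls a CUDA API" before starting workers
    ctx = mp.get_context(method)
    q = ctx.Queue()
    p = ctx.Process(target=_report, args=(q,))
    p.start()
    seen = q.get()
    p.join()
    return seen
```

With a real CUDA context the forked child inherits driver state it cannot safely reuse, which is why the errors only show up under `fork`.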
Thank you! It's a little difficult for me to understand, so if you have any other links, I would be grateful to have them.
If you're unfamiliar with forking (as a process-management concept of Unix-like operating systems) and want to know more, I'd suggest taking a course on operating systems, like this. But it was just a trick, and you shouldn't worry too much about it if it works for you. I'll close this!
Hi, while reading your code I saw an interesting thing:
In this case you call a subprocess from the main process to check how many CUDA devices are available. My question is: what is the difference between your version and running this command directly, for example:
I think you did it that way because the two differ somehow, but I don't know in what way.
Thanks