victoresque / pytorch-template

PyTorch deep learning projects made easy.

Multi-GPU training (data parallel) #28

Closed SunQpark closed 6 years ago

SunQpark commented 6 years ago

This PR handles two kinds of parallelism. The first is multi-process data loading on the CPU, and the second is multi-GPU training (data parallel).

Multi-CPU loading is simply done by adding an n_cpu argument to the config file and passing it as the num_workers option of the native PyTorch DataLoader.
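For reference, here is a minimal sketch of that wiring (the config handling and dataset are illustrative, not the exact template code):

```python
import json
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Load the training config; 'n_cpu' is the key added by this PR.
with open('config.json') as f:
    config = json.load(f)

dataset = datasets.MNIST('data/', train=True, download=True,
                         transform=transforms.ToTensor())

# 'n_cpu' is forwarded as num_workers: 0 loads data in the main
# process, N > 0 spawns N worker processes.
loader = DataLoader(dataset, batch_size=32, shuffle=True,
                    num_workers=config['n_cpu'])
```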

Multi-GPU training can be controlled with the n_gpu option in the config file, which replaces the previous gpu index and cuda options. Training without a GPU is still possible with "n_gpu": 0.
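The device setup behind n_gpu amounts to something like the sketch below (function and variable names here are illustrative, not necessarily those used in the template):

```python
import torch
import torch.nn as nn

def prepare_device(n_gpu_use):
    """Pick the training device and the GPU ids for DataParallel,
    based on the 'n_gpu' value from the config."""
    n_gpu_available = torch.cuda.device_count()
    # Fall back gracefully if fewer GPUs exist than requested.
    n_gpu_use = min(n_gpu_use, n_gpu_available)
    device = torch.device('cuda:0' if n_gpu_use > 0 else 'cpu')
    return device, list(range(n_gpu_use))

model = nn.Linear(10, 2)  # stand-in for the real model
device, device_ids = prepare_device(2)  # "n_gpu": 2
model = model.to(device)
if len(device_ids) > 1:
    # Replicate the model across the selected GPUs (data parallel).
    model = nn.DataParallel(model, device_ids=device_ids)
```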

Specifying the GPU indices to use is still possible externally, with the environment variable CUDA_VISIBLE_DEVICES=0,1. I considered adding the GPU indices to the config file instead of the n_gpu option and setting them in train.py, but that would save the GPU indices to the checkpoint, which can be problematic when resuming.
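As a usage example (the script and config names are placeholders), the variable is set per invocation, and PyTorch then only sees the listed devices:

```python
# Launched from the shell as, e.g.:
#   CUDA_VISIBLE_DEVICES=0,1 python train.py -c config.json
import torch

# With the variable above, only physical GPUs 0 and 1 are visible:
# device_count() reports 2, and 'cuda:0' maps to physical GPU 0.
print(torch.cuda.device_count())
```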

Tested on 3 machines:

- my laptop: PyTorch 0.4.1, no GPU
- server1: PyTorch 0.4.1, 8× Tesla K80, CUDA 9.1
- server2: PyTorch 0.4.0, 4× GTX 1080, CUDA 9.0

It worked fine under all of the conditions I tested, but a friend of mine said that giving a non-zero value to the num_workers option raised an exception on her machine. So, please tell me if anything goes wrong.

I'll update the README file later.

borgesa commented 6 years ago

I agree with your reasoning for using "n_gpu" instead of indices. Could you add a one-sentence summary of this reasoning to README.md, perhaps along with a reference to, or snippet of, how the "CUDA_VISIBLE_DEVICES" prefix is to be used?

Will try running your code.

SunQpark commented 6 years ago

Oh yes, I will add that, thank you for the quick response. I just had a new idea: adding it as a command line option, such as --device 0,1,2,3. What do you think of that?
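Something like this sketch is what I have in mind (the flag name and wiring are tentative):

```python
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('-c', '--config', default='config.json', type=str)
parser.add_argument('-d', '--device', default=None, type=str,
                    help='indices of GPUs to enable, e.g. "0,1,2,3"')
args = parser.parse_args()

if args.device is not None:
    # Must be set before the first CUDA call, so that torch only
    # ever sees the listed devices.
    os.environ['CUDA_VISIBLE_DEVICES'] = args.device
```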

borgesa commented 6 years ago

That sounds like a nice solution. I often have to select GPUs, as more people are using the main server I work on.

Speaking of which: there seems to be a problem with that server right now, so I have not yet run your code. Currently waiting for a reply from the IT department. Will do the checks you requested ASAP (but it might be tomorrow) and report back.

borgesa commented 6 years ago

Hi,

Did some testing. Looks good.

GPUs

The example model works well on a server with 4 x Nvidia Quadro P6000.

Played around with the settings and argparse arguments. Warning messages seem to be handled nicely (e.g. when fewer GPUs are available than config.json specifies).
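I assume the graceful fallback amounts to a check along these lines (a sketch, not the actual template code):

```python
import warnings
import torch

def limit_gpu_request(n_gpu_requested):
    """Warn and fall back when config.json asks for more GPUs
    than the machine provides."""
    n_gpu_available = torch.cuda.device_count()
    if n_gpu_requested > n_gpu_available:
        warnings.warn(
            "{} GPUs configured, but only {} available; using {}."
            .format(n_gpu_requested, n_gpu_available, n_gpu_available))
        n_gpu_requested = n_gpu_available
    return n_gpu_requested
```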

CPU threads

Seems to work (although I have not monitored threads).

Maybe rename "n_cpu" to "n_cpu_threads" or "n_cpu_workers"? Perhaps the latter, as people are already familiar with "num_workers".

SunQpark commented 6 years ago

I'm glad the testing could be done so quickly. Let's just call it num_workers, as the original DataLoader does. I now think this PR is stable enough to be merged.