SunQpark closed this pull request 6 years ago.
I agree with your reasoning for using "n_gpu" instead of indices. Could you add a one-sentence summary of this reasoning to README.md, perhaps together with a reference or a snippet showing how the "CUDA_VISIBLE_DEVICES" prefix is to be used?
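For example, something along these lines is what I have in mind (just my guess at the intended usage, with `train.py` standing in for the actual entry point):

```python
# Shell-prefix usage, e.g.:
#   CUDA_VISIBLE_DEVICES=2,3 python train.py
# The same restriction can be applied from inside Python as well, as long as
# the variable is set before the first CUDA call:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '2,3'  # expose only GPUs 2 and 3 to PyTorch

import torch
print(torch.cuda.device_count())  # prints 2 on a machine with at least four GPUs
```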
Will try running your code.
Oh yes, I will add that, thank you for the quick response.
I just got a new idea: adding it as a command line option, such as `--device 0,1,2,3`. What do you think about that?
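Roughly something like this, just a sketch for now (the flag name and where it sits in `train.py` are not final):

```python
import argparse
import os

# Hypothetical wiring for a --device option; only a sketch, not code in this PR.
parser = argparse.ArgumentParser(description='training script')
parser.add_argument('-d', '--device', default=None, type=str,
                    help='comma-separated indices of GPUs to enable (default: all)')
args = parser.parse_args()

if args.device is not None:
    # Equivalent to prefixing the command with CUDA_VISIBLE_DEVICES=...;
    # it has to be set before the first CUDA call to take effect.
    os.environ['CUDA_VISIBLE_DEVICES'] = args.device
```

Then `python train.py --device 0,1,2,3` would behave the same as the environment-variable prefix.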
That sounds like a nice solution. I often have to select GPUs, as more people are using the main server I work on.
Speaking of which: there seems to be a problem with that server right now, so I have not yet run your code. Currently waiting for a reply from the IT department. Will do the checks you request ASAP (but might be tomorrow) and report back.
Hi,
Did some testing. Looks good.
The example model works well on a server with 4 x Nvidia Quadro P6000.
Played around with the settings and argparse arguments. Warning messages seem to be handled nicely (e.g. when fewer GPUs are available than config.json specifies).
Seems to work (although I have not monitored threads).
Maybe rename "n_cpu" to "n_cpu_threads" or "n_cpu_workers"? Maybe the latter, as some people are familiar with "num_workers".
I'm glad the test could be done so fast.
Let's just call it `num_workers`, as the original `DataLoader` does.
Now I think this PR is stable enough to be merged.
This PR handles two kinds of multiprocessing: the first is multi-CPU for data loading, and the second is multi-GPU training (data parallel).
Multi-CPU is simply done by adding an `n_cpu` argument to the config file and passing it as the `num_workers` option of the PyTorch native `DataLoader`.
Multi-GPU can be controlled with the `n_gpu` option in the config file, which replaces the previous GPU index and cuda options. Training without a GPU is still possible with `"n_gpu": 0`.
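In case the intent is not obvious from the description, the two options are consumed roughly like this (a simplified sketch with hypothetical stand-in values, not the exact code in this PR):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-ins for the parsed config.json, dataset, and model.
config = {'batch_size': 32, 'n_cpu': 4, 'n_gpu': 2}
dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
model = nn.Linear(10, 2)

# Multi-CPU: n_cpu is forwarded to DataLoader's num_workers option.
data_loader = DataLoader(dataset,
                         batch_size=config['batch_size'],
                         shuffle=True,
                         num_workers=config['n_cpu'])

# Multi-GPU: n_gpu selects between CPU, a single GPU, and DataParallel,
# falling back to however many GPUs are actually available.
n_gpu = min(config['n_gpu'], torch.cuda.device_count())
device = torch.device('cuda:0' if n_gpu > 0 else 'cpu')
model = model.to(device)
if n_gpu > 1:
    model = nn.DataParallel(model, device_ids=list(range(n_gpu)))
```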
Specifying the GPU indices to use is possible externally with the environment variable `CUDA_VISIBLE_DEVICES=0,1`. I considered adding the GPU indices to the config file instead of the `n_gpu` option and setting them in `train.py`, but that would save the GPU indices to the checkpoint, which can be problematic when resuming.

Tested on 3 machines:
- my laptop: pytorch 0.4.1, no GPU
- server1: pytorch 0.4.1, 8 Tesla K80, cuda 9.1
- server2: pytorch 0.4.0, 4 GTX 1080, cuda 9.0
It worked fine for all of the conditions I have tested, but I heard one of my friends say that giving a non-zero value to the `num_workers` option raised an exception on her machine. So please tell me if anything goes wrong.

I'll update the README file later.
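As for the `num_workers` exception: just a guess on my part, not something confirmed for her machine, but a common cause is that platforms which spawn worker processes (e.g. Windows) require the entry point to be guarded, roughly like this:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
    # num_workers > 0 starts separate worker processes for data loading
    loader = DataLoader(dataset, batch_size=8, num_workers=2)
    for data, target in loader:
        pass  # the training step would go here

if __name__ == '__main__':
    # Without this guard, spawning DataLoader workers can raise a RuntimeError
    # during process bootstrapping.
    main()
```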