Closed — issue by lidongke, closed 4 years ago
I wouldn't recommend using Python's native multiprocessing.Queue(), especially when you're using GPUs. If you wish to parallelize data, DataParallel or its distributed version is a better option: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
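To make the suggestion concrete, here is a minimal sketch of wrapping a model in DataParallel instead of handing tensors between processes through multiprocessing.Queue(). The layer and batch sizes are made up for illustration; on a machine with fewer than two GPUs the wrap is skipped and the code runs unchanged on CPU.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
if torch.cuda.device_count() > 1:
    # DataParallel replicates the module across visible GPUs and
    # scatters each input batch along dim 0
    model = nn.DataParallel(model)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

x = torch.randn(32, 128, device=device)
out = model(x)  # outputs are gathered back onto the default device
print(out.shape)  # torch.Size([32, 10])
```

Because the wrapping is a single line, it is easy to keep the same training loop for the single-GPU and multi-GPU cases.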
I see that DistributedDataParallel can do model parallelism with multiple processes. In your SLM-Lab I use 'global_nets' with shared memory across processes; I guess I can use DistributedDataParallel instead of shared memory when I am using GPUs, am I right?
That's right. I think DataParallel suffices if you're not doing multi-node distributed training, but you'd have to write custom code to do that.
But I see that DataParallel can only be used with a single process and multiple GPUs, am I right? https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
DataParallel is single-process and multi-threaded, and only works on a single machine, while DistributedDataParallel is multi-process and works for both single- and multi-machine training.
And if I want to use DataParallel or DistributedDataParallel to test this, is it easy for me to make that change?
OK, then you'd need the distributed version. I'm not sure how easy it is for you to change, and at this rate it's out of the scope of what SLM Lab does.
Hi~
After issue #421, I changed your code to do async sampling and training. Now I have a subprocess (P1) created by the main process (P2). P2 runs the env and samples data, then P2 passes the data to P1 via multiprocessing.Queue(); P1 puts the data into the replay buffer and trains. Because I am using "shared" mode, the global nets are optimized by the training in P1 and can also be used for sampling in P2. With this I can do async sampling and training on CPU, and I tested that it is correct. But I still want to increase the training speed, so I want to move training to the GPU. First I got a CUDA initialization error, so I refactored my code to use the 'spawn' start method. After that I get this error:
@kengz Could you please give me any help with that?