Closed: YanjieZe closed this issue 1 year ago
@YanjieZe, the DDP code is based on this example from PyTorch.
Do you mean why the train script is hard-coded for 1 node? That is bad engineering on my part. Ideally, you should be able to spawn any number of nodes across machines, with each node making use of multiple GPUs, but my code only handles 1 node at a time.
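For reference, the standard single-node pattern from that PyTorch example looks roughly like this (a minimal sketch with illustrative names, not the exact code in `train.py`):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # One process per GPU on a single node; rank doubles as the GPU index.
    dist.init_process_group("nccl", init_method="tcp://127.0.0.1:29500",
                            rank=rank, world_size=world_size)
    model = torch.nn.Linear(10, 10).to(rank)
    # DDP keeps the replicas in sync: gradients are all-reduced across
    # processes on every backward pass.
    ddp_model = DDP(model, device_ids=[rank])
    # ... training loop using ddp_model ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # all GPUs on this one node
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

Extending this to multiple nodes would mean computing global ranks across machines, which is the part my script does not handle.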
Hope this helps!
Thank you, Mohit! The blog post clears up my confusion perfectly. Thanks again for the timely response and the nice open-source code :)
Hi Mohit, I have been using your code recently and trying to do multi-GPU training, but I find the multi-GPU and DDP usage in your code a bit hard to understand.
Specifically, `train.py` spawns multiple processes to make use of multiple GPUs, but then, in the `run_seed` function, you actually create a separate agent in each process. So, unless I misunderstand, the agents on different GPUs do not share parameters, weights, or gradients, which would seem to make multi-GPU training pointless (a rough sketch of the pattern I mean follows below). Do I understand this correctly? Could you possibly explain?
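Roughly this pattern, if I read the code right (a hypothetical sketch; `Agent` and the `run_seed` signature here are illustrative, not the repo's actual code):

```python
import torch
import torch.multiprocessing as mp

class Agent:
    # Stand-in for the repo's agent class; illustrative only.
    def __init__(self, device):
        self.model = torch.nn.Linear(10, 10).to(device)

def run_seed(rank, num_gpus):
    # Each spawned process builds its own agent on its own GPU. With no
    # DDP wrapper and no all-reduce, nothing is shared across processes:
    # each GPU trains a fully independent copy (e.g. one seed per GPU).
    agent = Agent(device=f"cuda:{rank}")
    # ... independent training loop for this process ...

if __name__ == "__main__":
    num_gpus = torch.cuda.device_count()
    mp.spawn(run_seed, args=(num_gpus,), nprocs=num_gpus)
```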
Great thanks!