isalirezag opened this issue 5 years ago
@SethHWeidman - Can you review and see if this topic should be added to the Data Parallel tutorial?
I think the overall tutorial would benefit from being reworked a bit. I am a complete beginner in data parallelism and I don't understand a thing in the Getting Started tutorial.
What should I write instead of `localhost` and `12355` in:

```python
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'
```

What do these lines do, and do I even need to include them in my code?
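For context, here is the snippet I'm asking about, with the surrounding `setup`/`cleanup` functions (my reconstruction of the tutorial's code; the comments reflect my current understanding, so please correct me if they're wrong):

```python
import os
import torch.distributed as dist

def setup(rank, world_size):
    # MASTER_ADDR/MASTER_PORT tell every process where to find the
    # rendezvous point: an address of the machine that hosts rank 0.
    # On a single machine 'localhost' works, and '12355' is just an
    # arbitrary free port.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # All world_size processes block here until every one has joined.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()
```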
Also, the `setup(rank, world)` function takes `rank` and `world` as arguments, but it's never clearly explained what `rank` and `world` are, even though they seem super important based on what I read in the Writing Distributed Applications with PyTorch tutorial (which I also don't understand much of). What rank should I pass, and what is a rank?
The same questions go for the `demo_basic(rank, world_size)` function ...
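For completeness, here is the full flow I'm trying to follow, reusing the `setup`/`cleanup` above (a minimal sketch of the tutorial's `demo_basic`; the toy `nn.Linear` model and `world_size = 2` are placeholders I picked):

```python
import torch
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

def demo_basic(rank, world_size):
    # rank is this process's index, 0 .. world_size - 1; with one
    # process per GPU it doubles as the GPU index for this process.
    setup(rank, world_size)

    model = nn.Linear(10, 5).to(rank)         # this replica lives on GPU `rank`
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10).to(rank))
    labels = torch.randn(20, 5).to(rank)
    loss_fn(outputs, labels).backward()        # DDP all-reduces gradients here
    optimizer.step()

    cleanup()

if __name__ == "__main__":
    world_size = 2                             # one process per GPU
    # mp.spawn launches world_size processes and passes each its rank
    # as the first argument, so rank is never passed by hand.
    mp.spawn(demo_basic, args=(world_size,), nprocs=world_size, join=True)
```

If I read this right, the answer to "what rank should I pass" is: none; `mp.spawn` generates ranks 0 through `world_size - 1` itself.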
@johncwok I am facing the same issue right now as a beginner in DDP-based training :\ . If you have figured this out, can you please explain in detail what MASTER_ADDR, MASTER_PORT, rank, and world mean? I understand `world_size`, which basically means `number_of_nodes * num_gpus_in_each_node`, but what about rank and the ADDR/PORT?
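Here is my current guess at the rank arithmetic, in case someone can confirm or correct it (the `global_rank` helper is hypothetical, just to show the formula):

```python
def global_rank(node_rank, local_rank, gpus_per_node):
    # One process per GPU:
    #   node_rank  : index of the machine, 0 .. num_nodes - 1
    #   local_rank : index of the process/GPU on that machine
    # e.g. 2 nodes x 4 GPUs, node 1, GPU 2 -> 1 * 4 + 2 = rank 6
    return node_rank * gpus_per_node + local_rank

# world_size is then num_nodes * gpus_per_node, and ranks run
# 0 .. world_size - 1. MASTER_ADDR must be an address of the node
# running rank 0, reachable from every other node ('localhost' only
# works for single-machine jobs); MASTER_PORT is any free port there.
```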
I am using the references below for my setup.
Any help would be highly appreciated, thanks!
In the data parallel tutorial (link), it would be useful if you could add how to handle the loss function when using multiple GPUs. The usual naive approach causes unbalanced GPU memory usage.
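One workaround I have seen suggested (a rough sketch, not from the tutorial; it assumes at least one CUDA GPU and uses `nn.CrossEntropyLoss` and a toy `nn.Linear` as stand-ins): compute the loss inside the wrapped module's `forward`, so each replica reduces its outputs to a scalar on its own GPU and only those scalars are gathered onto GPU 0:

```python
import torch
import torch.nn as nn

class ModelWithLoss(nn.Module):
    # With plain nn.DataParallel, the naive pattern gathers every
    # replica's full output tensor onto GPU 0 and computes the loss
    # there, which inflates GPU 0's memory. Returning the loss from
    # forward() means only per-replica scalars get gathered.
    def __init__(self, model):
        super().__init__()
        self.model = model
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, inputs, targets):
        return self.criterion(self.model(inputs), targets)

model = nn.DataParallel(ModelWithLoss(nn.Linear(10, 3)).cuda())
inputs = torch.randn(8, 10).cuda()
targets = torch.randint(0, 3, (8,)).cuda()
loss = model(inputs, targets).mean()   # average the per-replica losses
loss.backward()
```

This keeps the large activations on their own devices; only the tiny loss values travel to the default GPU.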