pytorch / tutorials

PyTorch tutorials.
https://pytorch.org/tutorials/

improve pytorch tutorial for Data Parallelism #553

Open isalirezag opened 5 years ago

isalirezag commented 5 years ago

In the data parallel tutorial (link), it would be useful to add how to handle the loss function when using multiple GPUs. The usual naive approach causes unbalanced GPU memory usage.
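One common way to mitigate that imbalance is sketched below; it is not from the tutorial, and the `ModelWithLoss` wrapper, placeholder model, and criterion are illustrative. The idea is to compute the loss inside the module wrapped by `nn.DataParallel`, so each replica returns only a small loss tensor instead of its full output.

```python
# Illustrative sketch (not from the tutorial): compute the loss inside the
# module wrapped by nn.DataParallel so each replica returns only a small
# loss tensor instead of its full output.
import torch
import torch.nn as nn

class ModelWithLoss(nn.Module):
    def __init__(self, model, criterion):
        super().__init__()
        self.model = model
        self.criterion = criterion

    def forward(self, inputs, targets):
        outputs = self.model(inputs)
        # The loss is computed on each replica's GPU; only these small loss
        # tensors are gathered back to the default device.
        return self.criterion(outputs, targets)

if __name__ == "__main__":
    model = nn.Linear(10, 2)            # placeholder model
    criterion = nn.CrossEntropyLoss()
    wrapped = nn.DataParallel(ModelWithLoss(model, criterion)).cuda()

    inputs = torch.randn(64, 10).cuda()
    targets = torch.randint(0, 2, (64,)).cuda()
    loss = wrapped(inputs, targets).mean()  # average the per-GPU losses
    loss.backward()
```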

jlin27 commented 5 years ago

@SethHWeidman - Can you review and see if this topic should be added to the Data Parallel tutorial?

CNelias commented 4 years ago

I think the overall tutorial would benefit from being reworked a bit. I am a complete beginner in data parallelism and I don't understand a thing in the Getting Started tutorial.

What should I write instead of localhost and 12355 in:

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'

What does this do, and do I even need to include it in my code?

Also, the setup(rank, world) function takes rank and world as arguments, but it's never clearly explained what rank and world are, even though they seem super important based on what I read in the Writing Distributed Applications with PyTorch tutorial (which I also don't understand much of). What rank should I pass, and what is a rank? The same question goes for the demo_basic(rank, world_size) function ...
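For reference, a minimal sketch of what such a setup function typically looks like, assuming the standard torch.distributed API; the backend, address, and port here are illustrative:

```python
# Minimal sketch of a typical DDP setup (illustrative, single-machine case).
import os
import torch.distributed as dist

def setup(rank, world_size):
    # MASTER_ADDR / MASTER_PORT tell every process where process 0 can be
    # reached so they can rendezvous; on one machine, 'localhost' plus any
    # free port works.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    # rank: this process's index, from 0 to world_size - 1.
    # world_size: total number of processes, e.g. num_nodes * gpus_per_node.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()
```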

rushi-the-neural-arch commented 2 years ago

@johncwok I am facing the same issue right now as a beginner in DDP-based training :\ . If you have figured this out, can you please explain MASTER_ADDR, MASTER_PORT, rank, and world in detail?

I understand world_size, which basically means number_of_nodes * num_gpus_in_each_node, but what about rank and the ADDR/PORT?

I am using the references below for the setup:

  1. https://github.com/Lyken17/Efficient-PyTorch/blob/master/main.py
  2. https://yangkky.github.io/2019/07/08/distributed-pytorch-tutorial.html
  3. https://github.com/yangkky/distributed_tutorial/blob/master/src/mnist-mixed.py
  4. https://github.com/Chanakya-School-of-AI/pytorch-tutorials/blob/master/distributed_data_parallel/dist_train.py

Any help would be highly appreciated. Thanks!
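In case it helps, a minimal sketch of how rank is commonly derived when spawning one process per GPU with torch.multiprocessing; this assumes a single node, and the argument names are illustrative:

```python
# Illustrative sketch: one process per GPU, single node.
import torch
import torch.multiprocessing as mp

def main_worker(local_rank, node_rank, num_nodes, gpus_per_node):
    # mp.spawn passes the process index (0 .. nprocs - 1) as the first argument.
    # Across multiple nodes the global rank is node_rank * gpus_per_node + local_rank;
    # on a single node it equals local_rank.
    rank = node_rank * gpus_per_node + local_rank
    world_size = num_nodes * gpus_per_node
    print(f"local_rank={local_rank} -> rank={rank} of world_size={world_size}")
    # setup(rank, world_size), DistributedDataParallel wrapping, and the
    # training loop would go here.

if __name__ == "__main__":
    gpus_per_node = torch.cuda.device_count() or 1  # fall back to 1 process without GPUs
    mp.spawn(main_worker, args=(0, 1, gpus_per_node), nprocs=gpus_per_node)
```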