pytorch-labs / gpt-fast

Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
BSD 3-Clause "New" or "Revised" License

Set cuda device before init_process_group #56

Closed. yifuwang closed this issue 6 months ago.

yifuwang commented 6 months ago

To leverage the low latency intra-node comm in c10d (https://github.com/pytorch/pytorch/pull/114001), torch.cuda.set_device() needs to be invoked before init_process_group().
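For reference, the proposed ordering looks roughly like this (a minimal sketch, assuming a torchrun-style launch that sets LOCAL_RANK and the NCCL backend):

```python
import os
import torch
import torch.distributed as dist

# torchrun sets LOCAL_RANK (along with RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT)
local_rank = int(os.environ["LOCAL_RANK"])

# Bind this process to its GPU *before* initializing the process group, so the
# intra-node comm path (enabled via ENABLE_INTRA_NODE_COMM=1) can pick up the
# device during initialization.
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")
```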

carmocca commented 6 months ago

Hi @yifuwang!

Is this change something that you would recommend in general? In every resource I've seen online, set_device has been placed after init_process_group():

Could you elaborate on why this is necessary with ENABLE_INTRA_NODE_COMM, and what the differences are (if any) between setting the device before or after?

Thank you!

yifuwang commented 6 months ago

Hey @carmocca,

Is this change something that you would recommend in general?

Without ENABLE_INTRA_NODE_COMM, I don't think it matters so long as you set the correct device before the first collective. There are instances of set_device before init_process_group in the links you posted (e.g. https://pytorch.org/docs/stable/distributed.html#launch-utility and https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel).
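For example, something like this (a rough sketch, again assuming a torchrun-style launch that sets LOCAL_RANK) should also work fine without ENABLE_INTRA_NODE_COMM:

```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])

# Conventional order: init the process group first, then set the device,
# as long as the device is set before the first collective is issued.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(local_rank)

t = torch.ones(1, device="cuda")
dist.all_reduce(t)  # first collective; the correct device is already set
```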

Could you elaborate why is this necessary with ENABLE_INTRA_NODE_COMM

Technically it's not a hard requirement. It's just that the feature is still new and experimental, and we're still figuring out the UX. I'm curious whether this constraint is causing issues in your project beyond the inconvenience. Thanks!

carmocca commented 6 months ago

Let me run Lightning's CI with the order changed to see if any issues pop up.