zhouyu5 opened 4 months ago
@rusty1s @DamianSzwichtenberg Could you please share some of your thoughts on this proposal?
@zhouyu5 I think you may first start with a new example at examples/multi_gpu, similar to distributed_sampling_multinode.py - you have a bit more freedom there.
@DamianSzwichtenberg Sounds good. I could provide a distributed_sampling_multinode_xpu.py file. Another question: if I want to reuse some code under the benchmark folder, like benchmark/multi_gpu/training/common.py, what is the proper way? Directly copy it into the example folder, import it from distributed_sampling_multinode_xpu.py, or any other suggestions?
I would go for a copy, unless you feel that some pieces may become building blocks for users' solutions; in that case we can put some code into utils and import it from there. Anyway, let's start with an example and adjust later. 😉
@DamianSzwichtenberg Thanks for your input, it makes sense. I will provide an example later.
@DamianSzwichtenberg Please check PR #9490, thanks.
📚 Describe the documentation issue
Currently, training_benchmark_xpu.py only supports training with multiple XPU devices on a single node. If a user wants to run it across multiple nodes, each with multiple XPU devices, the script needs some modification. However, it is non-trivial to make it work, so I would like to submit a PR to improve the user experience when launching multi-node multi-XPU training.
Suggest a potential alternative/fix
To my knowledge, the following needs modification:
- the get_dist_params() function, which is used to initialize the DDP process group.
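For illustration, here is a minimal sketch of what a multi-node-aware get_dist_params() could look like. The environment-variable names, the helper init_process_group() wrapper, and the backend fallback are assumptions made for this sketch, not the actual code in training_benchmark_xpu.py:

```python
# Hypothetical sketch of a multi-node-aware get_dist_params().
# Assumes the launcher (e.g. torchrun or an mpirun wrapper) exports RANK,
# WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT; names are illustrative.
import os

import torch.distributed as dist


def get_dist_params():
    # Global rank/world size span all nodes; the local rank indexes the XPU
    # device within a single node.
    rank = int(os.environ.get('RANK', 0))
    world_size = int(os.environ.get('WORLD_SIZE', 1))
    local_rank = int(os.environ.get('LOCAL_RANK', rank))

    master_addr = os.environ.get('MASTER_ADDR', '127.0.0.1')
    master_port = os.environ.get('MASTER_PORT', '29500')
    init_method = f'tcp://{master_addr}:{master_port}'

    return rank, world_size, local_rank, init_method


def init_process_group():
    rank, world_size, local_rank, init_method = get_dist_params()

    # The 'ccl' backend (Intel oneCCL bindings for PyTorch) is the usual
    # choice for XPU collectives; fall back to 'gloo' if it is unavailable.
    try:
        import oneccl_bindings_for_pytorch  # noqa: F401
        backend = 'ccl'
    except ImportError:
        backend = 'gloo'

    dist.init_process_group(backend=backend, init_method=init_method,
                            rank=rank, world_size=world_size)
    return rank, world_size, local_rank
```

With something along these lines, the single-node spawn path could stay unchanged, and a multi-node launcher would only have to set the environment variables on each node.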