zhouyu5 opened 4 months ago
@rusty1s @DamianSzwichtenberg Could you please share some of your thoughts on this proposal?
@zhouyu5 I think you may first start with a new example at examples/multi_gpu, similar to distributed_sampling_multinode.py - you have a bit more freedom there.
@DamianSzwichtenberg Sounds good. I could provide a distributed_sampling_multinode_xpu.py file. Another question: if I want to reuse some code under the benchmark folder, like benchmark/multi_gpu/training/common.py, what is the proper way? Directly copy it into the example folder, import it from distributed_sampling_multinode_xpu.py, or any other suggestions?
I would go for a copy, unless you feel that some pieces may become building blocks for users' solutions; in that case we can put some code into utils and import it from there. Anyway, let's start with an example and adjust later. 😉
@DamianSzwichtenberg Thanks for your input, it makes sense. I will provide an example later.
@DamianSzwichtenberg Please check PR #9490, thanks.
📚 Describe the documentation issue
Currently, training_benchmark_xpu.py only supports training with multiple XPU devices on a single node. If a user wants to run it across multiple nodes, each with multiple XPU devices, the script needs some modification. However, it is non-trivial to make it work, so I would like to submit a PR to improve the user experience when launching multi-node multi-XPU training.
Suggest a potential alternative/fix
To my knowledge, the following needs modification:
- the get_dist_params() function, which is used to initialize the DDP process group.
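For illustration, here is a minimal sketch of what a multi-node-aware get_dist_params() could look like. The environment-variable names, the helper init_process_group() wrapper, and the backend fallback are assumptions made for this sketch, not the actual code in training_benchmark_xpu.py:

```python
# Hypothetical sketch of a multi-node-aware get_dist_params().
# Assumes the launcher (e.g. torchrun or an mpirun wrapper) exports RANK,
# WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT; names are illustrative.
import os

import torch.distributed as dist


def get_dist_params():
    # Global rank/world size span all nodes; the local rank indexes the XPU
    # device within a single node.
    rank = int(os.environ.get('RANK', 0))
    world_size = int(os.environ.get('WORLD_SIZE', 1))
    local_rank = int(os.environ.get('LOCAL_RANK', rank))

    master_addr = os.environ.get('MASTER_ADDR', '127.0.0.1')
    master_port = os.environ.get('MASTER_PORT', '29500')
    init_method = f'tcp://{master_addr}:{master_port}'

    return rank, world_size, local_rank, init_method


def init_process_group():
    rank, world_size, local_rank, init_method = get_dist_params()

    # The 'ccl' backend (Intel oneCCL bindings for PyTorch) is the usual
    # choice for XPU collectives; fall back to 'gloo' if it is unavailable.
    try:
        import oneccl_bindings_for_pytorch  # noqa: F401
        backend = 'ccl'
    except ImportError:
        backend = 'gloo'

    dist.init_process_group(backend=backend, init_method=init_method,
                            rank=rank, world_size=world_size)
    return rank, world_size, local_rank
```

With something along these lines, the single-node spawn path could stay unchanged, and a multi-node launcher would only have to set the environment variables on each node.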