andreyvelich closed 4 months ago
/assign @tenzen-y @kuizhiqing
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: andreyvelich, diegolovison
/lgtm
One minor change I would prefer: the indices in the diagrams could count from 0, since the indices for workers and for data in memory both begin at 0. It is OK with me if you keep it as is; this is just my personal taste.
I agree with you @kuizhiqing; even NCCL ranks start from 0 (see https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html ). I will change it.
Thanks for the review, @diegolovison! I addressed your comments.
/hold cancel
/assign @kuizhiqing @tenzen-y @johnugeorge @hbelmiro @johnugeorge
@andreyvelich: GitHub didn't allow me to assign the following users: hbelmiro.
Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide
Related: https://github.com/kubeflow/training-operator/issues/1998
I added two diagrams: one for PyTorch distributed training with the ring all-reduce algorithm, and one for TensorFlow distributed training with parameter servers (PS).
They give an idea of what the Training Operator is responsible for.
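For anyone unfamiliar with ring all-reduce, here is a minimal pure-Python simulation of the idea, with ranks and chunk indices counted from 0. It is only a sketch under simplifying assumptions (sum reduction, vector length divisible by the number of workers, everything in one process); the name `ring_all_reduce` is hypothetical and this is not how PyTorch or NCCL implement it.

```python
def ring_all_reduce(worker_data):
    """Simulate a sum ring all-reduce; worker_data[r] is the vector on rank r."""
    n = len(worker_data)
    length = len(worker_data[0])
    assert length % n == 0, "sketch assumes length divisible by worker count"
    size = length // n

    # chunks[r][c] is chunk c as currently held by rank r.
    chunks = [
        [list(vec[c * size:(c + 1) * size]) for c in range(n)]
        for vec in worker_data
    ]

    # Phase 1 (scatter-reduce): after n - 1 steps, rank r holds the fully
    # reduced chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot payloads first so all sends in a step happen "at once".
        sends = [(r, (r - step) % n) for r in range(n)]
        payloads = [chunks[r][c][:] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n  # next rank in the ring
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # Phase 2 (all-gather): pass the reduced chunks around the ring so every
    # rank ends up with all of them.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        payloads = [chunks[r][c][:] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            chunks[dst][c] = payload

    # Flatten each rank's chunks back into a single vector.
    return [[x for chunk in rank for x in chunk] for rank in chunks]


result = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

Each rank only ever talks to its next neighbor in the ring, which is the property the PyTorch diagram is meant to illustrate.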
Please take a look @kubeflow/wg-training-leads @diegolovison @hbelmiro @kubeflow/release-team
/hold for review