kubeflow / website

Kubeflow's public website
Creative Commons Attribution 4.0 International

Training: Add Distributed Training Diagrams #3678

Closed andreyvelich closed 4 months ago

andreyvelich commented 4 months ago

Related: https://github.com/kubeflow/training-operator/issues/1998

I added two diagrams: one for PyTorch distributed training with the ring all-reduce algorithm, and one for TensorFlow distributed training with parameter servers (PS).

They give an idea of what the Training Operator is responsible for.

Please take a look @kubeflow/wg-training-leads @diegolovison @hbelmiro @kubeflow/release-team

/hold for review

andreyvelich commented 4 months ago

/assign @tenzen-y @kuizhiqing

google-oss-prow[bot] commented 4 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, diegolovison

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[content/en/docs/components/training/OWNERS](https://github.com/kubeflow/website/blob/master/content/en/docs/components/training/OWNERS)~~ [andreyvelich]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment

terrytangyuan commented 4 months ago

/lgtm

andreyvelich commented 4 months ago

> A minor change I would prefer: the indices in the diagrams could count from 0, since the indices for workers and for data in memory begin with 0. It is OK with me if you just keep it as is; it is only my personal taste.

I agree with you @kuizhiqing; even NCCL ranks start from 0: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html

I will change it.
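To make the 0-based indexing concrete, here is a minimal sketch of ring all-reduce (sum) simulated in a single Python process. It is only an illustration under stated assumptions: it does not use NCCL or PyTorch, the function name `ring_all_reduce` is hypothetical, and each worker's data is assumed to be pre-split into as many chunks as there are workers (one number per chunk).

```python
def ring_all_reduce(worker_data):
    """Simulate a sum ring all-reduce across n workers in one process.

    worker_data[r] is the list of n chunks held by rank r.
    Ranks and chunk indices both start from 0.
    Hypothetical helper for illustration only; not a real library API.
    """
    n = len(worker_data)
    data = [list(chunks) for chunks in worker_data]  # do not mutate the input

    # Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) % n
    # to its ring neighbour (r + 1) % n, which adds it to its own copy.
    for s in range(n - 1):
        # Snapshot all sends first so every rank "sends" simultaneously.
        sends = [(r, (r - s) % n, data[r][(r - s) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] += value

    # Now rank r holds the fully reduced chunk (r + 1) % n.
    # Phase 2, all-gather: in step s, rank r forwards chunk (r + 1 - s) % n
    # to its neighbour, which overwrites its stale copy.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, data[r][(r + 1 - s) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] = value

    return data
```

Calling it with three workers holding `[1, 2, 3]`, `[10, 20, 30]`, and `[100, 200, 300]` leaves every rank with the same element-wise sum, after `2 * (n - 1)` communication steps in total.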

andreyvelich commented 4 months ago

Thanks for the review, @diegolovison! I addressed your comments.

andreyvelich commented 4 months ago

/hold cancel
/assign @kuizhiqing @tenzen-y @johnugeorge @hbelmiro @johnugeorge

google-oss-prow[bot] commented 4 months ago

@andreyvelich: GitHub didn't allow me to assign the following users: hbelmiro.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information, please see the contributor guide.

In response to [this](https://github.com/kubeflow/website/pull/3678#issuecomment-1948595410):

>/hold cancel
>/assign @kuizhiqing @tenzen-y @johnugeorge @hbelmiro @johnugeorge

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.