kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

Add pipeline launcher components for other distributed training jobs #3445

Open Jeffwan opened 4 years ago

Jeffwan commented 4 years ago

In order to leverage the different training operators from Kubeflow Pipelines, it would be better to provide high-level launcher components as an abstraction for invoking training jobs.

katib-launcher and launcher are the launcher components for Katib and tf-operator. We definitely need similar components for PyTorch, MXNet, MPI, XGBoost, etc.

https://github.com/kubeflow/pipelines/tree/master/components/kubeflow

Ark-kun commented 4 years ago

What do you think about having generic launcher components that receive a resolved, serialized TaskSpec (or a container image plus command line) and launch the given component?

What do you think about syntax like this?

MyLauncher = load_component(...)
with dsl.use_launcher(MyLauncher(num_workers=10)):
    launched_task = XGBoostTrainer(training_data=..., num_trees=500)

or

MyLauncher = load_component(...)
launcher_for_train = MyLauncher(
    num_workers=10,
    task=XGBoostTrainer(training_data=..., num_trees=500),
)

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 3 years ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

Jeffwan commented 3 years ago

/reopen

k8s-ci-robot commented 3 years ago

@Jeffwan: Reopened this issue.

In response to [this](https://github.com/kubeflow/pipelines/issues/3445#issuecomment-729089982):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

midhun1998 commented 3 years ago

Hi @Jeffwan and @Ark-kun . I would like to contribute to this issue. Please let me know how I can be of any help. :)

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

wangli1426 commented 3 years ago

Any updates on this feature?

I believe it would be great if Kubeflow Pipelines could provide a generic launcher that creates a CRD and manages its lifespan, for example MPIJob, PyTorchJob, etc.

This requirement can be partially satisfied by using a Katib Experiment. However, as far as I know, there are some clear drawbacks to that approach.

Thus, it is desirable to have a GenericLauncher in Kubeflow Pipelines, along with an operator to manage the lifespan of the launcher pod and the created CRDs.

jalola commented 3 years ago

Hi, I am also looking for this feature, especially for PyTorch; the PR for it seems to have been paused for some time: https://github.com/kubeflow/pipelines/pull/5170

I could run distributed training using a PyTorchJob (created by a ResourceOp), but this approach has the disadvantage that the training logs do not show up in the pipeline UI; it only shows the logs of the job controller, not the worker containers.

@ca-scribner, could you please help continue the PR? Thanks a lot.

wangli1426 commented 3 years ago

@jalola Thanks for the info. Would you mind sharing an example of how to define a PyTorchJob with the help of ResourceOp? Thanks in advance.

jalola commented 3 years ago

@wangli1426 A simple example of ResourceOp: https://github.com/kubeflow/pipelines/blob/master/samples/core/resource_ops/resource_ops.py

For the PyTorchJob, see https://github.com/kubeflow/pytorch-operator/blob/master/examples/mnist/v1/pytorch_job_mnist_nccl.yaml. You can convert that YAML manifest to JSON for use in the pipeline.

Remember to set success_condition, for example: success_condition='status.replicaStatuses.Worker.succeeded==3,status.replicaStatuses.Chief.succeeded==1' (see https://github.com/kubeflow/pipelines/blob/master/samples/contrib/e2e-mnist/mnist-pipeline.ipynb).
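
Putting those pieces together, a minimal sketch with the KFP v1 SDK might look like this; the job name, image, and replica counts are placeholders, and the success/failure conditions must match your job's replica types (Master/Worker for a PyTorchJob):

```python
import kfp
from kfp import dsl

# Placeholder PyTorchJob manifest: adjust the name, image, and replica counts.
# The container must be named "pytorch" for the PyTorch operator.
PYTORCH_JOB = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "mnist-train"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {
                "replicas": 1,
                "restartPolicy": "OnFailure",
                "template": {"spec": {"containers": [
                    {"name": "pytorch", "image": "my-registry/mnist:latest"}
                ]}},
            },
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": {"spec": {"containers": [
                    {"name": "pytorch", "image": "my-registry/mnist:latest"}
                ]}},
            },
        }
    },
}

@dsl.pipeline(name="pytorchjob-via-resourceop")
def pytorch_resourceop_pipeline():
    # ResourceOp creates the custom resource and waits until one of the conditions matches.
    dsl.ResourceOp(
        name="pytorch-train",
        k8s_resource=PYTORCH_JOB,
        action="create",
        success_condition="status.replicaStatuses.Master.succeeded==1,"
                          "status.replicaStatuses.Worker.succeeded==2",
        failure_condition="status.replicaStatuses.Master.failed>0",
    )

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(pytorch_resourceop_pipeline, "pytorch_resourceop.yaml")
```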

midhun1998 commented 3 years ago

Hi @jalola. Just wondering, how can we stream all worker logs (when the number of workers > 1) into the pipeline log console? Or were you looking for just the chief's logs? Do you have any ideas in mind?

jalola commented 3 years ago

I only know that they have a client SDK to get logs. Example: https://github.com/kubeflow/pytorch-operator/blob/4aeb6503162465766476519339d3285f75ffe03e/sdk/python/examples/kubeflow-pytorchjob-sdk.ipynb

API: https://github.com/kubeflow/pytorch-operator/blob/master/sdk/python/docs/PyTorchJobClient.md#get_logs

But I don't know how to surface those logs in a pipeline component.
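
For reference, the linked get_logs call is used roughly like this (a sketch; the job name and namespace are placeholders, and the now-archived pytorch-operator SDK must be installed):

```python
from kubeflow.pytorchjob import PyTorchJobClient

# Connects using the local kubeconfig or, inside a cluster, the pod's service account.
client = PyTorchJobClient()

# Prints the logs of the job's master replica. As noted later in the thread,
# follow=True does not stream line by line; the logs arrive once training finishes.
client.get_logs("mnist-train", namespace="kubeflow", follow=True)
```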

ca-scribner commented 3 years ago

Sorry, I let this slip from my mind and now I don't have a good way to test. The requested changes were minor, though, and the code in the PR still works, if that helps. Maybe you could finish it off.

Ark-kun commented 2 years ago

> But I don't know how to surface those logs in a pipeline component.

You could just print them.

jalola commented 2 years ago

> But I don't know how to surface those logs in a pipeline component.
>
> You could just print them.

I am using the k8s_client API (Watch and read_namespaced_pod_log) to stream the logs from the training pod, and that works. PyTorchJobClient's get_logs(follow=True) does not stream the logs line by line; it returns the whole log only when the training finishes.
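
For anyone looking for the same thing, the pattern is roughly this (a sketch; the pod name, namespace, and container name are placeholders):

```python
from kubernetes import client, config, watch

# Inside the cluster; use config.load_kube_config() when running locally.
config.load_incluster_config()
core_v1 = client.CoreV1Api()

# Watch turns the log request into a follow-style stream and yields one line at a time.
w = watch.Watch()
for line in w.stream(
    core_v1.read_namespaced_pod_log,
    name="mnist-train-worker-0",  # placeholder worker pod name
    namespace="kubeflow",
    container="pytorch",
):
    # Printing from the launcher component makes the line show up in the pipeline UI.
    print(line)
```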

@Ark-kun Another problem I found when using launch_crd: if a user "terminates" the pipeline run, only the launcher pod (the one running launch_crd) is deleted, while the distributed training pods keep running. What do you think? If you can give some advice, I may implement it in https://github.com/kubeflow/pipelines/pull/5170
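
One possible mitigation (just a sketch, not what the linked PR does): have the launcher attach an ownerReference pointing at its own pod to the PyTorchJob it creates, so Kubernetes garbage-collects the training job when the launcher pod is deleted. This assumes POD_NAME and POD_NAMESPACE are injected into the launcher container via the downward API, and the image/job name below are placeholders:

```python
import os

from kubernetes import client, config

config.load_incluster_config()
core_v1 = client.CoreV1Api()

# Identify the launcher pod itself (POD_NAME/POD_NAMESPACE injected via the downward API).
pod_name = os.environ["POD_NAME"]
namespace = os.environ["POD_NAMESPACE"]
launcher_pod = core_v1.read_namespaced_pod(pod_name, namespace)

# The PyTorchJob becomes a child of the launcher pod; deleting the pod
# (e.g. when the run is terminated) garbage-collects the job and its worker pods.
owner_reference = {
    "apiVersion": "v1",
    "kind": "Pod",
    "name": launcher_pod.metadata.name,
    "uid": launcher_pod.metadata.uid,
}

# Minimal manifest for illustration only.
pytorch_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {
        "name": "mnist-train",
        "namespace": namespace,
        "ownerReferences": [owner_reference],
    },
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {
                "replicas": 1,
                "restartPolicy": "OnFailure",
                "template": {"spec": {"containers": [
                    {"name": "pytorch", "image": "my-registry/mnist:latest"}
                ]}},
            },
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace=namespace,
    plural="pytorchjobs", body=pytorch_job,
)
```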

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

dkmiller commented 1 year ago

Hi everyone, I'm quite interested in this as well. Is there any progress towards built-in support for distributed training jobs in pipelines?

bhack commented 6 months ago

Is this still on the roadmap?

github-actions[bot] commented 4 days ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.