kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.56k stars 680 forks source link

Support mpi jobs in universal operator #1345

Closed Jeffwan closed 2 years ago

Jeffwan commented 3 years ago

We've migrated pytorch, mxnet(byteps), xgboost (lightbgm). The next step is to consider moving mpi apis and controllers to this repo. (for people who are not familiar with the background, check https://docs.google.com/document/d/1x1JPDQfDMIbnoQRftDH1IzGU0qvHGSU4W6Jl4rJLPhI/edit#heading=h.e33ufidnl8z6)

The reason we didn't make it in the first draft is the code structure of mpi-operator is a little bit different from rest operators.

However, we do see strong needs to adopt it in universal operator so users can easily leverage it to run horovod and horovod elastic jobs. Another big motivation is DGL support can be built on top of MPI-operator. This would be a big gain for graph users.

/cc @kubeflow/wg-training-leads @ryantd @carmark

Jeffwan commented 3 years ago

/cc @alculquicondor

gaocegege commented 3 years ago

cc @zw0610

Jeffwan commented 3 years ago

@zw0610 had some discussion with @carmark and @ryantd in the past. Do you have the conclusion? BTW, what's the plan for v2 MPI? Is it ok to adopt v2 directly?

zw0610 commented 3 years ago

On the last community meeting, I've explained that internally we prefer using the APIs that are already production-ready. From my understanding, the v2 API as long as its controller is still under completion and has not been widely adopted by the community users. Merging v2 version directly may not benefit many users who persist with v1 API.

alculquicondor commented 3 years ago

I estimate the v2 controller and API will be ready in a month at most. Then we can produce a release.

The new controller has more coverage at unit, integration and e2e level, so it's in good track to be production-ready :)

alculquicondor commented 3 years ago

Also, I'm interested in helping with the migration of mpi-operator to the universal operator, but only after the v2 release. I don't think I would be able to implement everything myself, but I can at the very least help with reviews.

johnugeorge commented 3 years ago

How about getting v2 API directly in the universal operator while we continue using v1 from current MPI repo? This way, both will be isolated and switching from v1 to v2 will be easy?

alculquicondor commented 3 years ago

I vote yes to that. We can migrate only the v2 API. But we are not ready yet.

Jeffwan commented 3 years ago

@johnugeorge

@zw0610 proposed to merge two version in this repo. Since mpi uses different API versions, it's ok for them to coexist. There're lots of v1 users as I know and they will not immediately migrate to v2 because of both operator and api changes

alculquicondor commented 3 years ago

One problem I see is that we don't have a conversion webhook. The v2 controller can't support v1 APIs, and even if it could, it cannot handle the missing resources that jobs created with a v1 controller wouldn't have (like the workers Service). I think it's reasonable to only have support for v2. If users don't want to upgrade yet, they can keep using the separate v1 controller.

Jeffwan commented 3 years ago

One problem I see is that we don't have a conversion webhook. The v2 controller can't support v1 APIs, and even if it could, it cannot handle the missing resources that jobs created with a v1 controller wouldn't have (like the workers Service). I think it's reasonable to only have support for v2. If users don't want to upgrade yet, they can keep using the separate v1 controller.

I mean two apis + two controllers. Just treat mpiv2 as a separate framework. In that case, we don't necessarily need conversions in that case.

alculquicondor commented 3 years ago

I don't see much value from adding the two controllers, considering the extra effort.

alculquicondor commented 3 years ago

I think we should take this as an opportunity to encourage users to upgrade and simplify their infrastructure.

gaocegege commented 3 years ago

Is there any progress? Recently some users expect that horovod can be supported in the universal operator.

gaocegege commented 3 years ago

/cc @qiankunli

carmark commented 2 years ago

Our team have been working on this integration, hope that we can submit a draft PR next week.

alculquicondor commented 2 years ago

Are the integration and E2E tests being migrated as well?

terrytangyuan commented 2 years ago

https://github.com/kubeflow/tf-operator/pull/1413#issuecomment-925824047

From @alculquicondor

That sounds like a maintenance nightmare. The new controller supports the same features as the old controller and more, it's more robust and has greater coverage. I don't see why we should migrate the old controller. If people really want to use it, they can refer to the implementation in the old repository.

Let's continue this conversion here.

Is everyone okay with ONLY migrating the NEW V2 controller? I can see things will be hard to maintain if we migrate two versions. The old version will be deprecated at some point in the future anyways so we will have to come up with a migration/deprecation plan and schedule. Any functionality we are missing in v2?

alculquicondor commented 2 years ago

First reference question: I suppose the first kubeflow version to have the migrated controller would be at least 1.5. When is it supposed to be released?

zw0610 commented 2 years ago

#1413 (comment)

From @alculquicondor

That sounds like a maintenance nightmare. The new controller supports the same features as the old controller and more, it's more robust and has greater coverage. I don't see why we should migrate the old controller. If people really want to use it, they can refer to the implementation in the old repository.

Let's continue this conversion here.

Is everyone okay with ONLY migrating the NEW V2 controller? I can see things will be hard to maintain if we migrate two versions. The old version will be deprecated at some point in the future anyways so we will have to come up with a migration/deprecation plan and schedule. Any functionality we are missing in v2?

No. I still see many users using the v1 API. Forcing users to adopt v2 API seems not a good choice for our internal users and external customers. As I brought in community meeting before, it will be great to provide solid user cases for v2 API at scale, which I can use as examples to our users showing the v2 controller is production ready.

zw0610 commented 2 years ago

Meanwhile, I think one of the benefits we use one-manager-multi-contorller mode is we can support multiple controllers in one repository. To me, mpi-controller v1 and mpi-controller v2 are just two controllers, which does not need to be maintained by the same group of contributors. I do not fully make sense of the exclusive decision to keep one instead of both.

terrytangyuan commented 2 years ago

Seems like we are facing the chicken or the egg problem. We need adoptions to show production readiness but production readiness comes from a wider range of testing from users.

@alculquicondor You mentioned the maintenance efforts for two controllers. I think at least @zw0610's team can help maintain v1 controller. Any other concerns?

alculquicondor commented 2 years ago

No. I still see many users using the v1 API.

They can still use the v1 API. We just have to provide a conversion webhook. That seems like a much better direction to me. Still, without the conversion webhook, the jobs mostly work, except for a couple of fields.

alculquicondor commented 2 years ago

Depending on the date for the kubeflow 1.5 release, we can have this plan:

terrytangyuan commented 2 years ago

FYI I added this to Oct 6th AutoML and Training WG Meeting agenda to discuss. cc @andreyvelich

andreyvelich commented 2 years ago

Thank you @terrytangyuan. @alculquicondor @zw0610 Please try to attend this meeting. This is the meeting notes: https://bit.ly/2PWVCkV

alculquicondor commented 2 years ago

Thanks, I'll be there

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.