kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

refactor the TfJob to use Informer and Controller #206

Closed wackxu closed 6 years ago

wackxu commented 6 years ago

Kubernetes provides many useful tools that we can use to develop our project, such as sharedInformer, clientSet, lister, and so on. Using these makes our project more readable and gives better performance. We also need to version the API, which makes it easier to stay backwards compatible and keeps the code closer to the Kubernetes conventions. We can refactor the project along the lines of https://github.com/kubernetes/sample-controller. I have already done some work on this. What do you think about it? @jlewi

jlewi commented 6 years ago

In general, I'd much prefer to reuse existing, well-maintained libraries than custom code. So I'm all for replacing custom code with a more established library.

Are there specific changes you are thinking of making?

gaocegege commented 6 years ago

Hi, we (caicloud) are implementing a controller for Kubeflow that is similar to kubernetes/sample-controller, and we would be glad to file a proposal for the new design soon.

wackxu commented 6 years ago

@gaocegege I am so sorry, but I have almost finished it already and will submit a PR ASAP.

wackxu commented 6 years ago

@jlewi We can make the change step by step. The first phase mainly includes the following:

1. Refactor the spec package to follow the standard Kubernetes API conventions, making the CRD definition versioned so it is easier to maintain and keep backwards compatible.
2. Import code-generator to generate the sharedInformer, lister, clientSet, and so on, which we will then use to refactor the controller.
3. Use the sharedInformer to list and watch CRDs. It caches the CRD data, so it is more efficient, needs fewer lines of code, and is more readable.
4. Replace the client with the generated client, and so on.
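For illustration, here is a minimal sketch of how the generated clientset, shared informer, and lister could be wired into the controller once code-generator is in place. The import paths and the `Tensorflow().V1alpha1().TFJobs()` accessor are assumptions about the generated code, not taken from the actual PR:

```go
package main

import (
	"time"

	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/workqueue"

	// Illustrative import paths for the code-generator output; the real
	// packages would live under this project's pkg/client directory.
	clientset "github.com/tensorflow/k8s/pkg/client/clientset/versioned"
	informers "github.com/tensorflow/k8s/pkg/client/informers/externalversions"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}

	// Generated, typed clientset for the TfJob CRD.
	tfJobClient := clientset.NewForConfigOrDie(cfg)

	// The shared informer factory caches TfJob objects and serves listers,
	// so the controller does not hit the API server on every sync.
	factory := informers.NewSharedInformerFactory(tfJobClient, 30*time.Second)
	tfJobInformer := factory.Tensorflow().V1alpha1().TFJobs()

	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	// Event handlers only enqueue object keys; the reconcile loop later
	// reads the object back from the lister (the informer's cache).
	tfJobInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
		UpdateFunc: func(old, new interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(new); err == nil {
				queue.Add(key)
			}
		},
	})

	stopCh := make(chan struct{})
	factory.Start(stopCh)
	cache.WaitForCacheSync(stopCh, tfJobInformer.Informer().HasSynced)

	// A real controller would now run worker goroutines that pop keys from
	// the queue and look objects up via tfJobInformer.Lister().
	select {}
}
```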

ddysher commented 6 years ago

@wackxu nice work!

Are you targeting at refactoring the code base only, or do you also plan to change the API?

jlewi commented 6 years ago

That's awesome.

Are you planning on submitting a single PR or splitting it up into multiple PRs?

gaocegege commented 6 years ago

@wackxu It doesn't matter, it is awesome :smile:

jlewi commented 6 years ago

Reopening this because it was autoclosed but I think there are PRs still pending related to this.

gaocegege commented 6 years ago

#234 refactors the logic, but there are some other things that should be done to migrate to the new controller design.

We implemented a simple controller, https://github.com/caicloud/kubeflow-controller, and we are glad to help upstream with the rest of the work.

jlewi commented 6 years ago

@gaocegege Merging the caicloud work seems like a great idea; I think the caicloud controller has many advantages over the current implementation. A couple of questions/comments:

  1. https://github.com/caicloud/kubeflow-controller seems to distinguish between a local job and non local job. What's the difference?

  2. How does the caicloud controller determine when a job is done?

  3. Could you explain the proposed design? Here are some questions, but I think it might be better if you could write up an explanation that could eventually be included in our docs, as opposed to answering the questions directly.

    • It looks like it's event-driven. Can you explain this? For example, is the controller handling events for the sub-resources (e.g. pods and services) and not just TfJob events?

    • It looks like events are created by comparing current status to expectations. Can you explain how this works in terms of deciding whether a pod needs to be created?

    • How do you deal with the fact that workers and pods are stateful?

  4. Looks like the caicloud controller is directly managing pods. I think that is probably better than using job controllers (see #45).

  5. Looks like the caicloud controller is publishing Kubernetes events, which is definitely something we should do.

wackxu commented 6 years ago

@jlewi @gaocegege The caicloud kubeflow-controller is similar to the Kubernetes job controller, and it is a better implementation. I think we can do that in the next PR, and I am very glad to help review it. That change would be quite big, so I think we should first merge https://github.com/tensorflow/k8s/pull/234, which mainly changes the structure of the controller and cmd packages to match the style of the Kubernetes controllers. In the next PR we can work on the reconcile logic like the caicloud kubeflow-controller.

ghost commented 6 years ago

@jlewi @wackxu

Thanks for your review. I will write a design doc ASAP and file an issue to discuss the differences between the caicloud controller and the upstream controller. And I agree with @wackxu that we should focus on #234 and keep moving.

https://github.com/caicloud/kubeflow-controller seems to distinguish between a local job and non-local job. What's the difference?

There are some changes in the caicloud TFJob CRD; @DjangoPeng is summarising them and will file an issue to discuss it.

How does the caicloud controller determine when a job is done?

Right now it is simple: the controller checks the statuses of the workers, and if all workers have succeeded, we consider the TFJob succeeded.
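For concreteness, a minimal sketch of that rule, using an illustrative placeholder type for the per-worker status (the actual caicloud controller types may differ):

```go
package controller

// WorkerStatus is an illustrative placeholder for the per-worker status
// tracked by the controller; the real caicloud types may differ.
type WorkerStatus struct {
	Name  string
	Phase string // e.g. "Running", "Succeeded", "Failed"
}

// jobSucceeded implements the simple rule described above: the TFJob is
// considered succeeded only when every worker has succeeded.
func jobSucceeded(workers []WorkerStatus) bool {
	if len(workers) == 0 {
		return false
	}
	for _, w := range workers {
		if w.Phase != "Succeeded" {
			return false
		}
	}
	return true
}
```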

Could you explain the proposed design? Here are some questions, but I think it might better if you could write up an explanation that could eventually be included in our docs as opposed to answering the questions directly.

Yeah, I will write a design doc and post in this issue.

Looks like the caicloud controller is directly managing pods. I think that is probably better than using job controllers (see #45).

We have an internal discussion about this too, in https://github.com/caicloud/kubeflow-controller/issues/71. We have not reached a conclusion yet, and comments are welcome :-)

Personally, I think that if most of the use cases can be handled by Job, then Job is better, since we do not have to manage the whole lifecycle of the pods ourselves; that is already implemented in the job controller.
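To make the trade-off concrete, here is a generic sketch of the Job-based alternative, where the controller creates one batch/v1 Job per worker and lets the built-in job controller handle pod restarts and cleanup. The helper name and template contents are illustrative, and it assumes the older client-go Create signature without a context:

```go
package controller

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createWorkerJob delegates pod lifecycle management to the built-in job
// controller by wrapping the worker container in a batch/v1 Job.
func createWorkerJob(kubeClient kubernetes.Interface, namespace, name, image string) error {
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: batchv1.JobSpec{
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyOnFailure,
					Containers: []corev1.Container{
						{Name: "tensorflow", Image: image},
					},
				},
			},
		},
	}
	_, err := kubeClient.BatchV1().Jobs(namespace).Create(job)
	return err
}
```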

Looks like caicloud controller is publishing Kubernetes events which is definitely something we should do

Yeah, and I think @zjj2wry is implementing that feature for tensorflow/k8s: https://github.com/tensorflow/k8s/pull/260.

For now we have limited support in the caicloud controller: the recorder records the creation of pods and services.

For example, events when we create 4 workers and 2 PS:

Events:
  Type    Reason            Age                From                 Message
  ----    ------            ----               ----                 -------
  Normal  SuccessfulCreate  33s                kubeflow-controller  Created service: dist-training-job-worker-0-d88wg
  Normal  SuccessfulCreate  32s                kubeflow-controller  Created service: dist-training-job-worker-1-hwl26
  Normal  SuccessfulCreate  32s                kubeflow-controller  Created service: dist-training-job-worker-2-jx7g7
  Normal  SuccessfulCreate  32s                kubeflow-controller  Created service: dist-training-job-worker-3-gdb66
  Normal  SuccessfulCreate  32s                kubeflow-controller  Created service: dist-training-job-ps-0-nlfdp
  Normal  SuccessfulCreate  32s                kubeflow-controller  Created service: dist-training-job-ps-1-jw7c4
  Normal  SuccessfulCreate  32s                kubeflow-controller  Created pod: dist-training-job-2j9br
  Normal  SuccessfulCreate  32s                kubeflow-controller  Created pod: dist-training-job-pm85l
  Normal  SuccessfulCreate  31s                kubeflow-controller  Created pod: dist-training-job-5rgvh
  Normal  SuccessfulCreate  29s (x3 over 31s)  kubeflow-controller  (combined from similar events): Created pod: dist-training-job-n48d7

And we could support more events if we need.
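For reference, a minimal sketch of how a controller typically publishes such events with client-go's record.EventRecorder; this is generic client-go usage, not code taken from the caicloud controller:

```go
package controller

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/record"
)

// newRecorder wires an EventRecorder that writes events through the API
// server, attributed to the kubeflow-controller component.
func newRecorder(kubeClient kubernetes.Interface) record.EventRecorder {
	broadcaster := record.NewBroadcaster()
	broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{
		Interface: kubeClient.CoreV1().Events(""),
	})
	return broadcaster.NewRecorder(scheme.Scheme,
		corev1.EventSource{Component: "kubeflow-controller"})
}

// recordPodCreated emits a Normal/SuccessfulCreate event on the owning
// object (e.g. the TFJob), matching the output shown above.
func recordPodCreated(recorder record.EventRecorder, owner runtime.Object, podName string) {
	recorder.Eventf(owner, corev1.EventTypeNormal, "SuccessfulCreate",
		"Created pod: %s", podName)
}
```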

DjangoPeng commented 6 years ago

@jlewi Thanks for your reply. I'd like to give a supplement.

There are some changes in caicloud TFJob CRD, and @DjangoPeng is summarising them and he will file an issue to discuss it.

On the caicloud Clever platform, we support not only TensorFlow jobs but also batch jobs, services, and others. As for the open-source work, we'll eventually merge our TFJob CRD with the upstream one. It's necessary to file an issue to discuss the details, and I'll do it after my company work.

How does the caicloud controller determine when a job is done?

The design and lifecycle of a local job are deterministic and easy to understand, because a local job is similar to a single batch job. But a distributed job is more complicated, and its status depends on the user, so we have to discuss the strategy and logic.

Looks like the caicloud controller is directly managing pods. I think that is probably better than using job controllers (see #45).

We're still discussing which to use. Personally, I prefer Pod. Ref: https://github.com/caicloud/kubeflow-controller/issues/71#issuecomment-355056365

gaocegege commented 6 years ago

https://github.com/gaocegege/kubeflow-controller/blob/a54021d250c6e532f3a6e32be32e37e1249c0780/docs/design_doc.md#implementation

I added some explanation about the implementation, and as @wackxu said, it is similar to the job controller in Kubernetes.

Before #234 is merged, I will fix some issues in our own implementation, and I will file a PR to contribute the implementation upstream after #234, if you think it is generally acceptable :-)

jlewi commented 6 years ago

On the caicloud Clever platform, we support not only TensorFlow jobs but also batch jobs, services, and others. As for the open-source work, we'll eventually merge our TFJob CRD with the upstream one. It's necessary to file an issue to discuss the details, and I'll do it after my company work.

Are you planning on using a single CRD or having different CRDs for different types of jobs?

@DjangoPeng so is your plan to upstream to tensorflow/k8s any changes/improvements needed by Caicloud and then just use the TfJob controller in tensorflow/k8s? Do you foresee a need to maintain a downstream fork in caicloud/kubeflow-controller long term?

gaocegege commented 6 years ago

@jlewi

Sorry for the late reply.

We hope we can use the upstream controller eventually. We will discuss the changes and try to merge the CRD changes into the upstream, or we will rethink our own CRD spec design if we cannot come to an agreement.

As the README in caicloud/kubeflow-controller says, it is just temporary, and we hope the controller can help accelerate the development of tensorflow/k8s.

jlewi commented 6 years ago

Sounds great, thanks.

jlewi commented 6 years ago

Is there more work to be done to use Informer and Controller?

gaocegege commented 6 years ago

I think the work is done, and other things are in other issues, e.g. #314

gaocegege commented 6 years ago

I am closing this issue since we have already finished it. Thanks again for your awesome contribution, @wackxu!