Closed wackxu closed 6 years ago
In general, I'd much prefer to reuse existing, well-maintained libraries than custom code. So I'm all for replacing custom code with a more established library.
Are there specific changes you are thinking of making?
Hi, we (caicloud) are implementing a controller for Kubeflow which is similar to kubernetes/sample-controller, and we would be glad to file a proposal for the new design soon.
@gaocegege I am sorry; I have almost finished it and will submit a PR ASAP.
@jlewi We can make the change step by step. The first phase mainly includes the following:
1. Refactor the spec package to follow the standard Kubernetes API conventions, making the CRD definition versioned so that it is easier for us to maintain and to keep the code backwards compatible.
2. Import code-generator to help us generate the sharedInformer, lister, clientSet, and so on. We will then use these to refactor the controller.
3. Use the sharedInformer to list and watch CRDs. It caches the CRD data, so it is more efficient, and it reduces the number of lines of code and makes the controller more readable.
4. Replace the client with the generated client, and so on.
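The informer-driven pattern in steps 2 and 3 can be sketched roughly as follows. This is a minimal plain-Go illustration of the idea, not the actual client-go API: `TFJob`, `Store`, and the handler wiring here are simplified stand-ins for what code-generator and the shared informer machinery would actually provide.

```go
package main

import (
	"fmt"
	"sync"
)

// TFJob is a simplified stand-in for the real CRD type that
// code-generator would produce typed clients and listers for.
type TFJob struct {
	Name     string
	Replicas int
}

// Store mimics the role of a shared informer's local cache:
// handlers are notified on Add, and reads (List) hit the
// in-memory cache instead of the API server.
type Store struct {
	mu      sync.RWMutex
	objects map[string]TFJob
	onAdd   func(TFJob)
}

func NewStore(onAdd func(TFJob)) *Store {
	return &Store{objects: map[string]TFJob{}, onAdd: onAdd}
}

// Add caches the object and fires the event handler, similar in
// spirit to an informer's AddEventHandler callback.
func (s *Store) Add(job TFJob) {
	s.mu.Lock()
	s.objects[job.Name] = job
	s.mu.Unlock()
	s.onAdd(job)
}

// List is what a generated lister provides: cached, read-only access.
func (s *Store) List() []TFJob {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make([]TFJob, 0, len(s.objects))
	for _, j := range s.objects {
		out = append(out, j)
	}
	return out
}

func main() {
	store := NewStore(func(j TFJob) {
		fmt.Printf("sync triggered for %s\n", j.Name)
	})
	store.Add(TFJob{Name: "dist-training-job", Replicas: 4})
	fmt.Println("cached jobs:", len(store.List()))
}
```

In the real refactor, the generated lister and informer replace hand-rolled list/watch loops, which is where the efficiency and line-count savings come from.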
@wackxu nice work!
Are you targeting a refactor of the code base only, or do you also plan to change the API?
That's awesome.
Are you planning on submitting a single PR or splitting it up into multiple PRs?
@wackxu It doesn't matter, it is awesome :smile:
Reopening this because it was autoclosed but I think there are PRs still pending related to this.
We implemented a simple controller, https://github.com/caicloud/kubeflow-controller, and we would be glad to help upstream with the rest of the work.
@gaocegege Merging the caicloud work seems like a great idea; I think the caicloud controller has many advantages over the current implementation. A couple of questions/comments:
https://github.com/caicloud/kubeflow-controller seems to distinguish between a local job and non local job. What's the difference?
How does the caicloud controller determine when a job is done?
Could you explain the proposed design? Here are some questions, but I think it might be better if you could write up an explanation that could eventually be included in our docs, as opposed to answering the questions directly.
It looks like it's event-driven. Can you explain this? For example, is the controller handling events for the sub-resources (e.g. pods and services) and not just TfJob events?
It looks like events are created by comparing current status to expectations. Can you explain how this works in terms of deciding whether a pod needs to be created?
How do you deal with the fact that workers and pods are stateful?
Looks like the caicloud controller is directly managing pods. I think that is probably better than using job controllers (see #45).
Looks like the caicloud controller is publishing Kubernetes events, which is definitely something we should do.
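The expectation-style comparison asked about above (deciding whether a pod needs to be created by comparing current status to expectations) can be sketched roughly like this. The types and function here are simplified stand-ins for illustration, not the actual caicloud code; the real controller also tracks per-job expectations to avoid acting on a stale informer cache.

```go
package main

import "fmt"

// PodPhase is a simplified stand-in for the observed pod phase.
type PodPhase string

const (
	PodRunning PodPhase = "Running"
	PodFailed  PodPhase = "Failed"
)

// podsToCreate returns how many pods the controller should create
// on this sync: the desired replica count minus the live pods it
// already observes in the cache (failed pods don't count as live).
func podsToCreate(desiredReplicas int, observed []PodPhase) int {
	live := 0
	for _, p := range observed {
		if p != PodFailed {
			live++
		}
	}
	if diff := desiredReplicas - live; diff > 0 {
		return diff
	}
	return 0
}

func main() {
	// 4 workers desired; 2 running and 1 failed observed -> create 2.
	fmt.Println(podsToCreate(4, []PodPhase{PodRunning, PodRunning, PodFailed}))
}
```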
@jlewi @gaocegege The caicloud kubeflow-controller is similar to the Kubernetes job controller, and it is a better implementation. I think we can do that in the next PR, and I would be very glad to help review it. The change is big, so I think we should first merge https://github.com/tensorflow/k8s/pull/234, which mainly changes the structure of the controller and cmd packages to keep the style consistent with Kubernetes controllers. In the next PR we can work on the reconcile logic, like the caicloud kubeflow-controller.
@jlewi @wackxu
Thanks for your review, I will write a design doc ASAP and file an issue to discuss the differences between caicloud controller and the upstream controller. And I agree with @wackxu that we should focus on #234 and keep moving.
https://github.com/caicloud/kubeflow-controller seems to distinguish between a local job and non-local job. What's the difference?
There are some changes in caicloud TFJob CRD, and @DjangoPeng is summarising them and he will file an issue to discuss it.
How does the caicloud controller determine when a job is done?
For now it is simple: the controller checks the statuses of the workers, and if all workers have succeeded, we consider the TFJob succeeded.
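A minimal sketch of that completion check, with simplified stand-in types rather than the real CRD status structs:

```go
package main

import "fmt"

// WorkerPhase is a simplified stand-in for a worker pod's phase.
type WorkerPhase string

const (
	WorkerRunning   WorkerPhase = "Running"
	WorkerSucceeded WorkerPhase = "Succeeded"
	WorkerFailed    WorkerPhase = "Failed"
)

// jobSucceeded mirrors the rule described above: the TFJob is
// considered succeeded only when every worker has succeeded.
func jobSucceeded(workers []WorkerPhase) bool {
	for _, w := range workers {
		if w != WorkerSucceeded {
			return false
		}
	}
	return len(workers) > 0
}

func main() {
	fmt.Println(jobSucceeded([]WorkerPhase{WorkerSucceeded, WorkerSucceeded})) // true
	fmt.Println(jobSucceeded([]WorkerPhase{WorkerSucceeded, WorkerRunning}))   // false
}
```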
Could you explain the proposed design? Here are some questions, but I think it might be better if you could write up an explanation that could eventually be included in our docs, as opposed to answering the questions directly.
Yeah, I will write a design doc and post in this issue.
Looks like the caicloud controller is directly managing pods. I think that is probably better than using job controllers (see #45).
We have had an internal discussion about this too, in https://github.com/caicloud/kubeflow-controller/issues/71. We have not reached a conclusion yet, and comments are welcome :-)
Personally, I think that if most of the use cases can be handled by Job, then Job is better, since we would not have to manage the whole lifecycle of the pods, which is already implemented in the job controller.
Looks like caicloud controller is publishing Kubernetes events which is definitely something we should do
Yeah, and I think @zjj2wry is implementing that feature for tensorflow/k8s: https://github.com/tensorflow/k8s/pull/260.
For now the caicloud controller has limited support: the recorder records the creation of pods and services.
For example, events when we create 4 workers and 2 PS:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 33s kubeflow-controller Created service: dist-training-job-worker-0-d88wg
Normal SuccessfulCreate 32s kubeflow-controller Created service: dist-training-job-worker-1-hwl26
Normal SuccessfulCreate 32s kubeflow-controller Created service: dist-training-job-worker-2-jx7g7
Normal SuccessfulCreate 32s kubeflow-controller Created service: dist-training-job-worker-3-gdb66
Normal SuccessfulCreate 32s kubeflow-controller Created service: dist-training-job-ps-0-nlfdp
Normal SuccessfulCreate 32s kubeflow-controller Created service: dist-training-job-ps-1-jw7c4
Normal SuccessfulCreate 32s kubeflow-controller Created pod: dist-training-job-2j9br
Normal SuccessfulCreate 32s kubeflow-controller Created pod: dist-training-job-pm85l
Normal SuccessfulCreate 31s kubeflow-controller Created pod: dist-training-job-5rgvh
Normal SuccessfulCreate 29s (x3 over 31s) kubeflow-controller (combined from similar events): Created pod: dist-training-job-n48d7
And we could support more events if we need.
@jlewi Thanks for your reply. I'd like to give a supplement.
There are some changes in caicloud TFJob CRD, and @DjangoPeng is summarising them and he will file an issue to discuss it.
In the caicloud Clever platform, we support not only TensorFlow jobs but also batch jobs, services, and others. Considering the open source work, we will eventually combine our own TFJob CRD with the upstream one. It is necessary to file an issue to discuss the details, and I'll do it after my company work.
How does the caicloud controller determine when a job is done?
The design and lifecycle of a local job is determinate and easy to understand, because a local job is similar to a single batch job. But a distributed job is more complicated, and its status depends on the user, so we have to discuss the strategy and logic.
Looks like the caicloud controller is directly managing pods. I think that is probably better than using job controllers (see #45).
We're discussing the selection. Personally speaking, I prefer Pod. Ref: https://github.com/caicloud/kubeflow-controller/issues/71#issuecomment-355056365
I added some explanation about the implementation, and as wackxu@ said, it is similar to the job controller in Kubernetes.
Before #234 is merged, I will fix some issues in our own implementation. Then I will file a PR to contribute the implementation upstream after #234, if you think it is generally acceptable :-)
In the caicloud Clever platform, we support not only TensorFlow jobs but also batch jobs, services, and others. Considering the open source work, we will eventually combine our own TFJob CRD with the upstream one. It is necessary to file an issue to discuss the details, and I'll do it after my company work.
Are you planning on using a single CRD or having different CRDs for different types of jobs?
@DjangoPeng So is your plan to upstream to tensorflow/k8s any changes/improvements needed by caicloud and then just use the TfJob controller in tensorflow/k8s? Do you foresee a need to maintain a downstream fork in caicloud/kubeflow-controller long term?
@jlewi
Sorry for the late reply.
We hope to use the upstream controller eventually. We will discuss the changes and try to merge the CRD changes into the upstream, or we will rethink our own CRD spec design if we cannot come to an agreement.
As the README in caicloud/kubeflow-controller says, it is just temporary, and we hope the controller can help accelerate the development of tensorflow/k8s.
Sounds great thanks.
Is there more work to be done to use Informer and Controller?
I think the work is done, and other things are in other issues, e.g. #314
I am closing the issue since we have already finished it. Thanks for your awesome contribution again @wackxu
Kubernetes provides many useful tools that we can use to develop our project, such as sharedInformer, clientSet, lister, and so on. Using these makes our project more readable and improves performance. We also need to version the API, which makes it easier for us to keep backwards compatibility and makes the code follow the standard conventions. We can refactor the project similarly to the example at https://github.com/kubernetes/sample-controller. I have done some work on this. What do you think about it? @jlewi