Closed: Jeffwan closed this issue 4 years ago
Issue-Label Bot is automatically applying the labels:
Label | Probability |
---|---|
bug | 0.72 |
Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback! Links: app homepage, dashboard and code for this bot.
/kind/improvement
@Jeffwan Yeah, it is a potential problem. When I implemented the logic here, dynamic membership was not supported in distributed training. Now we can see the trend, so I think we should support it.
Thanks.
@gaocegege Yes. We can at least support this case in the reconciler logic first.
Yeah, I agree. Thanks for your contribution! :tada: :+1:
The PR has been merged. Closing the issue.
/assign @Jeffwan
@Jeffwan @gaocegege @johnugeorge @terrytangyuan I would like to know your thoughts on this topic.
Currently, in the #58 implementation, we don't consider whether the framework supports dynamic scale down, or whether the replicaType can be scaled down. We just scale down if the index is out of range. In other words, if the framework doesn't support dynamic scale down and the user still decreases the replica number, our controller logic produces no warning or error and directly scales down the replicas. This might lead to the failure of the job, but we leave that responsibility to the user. Is that correct?
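To make the concern concrete, here is a minimal, hypothetical Go sketch of where a warning or error could sit before the controller deletes out-of-range pods. The helper name `frameworkSupportsScaleDown` and the surrounding shape are made up for illustration; this is not the current kubeflow/common code.

```go
// Hypothetical guard in the scale-down path. frameworkSupportsScaleDown is an
// assumed hook, not an existing kubeflow/common API.
package main

import "fmt"

func reconcileScaleDown(rtype string, podIndex, replicas int, frameworkSupportsScaleDown bool) {
	if podIndex < replicas {
		return // pod is still within the desired replica range
	}
	if !frameworkSupportsScaleDown {
		// Today the controller would act silently; emitting a warning (or refusing
		// to act) would push the decision back to the user.
		fmt.Printf("warning: %s pod index %d exceeds replicas=%d but the framework "+
			"does not support dynamic scale down\n", rtype, podIndex, replicas)
		return
	}
	fmt.Printf("deleting %s pod with index %d (replicas=%d)\n", rtype, podIndex, replicas)
}

func main() {
	reconcileScaleDown("worker", 2, 2, false) // warns instead of deleting
	reconcileScaleDown("worker", 2, 2, true)  // deletes the out-of-range pod
}
```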
Because I am migrating the tf-operator to kubeflow/common, but we have our own scale-down logic in this PR. I am wondering whether we need to consider `if rtype == worker` or `if config.enableDynamicWorker == true`. If we don't, I can simply use the `reconcilePod` function in the common library and have no need to use the current one.
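To make the question concrete, this is roughly the branch being asked about. It is only a hypothetical sketch with stand-in names (`commonReconcilePods`, `enableDynamicWorker`), not the real tf-operator or kubeflow/common signatures.

```go
// Hypothetical decision point: if no framework-specific condition is needed,
// the common reconcile path can be reused as-is; otherwise the tf-operator
// keeps its own scale-down branch. All names are stand-ins.
package main

import "fmt"

// commonReconcilePods stands in for the shared reconcile logic in kubeflow/common.
func commonReconcilePods(rtype string, replicas int) {
	fmt.Printf("common reconcile for %s with replicas=%d\n", rtype, replicas)
}

func reconcile(rtype string, replicas int, enableDynamicWorker bool) {
	// The question: does the scale-down path need this extra condition,
	// or can the common implementation be used unconditionally?
	if rtype == "worker" && enableDynamicWorker {
		fmt.Println("tf-operator specific scale-down handling")
		return
	}
	commonReconcilePods(rtype, replicas)
}

func main() {
	reconcile("worker", 2, true)
	reconcile("ps", 2, false)
}
```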
@ChanYiLin
I took last week off and just came back this week to see this.
Yes, that's true. This is a corresponding change to scale up; we didn't scale down replicas in the past. The framework level should be able to provide the ability, while, as you say, this should be controlled by the user, and the controller needs to move the job to the desired state.
> I am wondering whether we need to consider `if rtype == worker`

You mean where?
`config.enableDynamicWorker` will only be available in TF, right? If we need some customization, we probably have to override the reconcile logic.
I think, long term, we should have a dynamic interface or types extracted into kubeflow/common to adapt to different frameworks. TF and Torch both have the support, but in different manners; there isn't a unified standard we can easily support. In the short term, we can either override the common implementation or reuse some methods.
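As a rough illustration of what extracting such a dynamic interface could look like, here is a hypothetical Go sketch. The interface name, method, and option are made up; none of them exist in kubeflow/common today.

```go
// Hypothetical extension point kubeflow/common could expose so that each framework
// operator (TF, PyTorch, ...) declares its own scale-down capability.
package main

import "fmt"

// DynamicScaler is an assumed interface, not an existing kubeflow/common type.
type DynamicScaler interface {
	// SupportsScaleDown reports whether pods of the given replica type can be
	// removed from a running job without breaking the framework.
	SupportsScaleDown(rtype string) bool
}

// tfScaler mimics TF behaviour: only workers can shrink, and only when the
// (hypothetical) enableDynamicWorker option is set.
type tfScaler struct{ enableDynamicWorker bool }

func (s tfScaler) SupportsScaleDown(rtype string) bool {
	return s.enableDynamicWorker && rtype == "Worker"
}

func main() {
	var scaler DynamicScaler = tfScaler{enableDynamicWorker: true}
	fmt.Println(scaler.SupportsScaleDown("Worker")) // true
	fmt.Println(scaler.SupportsScaleDown("PS"))     // false
}
```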
If the user changes `replicas` to a smaller value, the controller won't delete the redundant pods and services.
In the reconcile logic, we always set the pod slice size based on the value of `replicas`:
https://github.com/kubeflow/common/blob/64f943084a050ac84b50ce4f49b6b6e085b247e6/pkg/controller.v1/common/pod.go#L221

Let's say we have pods with indices 0, 1, 2 and the user changes `replicas` from 3 to 2. When we iterate to the pod with index 2, the `GetPodSlices` method only prints a warning log for this case:
https://github.com/kubeflow/common/blob/64f943084a050ac84b50ce4f49b6b6e085b247e6/pkg/controller.v1/common/pod.go#L232

`ReconcilePods` never even gets a chance to act on these pods, so the pod with index 2 will be left there.
https://github.com/kubeflow/common/blob/64f943084a050ac84b50ce4f49b6b6e085b247e6/pkg/controller.v1/common/pod.go#L283
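To illustrate the slicing behaviour with a small, self-contained example: this mirrors the idea of `GetPodSlices` but is not the actual code; index extraction from labels, error handling, and the controller wiring are all omitted, and the "fix" comment is only one possible direction.

```go
// Simplified illustration of the problem: pods are grouped into a slice sized by
// replicas, so a pod whose index >= replicas never lands in any slice, is only
// warned about, and the reconcile loop (which ranges over the slices) never sees it.
package main

import "fmt"

type pod struct {
	name  string
	index int
}

// getPodSlices mimics the shape of the real GetPodSlices: the slice length is replicas.
func getPodSlices(pods []pod, replicas int) [][]pod {
	slices := make([][]pod, replicas)
	var skipped []pod
	for _, p := range pods {
		if p.index < 0 || p.index >= replicas {
			// The real code only logs a warning here, so the pod is dropped
			// from reconciliation entirely and never deleted.
			fmt.Printf("warning: pod %s has out-of-range index %d\n", p.name, p.index)
			skipped = append(skipped, p)
			continue
		}
		slices[p.index] = append(slices[p.index], p)
	}
	// One possible fix: also return (or act on) skipped so the caller can delete those pods.
	_ = skipped
	return slices
}

func main() {
	pods := []pod{{"job-worker-0", 0}, {"job-worker-1", 1}, {"job-worker-2", 2}}
	slices := getPodSlices(pods, 2) // user scaled replicas from 3 down to 2
	fmt.Println(len(slices))        // 2: the index-2 pod keeps running but is never reconciled
}
```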