Closed: Jeffwan closed this issue 4 years ago
Issue-Label Bot is automatically applying the labels:
Label | Probability |
---|---|
bug | 0.72 |
Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback! Links: app homepage, dashboard and code for this bot.
/kind/improvement
@Jeffwan Yeah, it is a potential problem. When I implemented the logic here, dynamic membership was not supported in distributed training. Now we can see the trend, so I think we should support it.
Thanks.
@gaocegege Yes. We can at least support this case in the reconciler logic first.
Yeah, I agree. Thanks for your contribution! :tada: :+1:
The PR has been merged. Closing the issue.
/assign @Jeffwan
@Jeffwan @gaocegege @johnugeorge @terrytangyuan I would like to know your thoughts on this topic.
Currently, in the #58 implementation, we don't consider whether the framework supports dynamic scale down, or whether the replicaType can be scaled down. We just scale down if the index is out of range. In other words, if the framework doesn't support dynamic scale down and the user still decreases the replica number, our controller logic produces no warning or error and directly scales down the replicas. This might lead to the failure of the job, but we leave that responsibility to the user. Is that correct?
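To make the concern concrete, here is a minimal, hypothetical Go sketch of where a warning or error could sit before the controller deletes out-of-range pods. The helper name `frameworkSupportsScaleDown` and the surrounding shape are made up for illustration; this is not the current kubeflow/common code.

```go
// Hypothetical guard in the scale-down path. frameworkSupportsScaleDown is an
// assumed hook, not an existing kubeflow/common API.
package main

import "fmt"

func reconcileScaleDown(rtype string, podIndex, replicas int, frameworkSupportsScaleDown bool) {
	if podIndex < replicas {
		return // pod is still within the desired replica range
	}
	if !frameworkSupportsScaleDown {
		// Today the controller would act silently; emitting a warning (or refusing
		// to act) would push the decision back to the user.
		fmt.Printf("warning: %s pod index %d exceeds replicas=%d but the framework "+
			"does not support dynamic scale down\n", rtype, podIndex, replicas)
		return
	}
	fmt.Printf("deleting %s pod with index %d (replicas=%d)\n", rtype, podIndex, replicas)
}

func main() {
	reconcileScaleDown("worker", 2, 2, false) // warns instead of deleting
	reconcileScaleDown("worker", 2, 2, true)  // deletes the out-of-range pod
}
```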
Because I am migrating the tf-operator to kubeflow/common, but we have our own scale-down logic in this PR. I am wondering whether we need to consider `if rtype == worker` or `if config.enableDynamicWorker == true`. If we don't, I can simply use the `reconcilePod` function in the common library and have no need to use the current one.
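To make the question concrete, this is roughly the branch being asked about. It is only a hypothetical sketch with stand-in names (`commonReconcilePods`, `enableDynamicWorker`), not the real tf-operator or kubeflow/common signatures.

```go
// Hypothetical decision point: if no framework-specific condition is needed,
// the common reconcile path can be reused as-is; otherwise the tf-operator
// keeps its own scale-down branch. All names are stand-ins.
package main

import "fmt"

// commonReconcilePods stands in for the shared reconcile logic in kubeflow/common.
func commonReconcilePods(rtype string, replicas int) {
	fmt.Printf("common reconcile for %s with replicas=%d\n", rtype, replicas)
}

func reconcile(rtype string, replicas int, enableDynamicWorker bool) {
	// The question: does the scale-down path need this extra condition,
	// or can the common implementation be used unconditionally?
	if rtype == "worker" && enableDynamicWorker {
		fmt.Println("tf-operator specific scale-down handling")
		return
	}
	commonReconcilePods(rtype, replicas)
}

func main() {
	reconcile("worker", 2, true)
	reconcile("ps", 2, false)
}
```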
@ChanYiLin
I took last week off and just came back this week to see this.
Yes, that's true. This is a corresponding change to scale up; we didn't scale down replicas in the past. The framework level should be able to provide the ability, while, as you say, this should be controlled by the user, and the controller needs to move the job to the desired state.
> I am wondering whether we need to consider `if rtype == worker`

You mean where?
`config.enableDynamicWorker` will only be available in TF, right? If we need some customization, we probably have to override the reconcile logic.
I think, long term, we should have a dynamic interface or types extracted into kubeflow/common to adapt to different frameworks. TF and Torch both have the support, but in different manners; there isn't a unified standard we can easily support. In the short term, we can either override the common implementation or reuse some methods.
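As a rough illustration of what extracting such a dynamic interface could look like, here is a hypothetical Go sketch. The interface name, method, and option are made up; none of them exist in kubeflow/common today.

```go
// Hypothetical extension point kubeflow/common could expose so that each framework
// operator (TF, PyTorch, ...) declares its own scale-down capability.
package main

import "fmt"

// DynamicScaler is an assumed interface, not an existing kubeflow/common type.
type DynamicScaler interface {
	// SupportsScaleDown reports whether pods of the given replica type can be
	// removed from a running job without breaking the framework.
	SupportsScaleDown(rtype string) bool
}

// tfScaler mimics TF behaviour: only workers can shrink, and only when the
// (hypothetical) enableDynamicWorker option is set.
type tfScaler struct{ enableDynamicWorker bool }

func (s tfScaler) SupportsScaleDown(rtype string) bool {
	return s.enableDynamicWorker && rtype == "Worker"
}

func main() {
	var scaler DynamicScaler = tfScaler{enableDynamicWorker: true}
	fmt.Println(scaler.SupportsScaleDown("Worker")) // true
	fmt.Println(scaler.SupportsScaleDown("PS"))     // false
}
```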
If the user changes `replicas` to a smaller value, the controller won't delete the redundant pods and services.
In the reconcile logic, we always set the pod slice size based on the value of `replicas`:
https://github.com/kubeflow/common/blob/64f943084a050ac84b50ce4f49b6b6e085b247e6/pkg/controller.v1/common/pod.go#L221

Let's say we have pods with indices 0, 1, 2 and the user changes `replicas` from 3 to 2. When we iterate to the pod with index 2, the `GetPodSlices` method only prints a warning log for this case:
https://github.com/kubeflow/common/blob/64f943084a050ac84b50ce4f49b6b6e085b247e6/pkg/controller.v1/common/pod.go#L232

`ReconcilePods` never even gets a chance to act on these pods, so the pod with index 2 will be left there.
https://github.com/kubeflow/common/blob/64f943084a050ac84b50ce4f49b6b6e085b247e6/pkg/controller.v1/common/pod.go#L283
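To illustrate the slicing behaviour with a small, self-contained example: this mirrors the idea of `GetPodSlices` but is not the actual code; index extraction from labels, error handling, and the controller wiring are all omitted, and the "fix" comment is only one possible direction.

```go
// Simplified illustration of the problem: pods are grouped into a slice sized by
// replicas, so a pod whose index >= replicas never lands in any slice, is only
// warned about, and the reconcile loop (which ranges over the slices) never sees it.
package main

import "fmt"

type pod struct {
	name  string
	index int
}

// getPodSlices mimics the shape of the real GetPodSlices: the slice length is replicas.
func getPodSlices(pods []pod, replicas int) [][]pod {
	slices := make([][]pod, replicas)
	var skipped []pod
	for _, p := range pods {
		if p.index < 0 || p.index >= replicas {
			// The real code only logs a warning here, so the pod is dropped
			// from reconciliation entirely and never deleted.
			fmt.Printf("warning: pod %s has out-of-range index %d\n", p.name, p.index)
			skipped = append(skipped, p)
			continue
		}
		slices[p.index] = append(slices[p.index], p)
	}
	// One possible fix: also return (or act on) skipped so the caller can delete those pods.
	_ = skipped
	return slices
}

func main() {
	pods := []pod{{"job-worker-0", 0}, {"job-worker-1", 1}, {"job-worker-2", 2}}
	slices := getPodSlices(pods, 2) // user scaled replicas from 3 down to 2
	fmt.Println(len(slices))        // 2: the index-2 pod keeps running but is never reconciled
}
```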