kubernetes / cloud-provider

cloud-provider defines the shared interfaces which Kubernetes cloud providers implement. These interfaces allow various controllers to integrate with any cloud provider in a pluggable fashion. Also serves as an issue tracker for SIG Cloud Provider.
Apache License 2.0

kube-controller-manager -> cloud-controller-manager HA migration: KEP + alpha implementation #11

Open andrewsykim opened 5 years ago

andrewsykim commented 5 years ago

We need a KEP outlining how we intend to migrate existing clusters from the kube-controller-manager to the cloud-controller-manager for the cloud-provider-specific parts of Kubernetes.

At KubeCon NA 2018, we discussed grouping the existing cloud controllers under a single leader election that is shared by the kube-controller-manager and the cloud-controller-manager. For single-node control planes this is not needed, but for HA control planes we need a mechanism to ensure that no more than one kube-controller-manager or cloud-controller-manager is running the set of cloud controllers in a cluster at any given time.

andrewsykim commented 5 years ago

/assign @mcrute

@mcrute is working on the initial design for this.

andrewsykim commented 5 years ago

cc @cheftako

mcrute commented 5 years ago

Here's a first draft. There's plenty more to be done but getting this out there for discussion.

andrewsykim commented 5 years ago

Thanks @mcrute!

cheftako commented 5 years ago

Thanks @mcrute! As part of this, I would like us to also discuss how we can do a better job of running controllers in HA environments. Currently we do not utilize HA well. If we could stop killing the process when leader election is lost, we could get much better utilization in HA. The problem has been that controllers tend to kick off goroutines (and similar asynchronous processing), and controller actions may not be idempotent. So we end up with mutations from something other than the main controller thread, which did not get shut down (or at least not in a timely manner).

One thought is to attach an election token (or similar) to mutations. If the mutator is no longer the leader, the write is refused and the mutator is notified that it is no longer the leader (and should stop). While I believe this is more than we need for the KCM->CCM migration, I would like us to consider it as where we are going. It would be good to make sure we are generally heading in that direction.
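The election token idea can be sketched roughly as follows. This is a hypothetical in-memory illustration, not an existing Kubernetes API: `fencedStore`, `elect`, `write`, and `errNotLeader` are names invented for the example, and it assumes the store itself knows the current token and can reject writes.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errNotLeader tells a mutator that it lost the election and should stop.
var errNotLeader = errors.New("write refused: election token is stale")

// fencedStore is a hypothetical store that tracks the current election token
// and refuses mutations carrying any other token.
type fencedStore struct {
	mu      sync.Mutex
	current uint64
	data    map[string]string
}

func newFencedStore() *fencedStore {
	return &fencedStore{data: map[string]string{}}
}

// elect simulates a new leader winning the election and returns its token.
func (s *fencedStore) elect() uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.current++
	return s.current
}

// write applies the mutation only if the token belongs to the current leader.
func (s *fencedStore) write(token uint64, key, value string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if token != s.current {
		return errNotLeader
	}
	s.data[key] = value
	return nil
}

func main() {
	store := newFencedStore()

	oldLeader := store.elect()
	fmt.Println(store.write(oldLeader, "route", "a")) // accepted while still leader

	newLeader := store.elect()
	// A leftover goroutine from the old leader tries to write and is refused.
	fmt.Println(store.write(oldLeader, "route", "b"))
	fmt.Println(store.write(newLeader, "route", "c")) // accepted for the new leader
}
```

With a check like this, a straggling goroutine from a former leader cannot mutate state, so the process would not need to be killed to guarantee safety.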


andrewsykim commented 5 years ago

/milestone v1.15
/priority critical-urgent

andrewsykim commented 5 years ago

/assign

andrewsykim commented 5 years ago

This is going to slip into the next release since we couldn't get the KEP reviewed in time for the KEP deadline. Further discussion is happening in https://github.com/kubernetes/enhancements/pull/979 and https://github.com/kubernetes/kubernetes/pull/77878. We are hoping to have an implementable KEP in time for v1.16.

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

cheftako commented 5 years ago

/remove-lifecycle stale

andrewsykim commented 4 years ago

/assign @yastij

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

cheftako commented 4 years ago

/remove-lifecycle stale

cheftako commented 4 years ago

/lifecycle frozen