giantswarm / kvm-operator-node-controller

Manages Kubernetes nodes based on kvm-operator (host cluster) state.
Apache License 2.0

Initial PoC version #1

Closed: r7vme closed this 6 years ago

r7vme commented 7 years ago

This controller runs in the host cluster as a pod.

It does three things (sketched in code below):

1) Check the guest cluster for NotReady nodes.
2) Check for the corresponding pod in the host cluster.
3) Delete the NotReady node if there is no pod in the host cluster.
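Roughly, in client-go terms, the loop might look like this. This is a minimal sketch, not the code from this PR; identifiers such as reconcile, guest, host, and clusterNamespace are illustrative, and it assumes the guest node name matches the name of its backing pod:

```go
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reconcile runs the three steps described above against one guest cluster.
func reconcile(ctx context.Context, guest, host kubernetes.Interface, clusterNamespace string) error {
	// 1) Find NotReady nodes in the guest cluster.
	nodes, err := guest.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, node := range nodes.Items {
		if isReady(node) {
			continue
		}
		// 2) Check for the corresponding pod in the host cluster namespace.
		_, err := host.CoreV1().Pods(clusterNamespace).Get(ctx, node.Name, metav1.GetOptions{})
		if err == nil {
			continue // backing pod still exists, leave the node alone
		}
		if !apierrors.IsNotFound(err) {
			return err
		}
		// 3) No backing pod: delete the NotReady node from the guest cluster.
		if err := guest.CoreV1().Nodes().Delete(ctx, node.Name, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
			return err
		}
	}
	return nil
}

// isReady reports whether the node's Ready condition is True.
func isReady(node corev1.Node) bool {
	for _, c := range node.Status.Conditions {
		if c.Type == corev1.NodeReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}
```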

This is a working version that has been tested.

Regarding architecture:

The controller and provider are implemented like an "out-of-tree" Kubernetes cloud controller manager; see [1] for background on the cloud-controller-manager. I would love to keep the controller and provider as close as possible to the Kubernetes recommendations. In the future this could potentially become an external cloud provider (e.g. creating load balancers/persistent volumes).

1 - https://kubernetes.io/docs/tasks/administer-cluster/running-cloud-controller
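The essence of that split is that the controller never talks to the "cloud" directly; it only asks a small provider interface whether a node's backing instance still exists. A hedged sketch of the shape, assuming a pared-down interface mirroring the upstream cloudprovider.Instances method InstanceExistsByProviderID (the real upstream interface has several more methods, and these names are not from this PR):

```go
package controller

import "context"

// instances mirrors the relevant slice of the upstream
// cloudprovider.Instances interface: the node controller only needs to
// ask whether a node's backing instance still exists.
type instances interface {
	InstanceExistsByProviderID(ctx context.Context, providerID string) (bool, error)
}

// nodeGone asks the provider whether the instance behind a NotReady node
// is gone; only then is it safe for the controller to delete the node.
func nodeGone(ctx context.Context, p instances, providerID string) (bool, error) {
	exists, err := p.InstanceExistsByProviderID(ctx, providerID)
	if err != nil {
		return false, err
	}
	return !exists, nil
}
```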

Part of: giantswarm/k8scloudconfig#162 giantswarm/giantswarm#1830

r7vme commented 7 years ago

@xh3b4sd thanks for the comments. Initially this was supposed to run in the guest cluster.

To get the bigger picture you can check how the node controller is implemented in k8s. My controller is mostly a copy-paste with some pieces removed.

In short, the controller just initializes a provider and asks it whether a node still exists. I thought it would make sense to implement it the same way, since our "cloud" is the host cluster.

But it looks like following the upstream architecture is not the best idea in our case.

Current approach:

  • Run this controller as a pod in the HOST cluster.
  • Based on the available pods in the guest cluster's namespace, it will decide whether or not to remove a NotReady node.

xh3b4sd commented 7 years ago

I like this. I am just wondering if we shouldn't then just add this functionality to the KVM operator. It looks like we have to get guest cluster access working anyway. In the KVM operator we know when nodes are removed and updated, so we could then just remove them from the guest cluster k8s API. We would have this in one place, and we would not have to guess whether it is a good idea to remove a node now or not; we would have the real event that is necessary to make this decision immediately.

r7vme commented 7 years ago

Hmmm, so yes, it looks like this should be part of the KVM operator. I like the idea too.

calvix commented 7 years ago

The most important case we need this for right now is when pods are rescheduled. That means no change in adding or removing nodes.

r7vme commented 7 years ago

When a pod is rescheduled, the new node comes up with a NEW hostname, so the pod for the OLD hostname does not exist anymore. And the OLD node is NotReady, so it should be removed.

r7vme commented 7 years ago

@kopiczko

+1, I also don't follow this. Are you going to add some functionality here? What will it be? Why not remove the provider package now and add it back when it does something useful?

This is a reference skeleton and it works; it just has dumb logic: it says "not found" for all requested nodes. As discussed today, I'll change the design a little bit (fortunately I did not implement anything yet), so this controller will run in the host cluster (not in the guest cluster).

So this provider will just do an API call similar to kubectl get pod $POD.

Something like this (see the sketch below). Crap, this will also need a kvm-operator part to deploy this service in the host cluster for every guest cluster :/ But this is a really important thing for upgrades and resiliency of our clusters. This functionality is even missing in Kubernetes itself for on-prem (which is not surprising, because on-prem is always different.)
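For illustration, the client-go equivalent of that kubectl get pod $POD call could look roughly like this; kvmProvider and its fields are hypothetical names, not the ones in this PR:

```go
package provider

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// kvmProvider answers node-existence queries by looking up the node's
// backing pod in the guest cluster's namespace on the host cluster.
// Type and field names are illustrative, not from this PR.
type kvmProvider struct {
	hostClient       kubernetes.Interface
	clusterNamespace string
}

// InstanceExistsByProviderID is effectively
// "kubectl get pod $POD -n $NAMESPACE" against the host cluster.
func (p *kvmProvider) InstanceExistsByProviderID(ctx context.Context, providerID string) (bool, error) {
	_, err := p.hostClient.CoreV1().Pods(p.clusterNamespace).Get(ctx, providerID, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```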

calvix commented 7 years ago

You can make one controller that can operate all guest clusters. It can watch for TPR kvm and for each of them spawn goroutine which will do the cleaning loop
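A loose sketch of that shape, assuming a simplified event channel in place of a real TPR watch; every name here is hypothetical and the real controller would use the Kubernetes watch API:

```go
package controller

import (
	"context"
	"time"
)

// kvmEvent is a stand-in for a watch event on the kvm TPR.
type kvmEvent struct {
	Added     bool // true on ADDED, false on DELETED
	ClusterID string
}

// runAll watches kvm resources and runs one cleanup goroutine per guest
// cluster, stopping it when the cluster's resource is deleted.
func runAll(ctx context.Context, events <-chan kvmEvent, clean func(clusterID string)) {
	cancels := map[string]context.CancelFunc{}
	for ev := range events {
		if ev.Added {
			cctx, cancel := context.WithCancel(ctx)
			cancels[ev.ClusterID] = cancel
			go func(id string) {
				ticker := time.NewTicker(30 * time.Second)
				defer ticker.Stop()
				for {
					select {
					case <-cctx.Done():
						return
					case <-ticker.C:
						clean(id) // e.g. the reconcile loop sketched earlier
					}
				}
			}(ev.ClusterID)
		} else if cancel, ok := cancels[ev.ClusterID]; ok {
			cancel() // guest cluster removed: stop its cleanup goroutine
			delete(cancels, ev.ClusterID)
		}
	}
}
```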

r7vme commented 6 years ago

Fellas, I addressed all comments except the micro logger one and implemented the missing part in the provider.

Please take a look. If it's OK, I want to merge this working version and rewrite it with microkit in a separate PR later.

calvix commented 6 years ago

Can't really say much about the code, but if it works I am OK with that.

r7vme commented 6 years ago

@kopiczko I answered your comments.

kopiczko commented 6 years ago

Can we just agree that this is a temporary thing and should be part of kvm-operator? The main problem I see is that the provider is based on very internal knowledge from kvm-operator. We should also name it with an -operator suffix according to our newest 🚓 https://gigantic.slack.com/archives/C04TGHDEF/p1508852314000677

I approve, provided you make sure to merge it into kvm-operator later.

I'd also like to hear @xh3b4sd's opinion on that. I think you both agreed it should be part of kvm-operator.

xh3b4sd commented 6 years ago

Plus one on what Pawel said. My only fear, based on a lot of experience, is that as soon as we go with a solution now, we will not improve it for a very long time. That is why I would have liked to "make it right" in the first place. I understand the urgency though, so if we can fix a real problem quickly, let's move forward. I do not want to block this.

r7vme commented 6 years ago

1) Here is the issue to make this part of kvm-operator. From my side, I feel that I don't have enough expertise in either Go or operatorkit to make it happen, so I can only hope that one of you guys can take that issue.

2) Rewriting this with operatorkit does not make sense if in the long term we want to make it part of kvm-operator.

Okay, so I agree with you guys. But it looks like the main problem is that my skill set is not enough to do this properly in a short time.

WDYT?

JFYI: this is only part 1 of the whole task. Part 2 is a cert-operator feature, now in the testing phase. Part 3 is to add a deployment for this controller to kvm-operator.

xh3b4sd commented 6 years ago

No worries. Go ahead and let's fix a real problem first. We then just have to clean up a bit later as a team. It is maybe also good to get a bit more experience and see how it can work. Then we can press it into our conventions and integrate it into our common tooling.

r7vme commented 6 years ago

OK, good. Can I get an approval then?