Rebooting cluster results in completely unusable cluster.

loft-sh / loft

Namespace & Virtual Cluster Manager for Kubernetes - Lightweight Virtual Clusters, Self-Service Provisioning for Engineers and 70% Cost Savings with Sleep Mode

https://loft.sh/docs/introduction


Rebooting cluster results in completely unusable cluster. #165

Closed withinboredom closed 1 year ago

withinboredom commented 2 years ago

I installed Loft on a single-node, bare-metal cluster. It uses Longhorn to supply PVs, just like the production cluster. The node was rebooted for security patches, and when it came back up, the node never recovered.

After discovering that Longhorn wasn't coming up, I learned that it was failing to connect to Loft's proxy, which couldn't come up because Loft was unable to get access to its PVs. It seems there is a chicken-and-egg problem here.

Do you have any suggestions or best practices to ensure that clusters that suffer catastrophic failures (all nodes going down) can come back up without Loft being available yet?

FabianKramm commented 2 years ago

@withinboredom thanks for creating this issue! Could you post the full error log showing why Longhorn cannot come up? Why is it connecting to Loft's proxy? Loft does not need access to any PVs since it stores its cluster state as CRDs, so it would be great if you could post any Loft-relevant information about why Loft couldn't come up.

FabianKramm commented 2 years ago

In general, the problematic Loft parts are its webhooks and apiservices, which you could just delete after failure recovery, as Loft will recreate them anyway on a successful restart.

Deleting the webhooks and apiservices can be done via:

kubectl delete validatingwebhookconfiguration loft loft-agent
kubectl delete apiservices v1.management.loft.sh v1.cluster.loft.sh
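
The two commands above could be wrapped in a small helper for use after a full-cluster outage. This is only a sketch: the function name `loft_recovery_cleanup` is hypothetical, the resource names are the ones listed above and may differ between Loft versions, and `--ignore-not-found` is added so the helper can be re-run safely.

```shell
#!/bin/sh
# Sketch only: resource names come from the commands above and are
# assumptions about a given Loft install; the function name is hypothetical.

loft_recovery_cleanup() {
    status=0

    # Remove Loft's validating webhook configurations so the API server
    # stops sending admission requests to a Loft that is not running.
    kubectl delete validatingwebhookconfiguration loft loft-agent \
        --ignore-not-found || status=1

    # Remove Loft's aggregated API services; per the comment above,
    # Loft recreates these on a successful restart.
    kubectl delete apiservices v1.management.loft.sh v1.cluster.loft.sh \
        --ignore-not-found || status=1

    # Non-zero if either delete failed, so callers can react.
    return "$status"
}
```

After sourcing the file, run `loft_recovery_cleanup` once the API server is reachable again, then wait for the Loft pods to restart.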

withinboredom commented 2 years ago

So, I managed to recreate it. Apparently, we got overzealous and attached "owners" to namespaces. This caused Longhorn to fail majestically when Loft didn't come back. Loft wasn't coming up for the same reason: the loft namespace got an "owner" that couldn't be validated because Loft was down.

To reproduce, just make the loft namespace have the admin team as an owner, then stop the Loft pods. Loft won't come back.

Edit to add: your commands above rectify the issue.
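
To check whether a cluster is exposed to this kind of deadlock, one option is to inspect the Loft webhooks and API services before deleting anything. The sketch below is an assumption-laden illustration: the function name `loft_admission_status` is hypothetical, the resource names are taken from the commands earlier in the thread, and the thread does not say which `failurePolicy` Loft's webhooks actually use.

```shell
#!/bin/sh
# Sketch only: resource names are assumed from the commands earlier in
# this thread; the function name is hypothetical.

loft_admission_status() {
    # Print each webhook's name and failurePolicy. A webhook with
    # failurePolicy "Fail" rejects matching requests while its backend is down.
    for cfg in loft loft-agent; do
        kubectl get validatingwebhookconfiguration "$cfg" \
            -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\n"}{end}'
    done

    # The AVAILABLE column shows whether the aggregated APIs are being served.
    kubectl get apiservices v1.management.loft.sh v1.cluster.loft.sh
}
```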

FabianKramm commented 2 years ago

@withinboredom thanks for the information! With owners you mean setting the owner in the Loft UI? What was the error then that you experienced? Are there any logs within Loft that show the problem, since this sounds like not wanted behaviour to me?

withinboredom commented 2 years ago

> With owners you mean setting the owner in the Loft UI?

Yep, exactly!

> What was the error then that you experienced? Are there any logs within Loft that show the problem, since this sounds like not wanted behaviour to me?

Dangit! I forgot to grab the logs again! I'll see if I can dig them up, or just do it again :)

carlmontanari commented 2 years ago

👋 hey @withinboredom I know this was a while ago, but is this still an issue for ya?

carlmontanari commented 1 year ago

going to close this one out for now -- if this is still an issue let us know!