Closed withinboredom closed 1 year ago
@withinboredom thanks for creating this issue! Could you post the full error log showing why Longhorn cannot come up? Why is it connecting to Loft's proxy? Loft does not need access to any PVs since it stores its cluster state as CRDs, so it would be great if you could post any Loft-relevant information on why Loft couldn't come up.
In general the problematic loft parts are its webhooks and apiservices, which you could just delete after failure recovery as they will be recreated by loft anyways on successful restart.
Deleting the webhooks and apiservices can be done via:
kubectl delete validatingwebhookconfiguration loft loft-agent
kubectl delete apiservices v1.management.loft.sh v1.cluster.loft.sh
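For anyone scripting the recovery, here is a minimal sketch that wraps the two commands above. The resource names are the ones quoted in this thread. The `DRY_RUN` switch and the `loft_cleanup` function name are assumptions made for this sketch and are not part of Loft. `--ignore-not-found` is a standard `kubectl delete` flag that makes the cleanup safe to re-run if some of the objects are already gone.

```shell
#!/bin/sh
# Sketch only: removes Loft's webhook configurations and APIServices so the
# API server stops routing requests to a Loft instance that is down.
# DRY_RUN is a hypothetical knob for this sketch: it defaults to 1 (print only);
# set DRY_RUN=0 to actually run the commands against the current kube context.
DRY_RUN="${DRY_RUN:-1}"

loft_cleanup() {
  for cmd in \
    "kubectl delete validatingwebhookconfiguration loft loft-agent --ignore-not-found" \
    "kubectl delete apiservices v1.management.loft.sh v1.cluster.loft.sh --ignore-not-found"
  do
    if [ "$DRY_RUN" = "1" ]; then
      # Print the command instead of running it
      echo "$cmd"
    else
      # Intentionally unquoted so the string splits into command and arguments
      $cmd
    fi
  done
}

loft_cleanup
```

Run it once with the default dry-run setting to check what would be deleted, then again with `DRY_RUN=0`. As noted above, Loft should recreate these objects on a successful restart.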
So, I managed to recreate it. Apparently, we got overzealous and attached "owners" to namespaces. This caused Longhorn to fail majestically when Loft didn't come back. Loft wasn't coming up for the same reason: the loft namespace got an "owner" that couldn't be validated because Loft was down.
Basically, just make the loft namespace have the admin team as an owner, then stop the loft pods. Loft won't come back.
Edit to add: your commands above rectify the issue.
@withinboredom thanks for the information! With owners you mean setting the owner in the Loft UI? What was the error then that you experienced? Are there any logs within Loft that show the problem, since this sounds like not wanted behaviour to me?
> With owners you mean setting the owner in the Loft UI?
Yep, exactly!
> What was the error then that you experienced? Are there any logs within Loft that show the problem, since this sounds like not wanted behaviour to me?
Dangit! I forgot to grab the logs again! I'll see if I can dig them up, or just do it again :)
👋 hey @withinboredom I know this was a while ago, but is this still an issue for ya?
going to close this one out for now -- if this is still an issue let us know!
I installed Loft on a single-node, bare-metal cluster. It's using Longhorn to supply PVs, just like the production cluster. The node was rebooted for security patches, and when it came back up, the node never recovered.
After discovering that Longhorn wasn't coming up, I learned that it was failing to connect to Loft's proxy, which couldn't come up because Loft was unable to get access to its PVs. It seems there is a chicken-and-egg type problem here.
Do you have any suggestions or best practices to ensure that clusters that suffer catastrophic failures (all nodes going down) can come back up without Loft being available yet?