spegel-org / spegel

Stateless cluster local OCI registry mirror.

Many pods take a long time to be in running state and crash a lot in larger clusters #459

Open vitobotta opened 6 months ago

vitobotta commented 6 months ago

Spegel version

v0.0.22

Kubernetes distribution

k3s

Kubernetes version

v1.28.8+k3s1

CNI

Cilium

Describe the bug

Hi! First of all, thanks a lot for this useful project. It's helping me work around problems with pulling images on some nodes in Hetzner Cloud whose IPs have previously been banned by registries for some reason.

I am testing Spegel with a couple of clusters of between 200 and 400 nodes, and while many pods are running, many others crash continuously; in the logs I see "attempting to acquire leader lease" followed by "gracefully shutdown". What am I missing?

Another question I had is related to the firewall. At the moment I am not opening any ports in the firewall for Spegel and it seems to work anyway; at least I see all nodes pulling images that were often problematic due to the issue explained above. Is this OK or am I supposed to open ports? If I need to open ports, are there any security implications, or is the communication between nodes encrypted on that port?

Thanks a lot in advance!

vitobotta commented 6 months ago

Update: it seems that only 100 pods are running at the same time. Is this a known limitation?

bittrance commented 6 months ago

Hm, if there is no message between "attempting to acquire leader lease" and "gracefully shutdown", that implies that the central context was cancelled, probably by some other part of Spegel. There are no other interesting messages in the log? If not, my guess would be that the pod is receiving a SIGTERM from Kubernetes.
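To make that concrete, here is a minimal sketch of how I would expect the shutdown path to be wired (my assumption, not Spegel's actual code); if a SIGTERM cancels the root context while the lease is still being acquired, you get exactly the two messages you quoted and nothing in between:

```go
package main

import (
	"context"
	"log"
	"os/signal"
	"syscall"
)

func main() {
	// The root context is cancelled as soon as SIGTERM or SIGINT arrives.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	log.Println("attempting to acquire leader lease")
	if err := acquireLease(ctx); err != nil {
		// If the context was cancelled by a signal, there is nothing to log
		// between the two messages other than the cancellation itself.
		log.Println("gracefully shutdown")
		return
	}
	log.Println("lease acquired, starting registry")
}

// acquireLease stands in for the leader election bootstrap; here it simply
// blocks until the context is cancelled, mimicking an election that never
// completes before the pod is terminated.
func acquireLease(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}
```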

Just to be sure: are you consistently seeing exactly 100 pods starting successfully? I did some superficial searching but I cannot find a good candidate for a limit in that range, so I'm wondering if this is the result of configuration in your environment? Some operator injecting a ResourceQuota in the namespace perhaps, or configuration for watchers in etcd? If so, there might be some interesting events on the failing pods.

Apart from the registry being a type: NodePort service, there is nothing special about spegel inter-pod communication, so it should not normally require firewall configuration. If spegel doesn't know of an image, it will return 404, expecting Kubernetes to find the image by some other means.
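To illustrate the 404 fallback, here is a tiny sketch of a mirror handler (hypothetical names and port, not Spegel's actual implementation); unknown content is answered with 404 so the container runtime falls back to the upstream registry:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"path"
)

// mirrorHandler sketches the behaviour described above: content that no peer
// advertises gets a 404, and the runtime then pulls from upstream instead.
func mirrorHandler(resolve func(key string) (peerURL string, ok bool)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		key := path.Base(r.URL.Path) // hypothetical: key taken from the request path
		peer, ok := resolve(key)
		if !ok {
			http.Error(w, "content not found in cluster", http.StatusNotFound)
			return
		}
		u, err := url.Parse(peer)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		// Proxy the request to the node that advertised the content.
		httputil.NewSingleHostReverseProxy(u).ServeHTTP(w, r)
	})
}

func main() {
	// Toy resolver: nothing is known, so every request falls back with a 404.
	log.Fatal(http.ListenAndServe(":30020", mirrorHandler(func(string) (string, bool) { return "", false })))
}
```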

vitobotta commented 6 months ago

Hi @bittrance! There is no other useful information in the logs unfortunately, and yes, it's always max 100. I have tested with several different clusters and always see the same limit. I have also tested with brand new clusters with nothing that (to my knowledge) could cause such a limit; Spegel is the only thing that always seems to be limited in the same way. I have also tested with Postgres instead of etcd (with k3s), with the same result. What could I try in order to troubleshoot and find what's causing this limit? It's really weird because it's a daemonset. Thanks!

phillebaba commented 6 months ago

@vitobotta I do not think you need to make any firewall configuration changes on your nodes. All traffic goes through the k8s CNI, so as long as that is working, Spegel should work.

Your issue with 200+ node clusters is an interesting one. So far I have only run benchmarks with 100 nodes. The problem could come from multiple different sources. My best guess is that some limit is being reached with regard to leader election or DaemonSet IPs.

vitobotta commented 6 months ago

Yesterday I was working with a 500-node cluster and ran into the same problem again. Interestingly, I tried k3s' embedded Spegel, and it works fine. It doesn't use a daemonset since Spegel is embedded, but from a cursory look it seemed that all nodes had connections with each other on port 5001, as expected per k3s' docs.

phillebaba commented 6 months ago

That makes sense given your observations. You are correct that exposing port 5001 between nodes is required, as the p2p component runs on that port.

I had a thought which may explain the issues you are seeing. I was reading about others having issues with leader election in large clusters, and one common problem was the time it took. My best guess is that the startup probe times out because leader election is taking too long. This is why you are seeing instances shut down without any error: the signal sent cancels the context, which stops the process.
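For context, a client-go based leader election bootstrap roughly looks like the sketch below (hypothetical lease name, namespace, and durations, not necessarily what Spegel ships); a candidate can sit in RunOrDie for a while on a large cluster, and if that exceeds the startup probe budget the pod is killed mid-election:

```go
package election

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runLeaderElection blocks while trying to acquire a Lease lock and runs
// onLeader once the lease is won. All names and durations are illustrative.
func runLeaderElection(ctx context.Context, client kubernetes.Interface, id string, onLeader func(context.Context)) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "spegel-leader-election", Namespace: "spegel"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second, // how long a stale lease blocks new candidates
		RenewDeadline:   10 * time.Second, // how long the leader keeps trying to renew
		RetryPeriod:     2 * time.Second,  // how often candidates retry acquisition
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: onLeader,
			OnStoppedLeading: func() {},
		},
	})
}
```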

The first step here is to add a log to make it a bit clearer why things are being shut down. The second is to explore how to fix this long term for large clusters, either by just increasing the startup time or through some other optimization.

If I remember correctly k3s does not use leader election for bootstrapping. Either way I think you are better off running the embedded Spegel in k3s even after we figure out how to solve this.

vitobotta commented 6 months ago

I think I see what you are saying. I can use the embedded spegel for clusters created with my project in Hetzner Cloud since it uses k3s, but I would also like to try Spegel at work with GKE clusters because it's pretty handy :)

phillebaba commented 5 months ago

I have observed something interesting today. When upgrading Spegel in a cluster where a lease already exists, all pods wait 60 seconds until they agree on a new leader. This is also the max duration of the startup probe. I will explore this further to see if I have missed some timeout.
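If anyone wants to watch this happen, a small sketch like the one below (the lease name and namespace are hypothetical, adjust to your install) reads the coordination Lease and prints who holds it and when it was acquired and last renewed:

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Lease name and namespace are hypothetical; adjust to your installation.
	lease, err := client.CoordinationV1().Leases("spegel").Get(context.Background(), "spegel-leader-election", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	holder := "<none>"
	if lease.Spec.HolderIdentity != nil {
		holder = *lease.Spec.HolderIdentity
	}
	// Comparing acquire and renew times against pod start times shows how
	// long new pods sat waiting on the old lease after an upgrade.
	fmt.Printf("holder=%s acquired=%v renewed=%v\n", holder, lease.Spec.AcquireTime, lease.Spec.RenewTime)
}
```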

phillebaba commented 3 months ago

@vitobotta have you observed similar issues with the latest release of Spegel?

vitobotta commented 3 months ago

Hi @phillebaba unfortunately I haven't had a chance to work on my side project lately so no, I haven't tried yet.