Failure story length and language

marratj commented 5 years ago

Hi there,

is there a lower limit on how long a failure story needs to be (e.g., would an Ingress traffic outage caused by a misconfigured Load Balancer Service already count)?

And, what are the requirements on language? English only? I have published the failure story mentioned above in my German blog, but still would like to contribute it.

Regards Marcel

hjacobs commented 5 years ago

@marratj can you share the link to your blog post? English is obviously better, but I would also include other languages (automated translation works fine "sometimes"). For the length/impact: let's discuss, it can be brief, but it should be long enough for people to get something out of it :smile:

marratj commented 5 years ago

The blog post is here: https://www.devops-hof.de/kubernetes-load-balancer-konfiguration-vorsicht-beim-drainen-von-nodes/

Basically, we had a problem with our ingress-nginx on a GKE cluster: the externalTrafficPolicy of its Load Balancer Service was set to Local, and we drained the nodes it was running on. As soon as the nodes were cordoned, the Service Controller removed them from the Load Balancer's backend pool, but the ingress-nginx Pods had not yet been restarted on a different node.

Due to the externalTrafficPolicy, Service traffic was not routed between nodes, so the GCP Load Balancer no longer had any healthy endpoints, even though the Pods were still running fine on the cordoned nodes. This made apps in the cluster unreachable from the Internet during that time.
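
For anyone who wants to check their own manifests for this combination, here is a minimal sketch. It assumes the Service manifest has already been parsed into a plain Python dict; the helper name `is_drain_sensitive` and the example manifest are hypothetical, and only the `spec.type` and `spec.externalTrafficPolicy` fields are taken from the Kubernetes Service spec.

```python
def is_drain_sensitive(service: dict) -> bool:
    """Return True for LoadBalancer Services that only serve node-local Pods.

    `service` is a parsed Service manifest. When externalTrafficPolicy is
    not set, Kubernetes defaults it to "Cluster".
    """
    spec = service.get("spec", {})
    return (
        spec.get("type") == "LoadBalancer"
        and spec.get("externalTrafficPolicy", "Cluster") == "Local"
    )


# Hypothetical manifest resembling the ingress-nginx Service described above.
ingress_service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "ingress-nginx", "namespace": "ingress-nginx"},
    "spec": {"type": "LoadBalancer", "externalTrafficPolicy": "Local"},
}

if is_drain_sensitive(ingress_service):
    print("Check where the Pods run before cordoning or draining nodes.")
```

A Service flagged this way is not wrong by itself; it just means that draining the nodes hosting its Pods deserves extra care.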

hikhvar commented 5 years ago

I think this is something to learn from. Until now I was not aware that ready pods on unschedulable nodes are also removed from the service endpoints.

hjacobs commented 5 years ago

Added.

marratj commented 5 years ago

Well, the Pods are not removed from the Service Endpoints, rather the nodes are removed from the Cloud Provider’s Load Balancer backend pool (GCP in our case).

This wouldn’t normally be an issue, as kube-proxy forwards any traffic to the correct nodes where the Pods are running.

But when externalTrafficPolicy is set to Local in the Service config (which it is by default in the ingress-nginx installation manifest for GCP), traffic will not get forwarded to other nodes. Because the ingress controller Pods were now running on nodes that were no longer part of the Load Balancer backend, and kube-proxy was not forwarding traffic to them, they were effectively dead.
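
To make the interaction concrete, here is a toy model of the behaviour described above. It is only a sketch of the reasoning in this thread, not of how the Service Controller or kube-proxy are implemented; the `Node` class, the `reachable` function, and the node names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    in_lb_pool: bool     # does the cloud load balancer still send traffic here?
    has_ready_pod: bool  # is a ready ingress Pod running on this node?


def reachable(nodes: list[Node], external_traffic_policy: str) -> bool:
    """Return True if external traffic can reach a ready Pod in this model."""
    entry_nodes = [n for n in nodes if n.in_lb_pool]
    if not entry_nodes:
        return False
    if external_traffic_policy == "Cluster":
        # kube-proxy forwards to a ready Pod on any node in the cluster.
        return any(n.has_ready_pod for n in nodes)
    # "Local": traffic is only served by Pods on the node it arrives at.
    return any(n.has_ready_pod for n in entry_nodes)


# The situation during the drain: both nodes running ingress Pods were
# cordoned and dropped from the backend pool; the remaining node has no Pod yet.
nodes = [
    Node("node-a", in_lb_pool=False, has_ready_pod=True),
    Node("node-b", in_lb_pool=False, has_ready_pod=True),
    Node("node-c", in_lb_pool=True, has_ready_pod=False),
]

print(reachable(nodes, "Local"))    # False
print(reachable(nodes, "Cluster"))  # True
```

With Cluster, the one node left in the pool can still forward traffic to the Pods on the cordoned nodes; with Local, it has nothing to serve, which matches the outage we saw.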

hjacobs commented 5 years ago

@marratj sounds like a proper English post about this would make sense?

marratj commented 5 years ago

@hjacobs agreed, I will do a translation later this week :)

marratj commented 5 years ago

@hjacobs I have created an English translation, hope you like it :) https://www.devops-hof.de/kubernetes-load-balancer-konfiguration-beware-when-draining-nodes/

hjacobs commented 5 years ago

Thanks, replaced the link to go to the English blog post.

hjacobs / kubernetes-failure-stories

Failure story length and language #6