kubernetes / ingress-nginx

Ingress NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

Tolerance to API unavailability #5667

Open zerkms opened 4 years ago

zerkms commented 4 years ago

At the moment if the API becomes unavailable (for any reason, eg outage) it takes about a minute or so before ingress-nginx unregisters all backends with the following message:

W0605 01:56:10.166870       9 controller.go:909] Service "ns/svc" does not have any active Endpoint.

Even though the endpoints are still up, running, and are perfectly available.

Would it not be better if ingress-nginx was running with the same set of endpoints while API is unavailable?

Is there currently another issue associated with this? Could not find one.

Does it require a particular kubernetes version? No

/kind feature

AndrewNeudegg commented 4 years ago

We are observing a similar pattern of behaviour.

Disruption to the API server causes instability in the nginx pods, regardless of the overall state of the cluster.

Some background: we are in the process of rotating some of the secrets used by the Kube API server because the client certificates will expire. To achieve this we need to restart the API servers, but when they restart, the bearer tokens that were originally issued for the service account become invalid.

Our current mitigation strategy is to stand up an API server and use hostAliases to override the kubernetes, kubernetes.default, kubernetes.default.svc.cluster.local, etc. hostnames on the ingress pods. But this is a real chore, since restarting the nginx ingress can trigger disruption, and we need to restart the pods twice (once to apply the override hostnames and once to remove them when we are done).
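
For illustration, the override is roughly the following when expressed with the k8s.io/api/core/v1 types; this is only a sketch of our workaround, and the stand-in IP is a placeholder:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// buildAPIServerHostAliases builds the hostAliases entry we patch onto the
// ingress-nginx pod spec so that the in-cluster API hostnames resolve to a
// stand-in API server while the real one is being restarted.
func buildAPIServerHostAliases(standInIP string) []corev1.HostAlias {
	return []corev1.HostAlias{
		{
			IP: standInIP, // placeholder address of the temporary API server
			Hostnames: []string{
				"kubernetes",
				"kubernetes.default",
				"kubernetes.default.svc",
				"kubernetes.default.svc.cluster.local",
			},
		},
	}
}

func main() {
	podSpec := corev1.PodSpec{
		HostAliases: buildAPIServerHostAliases("10.0.0.10"),
	}
	fmt.Printf("%+v\n", podSpec.HostAliases)
}
```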

Is the duration of time before unregistration configurable?

ElvinEfendi commented 4 years ago

@zerkms thanks for bringing this up. I think the controller should not consider the list of endpoints to be empty when the API server is down. Can you write an e2e test showing this? Then we can think about how to solve it.

cc @aledbf I think this is critical for availability.

aledbf commented 4 years ago

Let's separate the issues. In case of unavailability of the API server, you will see something like this:

I0611 02:56:30.099619       6 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
I0611 02:56:30.099975       6 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
I0611 02:56:30.100256       6 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
I0611 02:56:30.100260       6 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
I0611 02:56:30.100287       6 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
I0611 02:56:30.100490       6 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
E0611 02:56:30.101137       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:160: Failed to watch *v1.ConfigMap: Get "http://0.0.0.0:8001/api/v1/configmaps?allowWatchBookmarks=true&resourceVersion=9070852&timeout=8m52s&timeoutSeconds=532&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
W0611 02:56:30.100961       6 reflector.go:404] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:161: watch of *v1.Pod ended with: very short watch: k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:161: Unexpected watch close - watch lasted less than a second and no items received
W0611 02:56:30.101041       6 reflector.go:404] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:182: watch of *v1beta1.Ingress ended with: very short watch: k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:182: Unexpected watch close - watch lasted less than a second and no items received
W0611 02:56:30.101399       6 reflector.go:404] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:157: watch of *v1.Secret ended with: very short watch: k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:157: Unexpected watch close - watch lasted less than a second and no items received
E0611 02:56:30.101450       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:159: Failed to watch *v1.Service: Get "http://0.0.0.0:8001/api/v1/services?allowWatchBookmarks=true&resourceVersion=2820793&timeout=9m24s&timeoutSeconds=564&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:30.102176       6 reflector.go:178] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:161: Failed to list *v1.Pod: Get "http://0.0.0.0:8001/api/v1/namespaces/invalid-namespace/pods?resourceVersion=9069652": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:30.102191       6 reflector.go:178] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:182: Failed to list *v1beta1.Ingress: Get "http://0.0.0.0:8001/apis/networking.k8s.io/v1beta1/ingresses?resourceVersion=2862637": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:30.102592       6 reflector.go:178] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:157: Failed to list *v1.Secret: Get "http://0.0.0.0:8001/api/v1/secrets?resourceVersion=9070770": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:30.102835       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:158: Failed to watch *v1.Endpoints: Get "http://0.0.0.0:8001/api/v1/endpoints?allowWatchBookmarks=true&resourceVersion=9070859&timeout=8m2s&timeoutSeconds=482&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:31.102165       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:160: Failed to watch *v1.ConfigMap: Get "http://0.0.0.0:8001/api/v1/configmaps?allowWatchBookmarks=true&resourceVersion=9070852&timeout=6m58s&timeoutSeconds=418&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:31.103086       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:159: Failed to watch *v1.Service: Get "http://0.0.0.0:8001/api/v1/services?allowWatchBookmarks=true&resourceVersion=2820793&timeout=8m11s&timeoutSeconds=491&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:31.107797       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:158: Failed to watch *v1.Endpoints: Get "http://0.0.0.0:8001/api/v1/endpoints?allowWatchBookmarks=true&resourceVersion=9070859&timeout=8m24s&timeoutSeconds=504&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:31.743855       6 reflector.go:178] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:161: Failed to list *v1.Pod: Get "http://0.0.0.0:8001/api/v1/namespaces/invalid-namespace/pods?resourceVersion=9069652": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:31.947330       6 reflector.go:178] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:157: Failed to list *v1.Secret: Get "http://0.0.0.0:8001/api/v1/secrets?resourceVersion=9070770": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:32.010309       6 reflector.go:178] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:182: Failed to list *v1beta1.Ingress: Get "http://0.0.0.0:8001/apis/networking.k8s.io/v1beta1/ingresses?resourceVersion=2862637": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:32.103217       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:160: Failed to watch *v1.ConfigMap: Get "http://0.0.0.0:8001/api/v1/configmaps?allowWatchBookmarks=true&resourceVersion=9070852&timeout=6m54s&timeoutSeconds=414&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:32.104147       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:159: Failed to watch *v1.Service: Get "http://0.0.0.0:8001/api/v1/services?allowWatchBookmarks=true&resourceVersion=2820793&timeout=8m58s&timeoutSeconds=538&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:32.108573       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:158: Failed to watch *v1.Endpoints: Get "http://0.0.0.0:8001/api/v1/endpoints?allowWatchBookmarks=true&resourceVersion=9070859&timeout=6m49s&timeoutSeconds=409&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:33.104284       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:160: Failed to watch *v1.ConfigMap: Get "http://0.0.0.0:8001/api/v1/configmaps?allowWatchBookmarks=true&resourceVersion=9070852&timeout=9m41s&timeoutSeconds=581&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:33.105079       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:159: Failed to watch *v1.Service: Get "http://0.0.0.0:8001/api/v1/services?allowWatchBookmarks=true&resourceVersion=2820793&timeout=9m0s&timeoutSeconds=540&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:33.109347       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:158: Failed to watch *v1.Endpoints: Get "http://0.0.0.0:8001/api/v1/endpoints?allowWatchBookmarks=true&resourceVersion=9070859&timeout=5m55s&timeoutSeconds=355&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:34.104820       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:160: Failed to watch *v1.ConfigMap: Get "http://0.0.0.0:8001/api/v1/configmaps?allowWatchBookmarks=true&resourceVersion=9070852&timeout=7m18s&timeoutSeconds=438&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:34.105892       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:159: Failed to watch *v1.Service: Get "http://0.0.0.0:8001/api/v1/services?allowWatchBookmarks=true&resourceVersion=2820793&timeout=5m41s&timeoutSeconds=341&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:34.110270       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:158: Failed to watch *v1.Endpoints: Get "http://0.0.0.0:8001/api/v1/endpoints?allowWatchBookmarks=true&resourceVersion=9070859&timeout=7m29s&timeoutSeconds=449&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:35.105586       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:160: Failed to watch *v1.ConfigMap: Get "http://0.0.0.0:8001/api/v1/configmaps?allowWatchBookmarks=true&resourceVersion=9070852&timeout=8m23s&timeoutSeconds=503&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:35.106736       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:159: Failed to watch *v1.Service: Get "http://0.0.0.0:8001/api/v1/services?allowWatchBookmarks=true&resourceVersion=2820793&timeout=7m53s&timeoutSeconds=473&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:35.111164       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:158: Failed to watch *v1.Endpoints: Get "http://0.0.0.0:8001/api/v1/endpoints?allowWatchBookmarks=true&resourceVersion=9070859&timeout=5m5s&timeoutSeconds=305&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:35.646766       6 reflector.go:178] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:161: Failed to list *v1.Pod: Get "http://0.0.0.0:8001/api/v1/namespaces/invalid-namespace/pods?resourceVersion=9069652": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:36.069955       6 reflector.go:178] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:182: Failed to list *v1beta1.Ingress: Get "http://0.0.0.0:8001/apis/networking.k8s.io/v1beta1/ingresses?resourceVersion=2862637": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:36.106781       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:160: Failed to watch *v1.ConfigMap: Get "http://0.0.0.0:8001/api/v1/configmaps?allowWatchBookmarks=true&resourceVersion=9070852&timeout=9m10s&timeoutSeconds=550&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:36.107470       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:159: Failed to watch *v1.Service: Get "http://0.0.0.0:8001/api/v1/services?allowWatchBookmarks=true&resourceVersion=2820793&timeout=8m12s&timeoutSeconds=492&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:36.112088       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:158: Failed to watch *v1.Endpoints: Get "http://0.0.0.0:8001/api/v1/endpoints?allowWatchBookmarks=true&resourceVersion=9070859&timeout=9m20s&timeoutSeconds=560&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:36.431501       6 reflector.go:178] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:157: Failed to list *v1.Secret: Get "http://0.0.0.0:8001/api/v1/secrets?resourceVersion=9070770": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:37.107723       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:160: Failed to watch *v1.ConfigMap: Get "http://0.0.0.0:8001/api/v1/configmaps?allowWatchBookmarks=true&resourceVersion=9070852&timeout=7m2s&timeoutSeconds=422&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:37.108695       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:159: Failed to watch *v1.Service: Get "http://0.0.0.0:8001/api/v1/services?allowWatchBookmarks=true&resourceVersion=2820793&timeout=7m59s&timeoutSeconds=479&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:37.112833       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:158: Failed to watch *v1.Endpoints: Get "http://0.0.0.0:8001/api/v1/endpoints?allowWatchBookmarks=true&resourceVersion=9070859&timeout=5m14s&timeoutSeconds=314&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:38.108760       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:160: Failed to watch *v1.ConfigMap: Get "http://0.0.0.0:8001/api/v1/configmaps?allowWatchBookmarks=true&resourceVersion=9070852&timeout=6m19s&timeoutSeconds=379&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:38.109533       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:159: Failed to watch *v1.Service: Get "http://0.0.0.0:8001/api/v1/services?allowWatchBookmarks=true&resourceVersion=2820793&timeout=8m10s&timeoutSeconds=490&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:38.113538       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:158: Failed to watch *v1.Endpoints: Get "http://0.0.0.0:8001/api/v1/endpoints?allowWatchBookmarks=true&resourceVersion=9070859&timeout=9m13s&timeoutSeconds=553&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:38.294058       6 leaderelection.go:320] error retrieving resource lock invalid-namespace/ingress-controller-leader-nginx: Get "http://0.0.0.0:8001/api/v1/namespaces/invalid-namespace/configmaps/ingress-controller-leader-nginx": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:39.109905       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:160: Failed to watch *v1.ConfigMap: Get "http://0.0.0.0:8001/api/v1/configmaps?allowWatchBookmarks=true&resourceVersion=9070852&timeout=8m47s&timeoutSeconds=527&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:39.110554       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:159: Failed to watch *v1.Service: Get "http://0.0.0.0:8001/api/v1/services?allowWatchBookmarks=true&resourceVersion=2820793&timeout=8m25s&timeoutSeconds=505&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:39.114372       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:158: Failed to watch *v1.Endpoints: Get "http://0.0.0.0:8001/api/v1/endpoints?allowWatchBookmarks=true&resourceVersion=9070859&timeout=8m7s&timeoutSeconds=487&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:40.110786       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:160: Failed to watch *v1.ConfigMap: Get "http://0.0.0.0:8001/api/v1/configmaps?allowWatchBookmarks=true&resourceVersion=9070852&timeout=5m22s&timeoutSeconds=322&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:40.111850       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:159: Failed to watch *v1.Service: Get "http://0.0.0.0:8001/api/v1/services?allowWatchBookmarks=true&resourceVersion=2820793&timeout=8m44s&timeoutSeconds=524&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:40.115094       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:158: Failed to watch *v1.Endpoints: Get "http://0.0.0.0:8001/api/v1/endpoints?allowWatchBookmarks=true&resourceVersion=9070859&timeout=6m9s&timeoutSeconds=369&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:41.111831       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:160: Failed to watch *v1.ConfigMap: Get "http://0.0.0.0:8001/api/v1/configmaps?allowWatchBookmarks=true&resourceVersion=9070852&timeout=5m14s&timeoutSeconds=314&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:41.112632       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:159: Failed to watch *v1.Service: Get "http://0.0.0.0:8001/api/v1/services?allowWatchBookmarks=true&resourceVersion=2820793&timeout=6m26s&timeoutSeconds=386&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:41.115883       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:158: Failed to watch *v1.Endpoints: Get "http://0.0.0.0:8001/api/v1/endpoints?allowWatchBookmarks=true&resourceVersion=9070859&timeout=9m51s&timeoutSeconds=591&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:42.112808       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:160: Failed to watch *v1.ConfigMap: Get "http://0.0.0.0:8001/api/v1/configmaps?allowWatchBookmarks=true&resourceVersion=9070852&timeout=9m9s&timeoutSeconds=549&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:42.113817       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:159: Failed to watch *v1.Service: Get "http://0.0.0.0:8001/api/v1/services?allowWatchBookmarks=true&resourceVersion=2820793&timeout=8m59s&timeoutSeconds=539&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused
E0611 02:56:42.116610       6 reflector.go:382] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:158: Failed to watch *v1.Endpoints: Get "http://0.0.0.0:8001/api/v1/endpoints?allowWatchBookmarks=true&resourceVersion=9070859&timeout=5m11s&timeoutSeconds=311&watch=true": dial tcp 0.0.0.0:8001: connect: connection refused

When the API server is available again, you will no longer see such output.

aledbf commented 4 years ago

Would it not be better if ingress-nginx was running with the same set of endpoints while API is unavailable?

When the API server is not available, there are no changes in the nginx model. This means that if some of the pods running your application die, the ingress controller will not be aware of that event and will keep trying to send traffic to them.
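
For context, the controller builds its model from client-go informer caches rather than from direct API calls, and a lister keeps returning the last-known objects after the watch breaks. A minimal sketch of that pattern (kubeconfig path, namespace and service name are placeholders):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from a kubeconfig (placeholder path).
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Shared informer factory with a resync period, as controllers typically use.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	endpointsLister := factory.Core().V1().Endpoints().Lister()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Even if the API server becomes unreachable after this point, the lister
	// keeps returning the last-known (possibly stale) Endpoints object; it
	// does not start returning an empty list or "not found".
	ep, err := endpointsLister.Endpoints("ns").Get("svc")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("cached endpoint subsets:", len(ep.Subsets))
}
```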

aledbf commented 4 years ago

Some background: we are in the process of rotating some of the secrets used by the Kube API server because the client certificates will expire. To achieve this we need to restart the API servers, but when they restart, the bearer tokens that were originally issued for the service account become invalid.

There is no support in client-go to "reload" service account tokens. You need to restart the pod. Not sure if there is any support for this scenario.

zerkms commented 4 years ago

When the API server is not available, there are no changes in the nginx model. This means that if some of the pods running your application die, the ingress controller will not be aware of that event and will keep trying to send traffic to them.

I'm not sure I understand.

When the API server is not available, ingress-nginx stops sending traffic anywhere because it believes the endpoints have died (while they are still available).

aledbf commented 4 years ago

When the API server is not available, ingress-nginx stops sending traffic anywhere because it believes the endpoints have died (while they are still available).

Please post the logs of the ingress controller pod when this happens.

zerkms commented 4 years ago

It outputs these lines (one per service, only one shown) when it happens:

W0612 02:25:19.114102       6 controller.go:909] Service "rook-ceph/rook-ceph-mgr-dashboard" does not have any active Endpoint.

Yet, if I try to request this service from the cli:

# curl -v 10.53.185.114:7000
* Rebuilt URL to: 10.53.185.114:7000/
*   Trying 10.53.185.114...
* TCP_NODELAY set
* Connected to 10.53.185.114 (10.53.185.114) port 7000 (#0)
> GET / HTTP/1.1
> Host: 10.53.185.114:7000
> User-Agent: curl/7.58.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Length: 1162
< Content-Language: en-US
< Accept-Ranges: bytes
< Vary: Accept-Language, Accept-Encoding
< Server: CherryPy/3.2.2
< Last-Modified: Mon, 02 Mar 2020 17:55:01 GMT
< Date: Fri, 12 Jun 2020 02:28:47 GMT
< Content-Type: text/html;charset=utf-8
< 
<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Ceph</title>

It responds successfully.

djagannath-asapp commented 4 years ago

We had to roll our masters, and the IPs for our Kubernetes API went stale due to a DNS issue. ingress-nginx couldn't talk to the API and then cleared out all of the backends.

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

zerkms commented 4 years ago

/remove-lifecycle stale

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

zerkms commented 3 years ago

/remove-lifecycle stale

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

zerkms commented 3 years ago

/remove-lifecycle stale

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

zerkms commented 3 years ago

/remove-lifecycle stale

strongjz commented 3 years ago

/assign @rikatz

strongjz commented 3 years ago

/lifecycle active

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

manzsolutions-lpr commented 2 years ago

/remove-lifecycle stale

rikatz commented 2 years ago

/lifecycle frozen

iamNoah1 commented 2 years ago

/triage accepted
/priority important-longterm

rikatz commented 2 years ago

This seems like something that will be doable after splitting the control plane from the data plane.

The control plane will keep a state of the current configuration. Every time the data plane fetches it, the same config will be there.

The downside is that we still cannot deal with interruptions of the Kubernetes API server itself, but then we can think about another approach, like:

Data Plane -gRPC-> Kubernetes Service -> control plane (1-N) -> Kubernetes API Server (1-N)
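
Purely as a sketch of the idea (none of these types exist in ingress-nginx today), the data-plane side could look like this, the important property being that a dropped stream leaves the last applied snapshot in place:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// ConfigSnapshot is a hypothetical unit of configuration the control plane
// would push to data-plane pods; the type and its fields are illustrative only.
type ConfigSnapshot struct {
	Version  string
	Backends map[string][]string // upstream name -> endpoint addresses
}

// ConfigSource is what the data plane consumes; in the proposal above it would
// be backed by a gRPC stream through the Kubernetes Service in front of the
// control-plane replicas.
type ConfigSource interface {
	Watch(ctx context.Context) <-chan ConfigSnapshot
}

// dataPlaneLoop applies snapshots as they arrive and keeps the last applied
// one if the stream stops, so an outage leaves the current backends untouched
// instead of wiping them.
func dataPlaneLoop(ctx context.Context, src ConfigSource) {
	var current ConfigSnapshot
	updates := src.Watch(ctx)
	for {
		select {
		case snap, ok := <-updates:
			if !ok {
				fmt.Println("stream closed; keeping version", current.Version)
				return
			}
			current = snap
			fmt.Println("applied version", current.Version)
		case <-ctx.Done():
			return
		}
	}
}

// fakeSource simulates a control plane that delivers one snapshot and then
// disconnects, standing in for an outage.
type fakeSource struct{}

func (fakeSource) Watch(ctx context.Context) <-chan ConfigSnapshot {
	ch := make(chan ConfigSnapshot)
	go func() {
		ch <- ConfigSnapshot{Version: "v1", Backends: map[string][]string{"ns/svc": {"10.0.0.5:8080"}}}
		time.Sleep(100 * time.Millisecond)
		close(ch) // simulated disconnect
	}()
	return ch
}

func main() {
	dataPlaneLoop(context.Background(), fakeSource{})
}
```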

strongjz commented 2 years ago

/help-wanted

iamNoah1 commented 2 years ago

/help

k8s-ci-robot commented 2 years ago

@iamNoah1: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

Why are we solving this issue?
To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
Does this issue have zero to low barrier of entry?
How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/kubernetes/ingress-nginx/issues/5667):

> /help

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

essh commented 2 years ago

Also seeing this across some AKS clusters that sometimes briefly (for a few minutes) lose connectivity to the control plane, resulting in the same Service "zzz" does not have any active Endpoint. messages.

This manifests as 503 errors being served when hitting an Ingress URL until the connection with the control plane is re-established and the appropriate endpoints are populated again.

nvanheuverzwijn commented 2 years ago

This will also happen if you have a service pointing to a deleted deployment. The endpoints are effectively gone and the Service "x" does not have any active Endpoint. log lines start to appear. Nginx becomes very unstable, throwing 503s left and right. I know we should not have a service pointing to a deleted deployment, but hey, this happened to us :(

dippynark commented 1 year ago

We also hit this issue on GKE, although I'm confused about what's going on, because the NGINX Ingress Controller's nginx.conf generation logic seems to discover all Kubernetes resources through informer caches, so I would expect an API server disconnect to just prevent updates (rather than wiping all existing endpoints).

Could the informers be clearing their local cache after a disconnect?
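
For anyone who wants to verify this, a small client-go experiment along these lines should answer it: run it against a cluster, take down the API server (or its load balancer), and watch whether the cached count changes (the kubeconfig path is a placeholder):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; point it at the cluster under test.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(client, 0)
	slices := factory.Discovery().V1().EndpointSlices()

	// Surface watch errors so API server disconnects are visible in the output.
	_ = slices.Informer().SetWatchErrorHandler(func(_ *cache.Reflector, err error) {
		fmt.Println("watch error:", err)
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Periodically dump the cache size. Take down the API server (or its load
	// balancer) while this runs: if the count stays constant, the informer
	// cache is not being cleared on disconnect.
	for {
		list, err := slices.Lister().List(labels.Everything())
		fmt.Printf("cached EndpointSlices: %d (list err: %v)\n", len(list), err)
		time.Sleep(5 * time.Second)
	}
}
```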

rstedman commented 10 months ago

I hit the issue described here with an on-prem Kubernetes deployment, and spent time trying to determine what was going on using a custom build of ingress-nginx with additional logging. What I've found is that ingress-nginx does seem to behave properly when the control plane is down and will not drop any backends. However, there are scenarios where the control plane can appear to be down externally when it is not, and those can cause the cluster to get into a state where the ingress drops all backends.

For context - we bootstrap our clusters using kubeadm, and for high availability of the control plane we use an external load balancer (haproxy) which all the kubelets are configured to talk to the apiserver through (https://github.com/kubernetes/kubeadm/blob/main/docs/ha-considerations.md).

When I was first working on reproducing this issue, I tried to reproduce the control plane being down by just stopping haproxy, simply because it was faster. From the logging I added I could see that after about a minute the EndpointSlice informer ingress-nginx uses was getting watch events showing the endpoint slices changing to Not Ready, and once there were no ready EndpointSlices for an ingress it would drop the backend for that ingress resource.

Because we weren't passing the --kubeconfig or --master flags to ingress-nginx, it uses the in-cluster config when constructing the Kubernetes client, which uses the KUBERNETES_SERVICE_HOST environment variable and talks to the apiserver through the kubernetes service instead of the external load balancer. Around a minute or so after bringing down the external load balancer, Kubernetes marks all the nodes as unreachable because it stops receiving health checks from the kubelets, which causes all the EndpointSlices for those nodes to be marked not ready; the ingress sees this and drops the backends because no endpoint slices are ready.
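
To illustrate the mechanism (a simplified sketch, not the controller's actual code): once every endpoint in a slice is marked not ready, a readiness filter like the one below returns nothing, and the backend ends up with no upstreams:

```go
package main

import (
	"fmt"

	discoveryv1 "k8s.io/api/discovery/v1"
)

// readyAddresses returns the addresses of endpoints whose Ready condition is
// true (or unset, which Kubernetes treats as ready). This mirrors the kind of
// filtering an ingress controller applies: when endpoints get marked NotReady
// because their nodes are considered unreachable, every address is filtered
// out and the backend is left with no upstreams.
func readyAddresses(slices []discoveryv1.EndpointSlice) []string {
	var out []string
	for _, slice := range slices {
		for _, ep := range slice.Endpoints {
			if ep.Conditions.Ready != nil && !*ep.Conditions.Ready {
				continue // skip endpoints explicitly marked not ready
			}
			out = append(out, ep.Addresses...)
		}
	}
	return out
}

func main() {
	ready, notReady := true, false
	slices := []discoveryv1.EndpointSlice{{
		Endpoints: []discoveryv1.Endpoint{
			{Addresses: []string{"10.0.0.5"}, Conditions: discoveryv1.EndpointConditions{Ready: &ready}},
			{Addresses: []string{"10.0.0.6"}, Conditions: discoveryv1.EndpointConditions{Ready: &notReady}},
		},
	}}
	fmt.Println(readyAddresses(slices)) // [10.0.0.5]
}
```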

When I changed my testing approach to kill the apiserver pod and prevent it from starting up again, I found that ingress-nginx behaved as expected: it kept using the cached EndpointSlices indefinitely until the apiserver came back up and it was able to reconnect.

It seems possible that this is what others are encountering. The control plane appears to be down because whatever external load balancer is in use is down or unreachable, which also causes the control plane to see all of its nodes as unhealthy because they can't push their health checks, resulting in all EndpointSlice resources getting marked not ready and the ingress dropping all backends. External tools (e.g. kubectl) accessing the apiserver are likely also configured to use the external load balancer, which makes it look like the control plane is completely down.