voyagermesh / voyager

🚀 Secure L7/L4 (HAProxy) Ingress Controller for Kubernetes
https://voyagermesh.com
Apache License 2.0

HPA pod scaling causes 503 #1389

Open rmohammed-xtime opened 5 years ago

rmohammed-xtime commented 5 years ago

Hi,

We are using version 7.4.0.

It looks like 503s are being generated because the ingress configuration is not updated quickly enough when HPA scales pods up and down fairly rapidly.

We have turned off HPA.

Is there anything that can be done about the ingress configuration update to prevent the 503s?

Thanks, Riad
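One knob that can reduce this kind of churn, if the cluster supports the autoscaling/v2 (or v2beta2) API, is the HPA scale-down stabilization window, which spreads pod removals out instead of dropping several at once. A minimal sketch; the Deployment name web-app, the replica bounds, and the CPU target are placeholders, not values from this thread:

  apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  metadata:
    name: web-app
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: web-app                        # placeholder Deployment name
    minReplicas: 17
    maxReplicas: 30
    behavior:
      scaleDown:
        stabilizationWindowSeconds: 300    # wait 5 minutes before acting on a lower recommendation
        policies:
        - type: Pods
          value: 2
          periodSeconds: 60                # remove at most 2 pods per minute
    metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Note that this only slows the churn; it does not close the window in which HAProxy may still route to a pod that has already been terminated.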

mkozjak commented 5 years ago

You sure it's related to HPA? We're getting this even when not using it. For us Voyager never seems to sync its table of pods and sends requests to terminated ones. #1334

marceldegraaf commented 5 years ago

We're seeing 503 errors as well, but only during rolling restarts of our deployments. This is the RollingUpdate configuration in one of our deployments:

  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1

This means there must always be at least one healthy instance of our app available. However, during the rolling update there is a very short window in which a 503 Service Unavailable error is returned.

This seems to be related to the way the HAProxy reloader works: it can keep sending requests to killed pods for a short period of time, until it has reloaded its configuration.

@tamalsaha is this a known issue/limitation with how Voyager reloads HAProxy? Or am I doing it wrong? 🙂
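A workaround often used with proxies that pick up backend changes asynchronously is to delay pod shutdown with a preStop hook, so a terminating pod keeps answering requests until HAProxy has been reloaded without it. A rough sketch of the relevant pod-template fields; the container name, port, probe path, and sleep duration are assumptions, not taken from this thread:

  spec:
    template:
      spec:
        terminationGracePeriodSeconds: 60  # must be longer than the preStop delay
        containers:
        - name: app                        # placeholder container name
          lifecycle:
            preStop:
              exec:
                # Keep serving after the pod is marked Terminating, giving the
                # ingress controller time to reload HAProxy without this endpoint.
                # Assumes the image ships a shell with a sleep binary.
                command: ["/bin/sh", "-c", "sleep 30"]
          readinessProbe:
            httpGet:
              path: /healthz               # placeholder health endpoint
              port: 8080
            periodSeconds: 5

The readiness probe matters for the scale-up direction too: a new pod should only be added to the HAProxy backend list once it can actually serve traffic.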

rmohammed-xtime commented 5 years ago

I should have titled the defect better.

HPA was the cause of rapid up/down changes in pods that led to the 503s.

It looks like the 503s are possible with anything that changes pods.

@tamalsaha Is this still an issue with Voyager 10?

mkozjak commented 5 years ago

Anyone tried with Voyager 10?

kfoozminus commented 5 years ago

@rmohammed-xtime Did you happen to notice whether this problem occurs only when scaling down, or both up and down? Our guess is that it happens when pods are terminating (requests keep going to terminated pods for a brief period of time).

This issue is different from #1334 btw.

rmohammed-xtime commented 5 years ago

@kfoozminus Both up/down

We observed peak 503s at 5:52am, 5:56am, and 7:18am; below is the pod-count data around that time frame.

Number of pods over that period (the fluctuation in pod count is due to HPA):

5:45 – 5:50 AM: 17 pods (2 new at 5:48 AM)
5:50 – 5:55 AM: 19 pods (2 new at 5:52 AM)
5:55 – 6:00 AM: 24 pods (5 new at 5:59:30 AM)
6:00 – 6:05 AM: 28 pods (4 new at 6:03 AM)
6:05 – 6:10 AM: 30 pods (2 new at 6:07 AM)
6:10 – 6:15 AM: 30 pods (4 down at 6:14 AM)
6:15 – 6:20 AM: 30 pods (4 down at 6:14 AM)
6:20 – 6:25 AM: 25 pods (2 down at 6:15 AM)
6:25 – 6:30 AM: 21 pods (2 additional at 6:26 AM)
6:30 – 6:35 AM: 24 pods (3 additions at 6:31 AM)
6:35 – 6:40 AM: 30 pods (6 additions at 6:36 AM)
6:40 – 6:45 AM: 30 pods (2 down at 6:44 AM)
6:45 – 6:50 AM: 27 pods (1 down at 6:45 AM)
6:50 – 6:55 AM: 26 pods (3 down at 6:50 AM)
6:55 – 7:00 AM: 20 pods
7:05 – 7:10 AM: 22 pods (2 additions at 7:09 AM)
7:10 – 7:15 AM: 29 pods (7 additions at 7:14 AM)
7:15 – 7:20 AM: 29 pods
7:20 – 7:25 AM: 30 pods (1 addition at 7:22 AM)
7:25 – 7:30 AM: 30 pods
7:30 – 7:35 AM: 30 pods (4 down at 7:31 AM)

kfoozminus commented 5 years ago

@rmohammed-xtime Are you using HPA for the ingress pods too (or maybe running multiple ingress pods)?

rmohammed-xtime commented 5 years ago

@kfoozminus The pods listed in my previous comment https://github.com/appscode/voyager/issues/1389#issuecomment-497111914 are from the same deployment. HPA was only enabled for pods from that deployment and nothing else.

There is only a single ingress deployed. There are 5 voyager pods that get created.