voyagermesh / voyager

šŸš€ Secure L7/L4 (HAProxy) Ingress Controller for Kubernetes
https://voyagermesh.com
Apache License 2.0

Increasing 503 error codes over time #1334

Closed: mcasperson closed this issue 3 years ago

mcasperson commented 5 years ago

We have a cluster that has experienced an error where requests increasingly return 503 HTTP codes over time.

The image below shows the errors generated by requests to a K8S cluster with Voyager. The cluster had been up for a number of days, and then, starting at around 11:00, requests began returning 503s (ignore the peak around 9:00; that was an unrelated problem). At about 15:30 the Voyager ingress controller pods were deleted and restarted, and the errors went away.

We had 2 Voyager ingress controller pods, which would explain the zigzag pattern: each pod returns an increasing (but not identical) number of 503 responses.

(image: graph of 503 error counts over time)

I've experienced this problem a few times since, and deleting the Voyager pods always resolves the issue with no additional changes to the cluster.

The log files from the Voyager ingress controller pods don't show anything that indicates an error, other than listing the 503 error responses themselves.

I can set up a cron job to recycle the Voyager pods easily enough, which does seem to prevent the problem from happening.
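
For reference, a minimal sketch of such a recycling job, assuming the controller pods carry a label like origin=voyager (check your own deployment's labels) and that a ServiceAccount with permission to delete pods is wired up separately; this is an illustration, not something Voyager sets up for you:

# hypothetical sketch: recycle the Voyager ingress controller pods nightly
# note: the CronJob's pod still needs a ServiceAccount with RBAC to delete pods,
# which `kubectl create cronjob` does not configure for you
kubectl create cronjob voyager-recycle \
  --image=bitnami/kubectl:latest \
  --schedule="0 3 * * *" \
  -- kubectl delete pod -l origin=voyager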

So I have two questions:

  1. Has anyone else experienced these 503 errors?
  2. Is there any downside to using cron to delete the Voyager ingress controller pods?

We are using Voyager 7.40 in AKS, and routing TCP and HTTP traffic.

mkozjak commented 5 years ago

I have this issue on v9 too. For some reason voyager ingress still sends requests to terminated backend pods, even though they aren't in /etc/haproxy/haproxy.cfg anymore.

These are the failing logs:

local0.info: May  8 11:56:38 haproxy[7694]: xxx:39504 [08/May/2019:11:56:23.183] http-0_0_0_0-443~ collection.default:8080/pod-collection-b88c7bdb5-rqtlm 0/0/-1/-1/15495 503 212 - - SC-- 2/2/1/0/3 0/0 "GET /status HTTP/1.1"
local0.info: May  8 11:56:52 haproxy[7694]: xxx:56866 [08/May/2019:11:56:37.138] http-0_0_0_0-443~ collection.default:8080/pod-collection-b88c7bdb5-6qrdh 0/0/-1/-1/15380 503 212 - - SC-- 2/2/1/0/3 0/0 "GET /agentRecommendedListings?userId=xxx-xxx HTTP/1.1"
local0.info: May  8 11:56:57 haproxy[7694]: xxx:56946 [08/May/2019:11:56:41.690] http-0_0_0_0-443~ collection.default:8080/pod-collection-b88c7bdb5-md8w9 0/0/-1/-1/15387 503 212 - - SC-- 2/2/1/0/3 0/0 "GET /query/savedListings?userId=xxx-8 HTTP/1.1"
local0.info: May  8 11:57:10 haproxy[7694]: xxx:57074 [08/May/2019:11:56:55.154] http-0_0_0_0-443~ collection.default:8080/pod-collection-b88c7bdb5-rqtlm 0/0/-1/-1/15444 503 212 - - SC-- 2/2/0/0/3 0/0 "GET /query/listing-set?first=10&listingId=xxx-5&userId=xxx-8 HTTP/1.1"

The backend pods shown in the logs are long gone and new ones were deployed; Voyager kept sending requests to the old ones, though.

kubectl get pods -l run=collection
NAME                          READY     STATUS    RESTARTS   AGE
collection-794fff995f-5kwtk   1/1       Running   0          1h
collection-794fff995f-c2nmb   1/1       Running   0          1h
collection-794fff995f-mz7jr   1/1       Running   0          1h
collection-794fff995f-rr5ln   1/1       Running   0          1h

New backend pods brought up after scaling the entire voyager deployment didn't produce any 503 SC-- errors either, obviously:

local0.info: May  8 12:03:26 haproxy[7946]: xxx:59730 [08/May/2019:12:03:26.025] http-0_0_0_0-443~ collection.default:8080/pod-collection-794fff995f-c2nmb 0/0/1/9/10 200 792 - - ---- 2/2/0/0/0 0/0 "GET /query/savedListings?userId=xxx-xxx-8 HTTP/1.1"

ConfigMap also seems to be synced:

    xxx:8080     ssl verify none     \nbackend collection.default:8080\n\tserver
    pod-collection-794fff995f-5kwtk xxx:8080     ssl verify none     \n\tserver
    pod-collection-794fff995f-c2nmb xxx:8080     ssl verify none     \n\tserver
    pod-collection-794fff995f-mz7jr xxx:8080     ssl verify none     \n\tserver
    pod-collection-794fff995f-rr5ln xxx:8080     ssl verify none     \nbackend

@tamalsaha: Maybe any updates/thoughts on this?

mkozjak commented 5 years ago

Happened again. We're using Google Kubernetes Engine with Voyager 9.0.0.

kfoozminus commented 5 years ago

@mkozjak

For some reason voyager ingress still sends requests to terminated backend pods, even though they aren't in /etc/haproxy/haproxy.cfg anymore.

How did you check that? By exec'ing into the voyager pod, or by describing the ingress's configmap?

mkozjak commented 5 years ago

@kfoozminus The ingress's configmap values were updated, and haproxy.cfg was fine too (yes, I exec'd into the pod). I tailed the voyager pod logs and it would still hit backend pods that weren't alive anymore. It feels like voyager keeps something in memory that never fully syncs (just a guess :))
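
For anyone trying to verify the same thing, a rough sketch of the checks involved; the pod and ConfigMap names below are placeholders, while the collection service name comes from the logs above:

# rendered config inside the ingress controller pod
kubectl exec <voyager-ingress-pod> -- grep 'server pod-' /etc/haproxy/haproxy.cfg

# generated ConfigMap for the ingress
kubectl get configmap <ingress-configmap> -o yaml | grep 'server pod-'

# endpoints Kubernetes currently has for the backend service
kubectl get endpoints collection -o wide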

kfoozminus commented 5 years ago

@mkozjak Does this problem go away after a certain period of time automatically? Or do you have to delete the controller pod every time?

I'm trying to reproduce the problem. Any details that can help me do that are much appreciated :)

mkozjak commented 5 years ago

@kfoozminus You mean, does it go away? No. And it starts happening on its own at random intervals. Maybe @mcasperson has more info.

kfoozminus commented 5 years ago

@mcasperson

I've experienced this problem a few times since, and deleting the Voyager pods always resolves the issue with no additional changes to the cluster.

Does this problem never go away on its own? Or does it just take too long to resolve, so you delete the pod to avoid downtime?

kfoozminus commented 5 years ago

@mkozjak

And it starts happening on its own at random intervals.

So here's what I gathered:

Is that ok?

@mcasperson Is it the same in your case?

mkozjak commented 5 years ago

@kfoozminus all good, except that the traffic also goes to new ones. Sometimes it chooses new ones and sometimes old ones (3rd bullet).

kfoozminus commented 5 years ago

@kfoozminus all good, except that the traffic also goes to new ones. Sometimes it chooses new ones and sometimes old ones (3rd bullet).

But the haproxy.cfg contains only the new pods?

mkozjak commented 5 years ago

Exactly!

kfoozminus commented 5 years ago

Ok, thanks. I'm trying to reproduce this with load testing and restarting the backend pods over and over.
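
A rough sketch of that kind of reproduction loop, with hey as one example load generator; the hostname is a placeholder, and the run=collection label and /status path come from earlier in this thread:

# sustained load against the ingress for 10 minutes
hey -z 10m https://<ingress-host>/status &

# meanwhile, keep churning the backend pods so HAProxy reloads repeatedly
while true; do
  kubectl delete pod -l run=collection --wait=false
  sleep 30
done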

kfoozminus commented 5 years ago

@mcasperson @mkozjak While we try to understand the problem: if it happens again, maybe we can arrange a virtual meeting to see what's going wrong.

mkozjak commented 5 years ago

Yes, that'd be great.

kfoozminus commented 5 years ago

@mkozjak Are you using the default alpine-based haproxy image, or did you specify --haproxy-image-tag while installing voyager?

mkozjak commented 5 years ago

@kfoozminus

helm install appscode/voyager --name voyager-operator --version 9.0.0 \
--namespace kube-system --set cloudProvider=gke \
--set apiserver.enableValidatingWebhook=false

mkozjak commented 5 years ago

@kfoozminus Want me to upgrade to 10.x?

kfoozminus commented 5 years ago

No it's ok.

mkozjak commented 5 years ago

@kfoozminus Note that I'll soon need to update to 10.x, so maybe try running tests on that version as well.

mkozjak commented 5 years ago

@kfoozminus The issue is currently happening on my cluster again. Any way to do a live debugging session together? I'll need to restart everything in half an hour.

mkozjak commented 5 years ago

@mcasperson What did you do to tackle this one in the meantime?

mkozjak commented 5 years ago

@kfoozminus Upgraded to 10.x, I guess.

kubectl exec -it voyager-voyager-operator-76778cc5f9-4ctln -n kube-system voyager version

Version = 10.0.0
VersionStrategy = tag
Os = alpine
Arch = amd64
CommitHash = b866601fd012b57feb048e8eb3caa954ae5280af
GitBranch = release-10.0
GitTag = 10.0.0
CommitTimestamp = 2019-04-29T00:07:03

A really confusing part is:

kubectl describe pod voyager-k8s-ingress-66f7d879fd-2lfv9

Image:         appscode/haproxy:1.9.2-9.0.0-alpine

Shouldn't it be appscode/haproxy:1.9.6-10.0.0-alpine?

kfoozminus commented 5 years ago

@mkozjak sorry I wasn't online. The problem is gone by now, I guess? :(

mkozjak commented 5 years ago

Yup.

tamalsaha commented 5 years ago

Hi Mario, our offices are closed this week for the Eid vacation. We will be back to our regular schedule from next Monday (June 10) and will pick up this issue once we are back.

Regards, Tamal

lanserver commented 5 years ago

Hey everyone, just wanted to mention that I'm having a very similar issue, if not the exact same one. Everything works great until large amounts of traffic are pushed through the ingress; it then keeps working properly for 10-15 minutes before intermittently throwing 503s.

I'm using Azure Kubernetes Service 1.14.0 running Voyager 10, pointed at external backends (Azure App Services). Nothing else has trouble reaching these services when we see this issue. Unfortunately, I can't confirm whether the issue scales out in the same way, since I'm only able to replicate it in prod. If there's anything I can do to help with this issue, I'm happy to help.

lanserver commented 5 years ago

Hey guys, just curious if you have any updates on this issue. Thanks!

kfoozminus commented 5 years ago

We currently suspect this is caused by HAProxy's reload behavior combined with backend pods being terminated without a grace period. I'm writing a detailed doc about it. So far, though, we've only been able to reproduce a version of this issue that goes away after a while (basically, new requests eventually stop getting 503s).
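
As a hedged illustration of the "terminated without grace" part (a common mitigation, not confirmed as the fix for this issue): giving backend pods a short preStop sleep lets in-flight HAProxy connections drain before the container receives SIGTERM. The deployment name and container index below are examples:

# add a 15s drain window to the first container of the backend deployment
kubectl patch deployment collection --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/lifecycle",
   "value": {"preStop": {"exec": {"command": ["sh", "-c", "sleep 15"]}}}}
]'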

Are you facing this problem right now?

lanserver commented 5 years ago

Yes, we're able to reproduce this issue any time we ramp up traffic to our ingress. We have it down for the time being but I can bring it back up if you need something.

kfoozminus commented 5 years ago

Yes, that'd be great. Join our slack: https://appscode.slack.com/

lanserver commented 5 years ago

Great, thanks! Just messaged you over there.

mkozjak commented 5 years ago

@kfoozminus @lanserver Any updates on this one? In which Slack channel did you do the debugging, if not in a direct message? Thanks in advance!

kfoozminus commented 5 years ago

@mkozjak we used direct message. His issue seems different from this though.

mkozjak commented 5 years ago

@kfoozminus So it was something different from sudden 503s after certain pods die, and 200s after the ingress controller pods are restarted?

kfoozminus commented 5 years ago

Yeah.

mkozjak commented 5 years ago

Well, ok. The next time I reproduce the issue I'm going to try sending a SIGUSR2 signal to the haproxy process so it reloads its configuration at runtime, to see if that solves the issue. Can you please continue your tests on this internally, @kfoozminus?
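
For reference, a sketch of how that could be attempted from outside the pod; the pod name and PID are placeholders, and note that SIGUSR2 is only the documented reload signal for haproxy's master process in master-worker mode, so whether it has any effect here depends on how the controller launches haproxy:

# list the haproxy processes inside the ingress controller pod
kubectl exec <voyager-ingress-pod> -- ps | grep '[h]aproxy'

# send SIGUSR2 to the chosen process
kubectl exec <voyager-ingress-pod> -- kill -SIGUSR2 <pid>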

kfoozminus commented 5 years ago

@mkozjak Yes, of course! We are continuously testing and trying to reproduce this! Were you able to see how many haproxy processes were running at that moment?

mkozjak commented 5 years ago

@kfoozminus In one pod, or do you mean the number of pods? There were 4 pods running at the time.

kfoozminus commented 5 years ago

Next time it happens, give us the output of ps aux | grep haproxy from inside the voyager pod; I want to see how many haproxy processes are running. I'd still prefer to do a live session, though.

kfoozminus commented 5 years ago

kubectl describe pod voyager-k8s-ingress-66f7d879fd-2lfv9

Image: appscode/haproxy:1.9.2-9.0.0-alpine

Shouldn't it be appscode/haproxy:1.9.6-10.0.0-alpine?

@mkozjak yes, it should be. What exact command did you use to upgrade to 10.0.0?

mkozjak commented 5 years ago

@kfoozminus

helm upgrade voyager-operator appscode/voyager --version 10.0.0 \
--namespace kube-system --set cloudProvider=gke \
--set apiserver.enableValidatingWebhook=false

mkozjak commented 5 years ago

@kfoozminus: Maybe I did something wrong with the update procedure?

kfoozminus commented 5 years ago

I upgraded with the same command. Mine shows appscode/haproxy:1.9.6-10.0.0-alpine. Does yours still show appscode/haproxy:1.9.2-9.0.0-alpine?

mkozjak commented 5 years ago

It seems to have pulled the latest one in the end, since I restart the ingress all the time:

appscode/haproxy:1.9.6-10.0.0-alpine

kfoozminus commented 5 years ago

After upgrading, it takes a few seconds to update the image (the time the new voyager operator needs to reach the running state and update all the ingress deployments).
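
One way to confirm which image the ingress pods actually ended up with once the operator settles; the origin=voyager label and the namespace default are assumptions, adjust to your setup:

kubectl get pods -l origin=voyager \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'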

mkozjak commented 5 years ago

@kfoozminus: Any luck reproducing the issue?

Simon3 commented 5 years ago

Hi, it seems like the issue is still present in version 10.0.0. I got several 503 errors on frontend files today right after upgrading my stack, and killing the Voyager pod solved the issue.

mkozjak commented 5 years ago

@Simon3: next time you reproduce it, can you please try running ps aux | grep haproxy and sending SIGUSR2 to the haproxy process inside the pod, instead of killing the pod, and see if it starts working again? Thanks!

Simon3 commented 5 years ago

Sure I will if it happens again.

Simon3 commented 5 years ago

Hey, I'm still on Voyager 10.0.0 and the bug keeps occurring, roughly twice a month. We are doing lots of helm upgrades, though, around 20 per day. I tried the SIGUSR2 signal, but it didn't change anything. Here is my history in case I missed something:

bash-4.4# ps | grep haproxy
    9 root      0:00 runsv haproxy-controller
   12 root     43:03 voyager haproxy-controller --enable-analytics=true --burst=1000000 --cloud-provider=gke --ingress-api-version=voyager.appscode.com/v1beta1 --ingress-name=flowr-527-rc-latest-ingress --qps=1e+06 --logtostderr=false --alsologtostderr=false --v=3 --stderrthreshold=0
 1002 root      0:10 haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid
 1051 root      0:03 haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -x /var/run/haproxy.sock -sf 1048
 1057 root      0:01 haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -x /var/run/haproxy.sock -sf 1054
bash-4.4# 
bash-4.4# kill -SIGUSR2 12
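
One detail worth checking in a transcript like this: the processes started with -sf are newer haproxy instances that tell the listed old PID to finish its existing connections and exit, so a long-lived connection can still be served by an older process running an older config. A small sketch (run inside the pod; netstat availability depends on the image) to see which processes are still alive and which one owns the listening sockets:

# which haproxy processes are alive, and what does the pid file say?
ps | grep '[h]aproxy'
cat /var/run/haproxy.pid

# which process currently owns the listening sockets (if netstat is present)
netstat -ltnp 2>/dev/null | grep haproxy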

It would be nice to get an update on this issue. Is there any point in upgrading to v11.0.1 for this, and are you still investigating it?

Thanks in advance!