Closed: mcasperson closed this issue 3 years ago.
I have this issue on v9 too. For some reason the voyager ingress still sends requests to terminated backend pods, even though they aren't in /etc/haproxy/haproxy.cfg anymore.
These are the failing logs:
local0.info: May 8 11:56:38 haproxy[7694]: xxx:39504 [08/May/2019:11:56:23.183] http-0_0_0_0-443~ collection.default:8080/pod-collection-b88c7bdb5-rqtlm 0/0/-1/-1/15495 503 212 - - SC-- 2/2/1/0/3 0/0 "GET /status HTTP/1.1"
local0.info: May 8 11:56:52 haproxy[7694]: xxx:56866 [08/May/2019:11:56:37.138] http-0_0_0_0-443~ collection.default:8080/pod-collection-b88c7bdb5-6qrdh 0/0/-1/-1/15380 503 212 - - SC-- 2/2/1/0/3 0/0 "GET /agentRecommendedListings?userId=xxx-xxx HTTP/1.1"
local0.info: May 8 11:56:57 haproxy[7694]: xxx:56946 [08/May/2019:11:56:41.690] http-0_0_0_0-443~ collection.default:8080/pod-collection-b88c7bdb5-md8w9 0/0/-1/-1/15387 503 212 - - SC-- 2/2/1/0/3 0/0 "GET /query/savedListings?userId=xxx-8 HTTP/1.1"
local0.info: May 8 11:57:10 haproxy[7694]: xxx:57074 [08/May/2019:11:56:55.154] http-0_0_0_0-443~ collection.default:8080/pod-collection-b88c7bdb5-rqtlm 0/0/-1/-1/15444 503 212 - - SC-- 2/2/0/0/3 0/0 "GET /query/listing-set?first=10&listingId=xxx-5&userId=xxx-8 HTTP/1.1"
The backend pods shown in the logs are long gone and new ones have been deployed, yet Voyager kept sending requests to the old ones.
kubectl get pods -l run=collection
NAME READY STATUS RESTARTS AGE
collection-794fff995f-5kwtk 1/1 Running 0 1h
collection-794fff995f-c2nmb 1/1 Running 0 1h
collection-794fff995f-mz7jr 1/1 Running 0 1h
collection-794fff995f-rr5ln 1/1 Running 0 1h
New pods, brought up by scaling the entire voyager deployment, obviously didn't produce any 503 SC-- errors:
local0.info: May 8 12:03:26 haproxy[7946]: xxx:59730 [08/May/2019:12:03:26.025] http-0_0_0_0-443~ collection.default:8080/pod-collection-794fff995f-c2nmb 0/0/1/9/10 200 792 - - ---- 2/2/0/0/0 0/0 "GET /query/savedListings?userId=xxx-xxx-8 HTTP/1.1"
ConfigMap also seems to be synced:
... xxx:8080 ssl verify none
backend collection.default:8080
    server pod-collection-794fff995f-5kwtk xxx:8080 ssl verify none
    server pod-collection-794fff995f-c2nmb xxx:8080 ssl verify none
    server pod-collection-794fff995f-mz7jr xxx:8080 ssl verify none
    server pod-collection-794fff995f-rr5ln xxx:8080 ssl verify none
backend ...
@tamalsaha: Any updates or thoughts on this?
Happened again. We're using Google Kubernetes Engine with Voyager 9.0.0.
@mkozjak For some reason the voyager ingress still sends requests to terminated backend pods, even though they aren't in /etc/haproxy/haproxy.cfg anymore.
- How did you check that? By exec'ing into the voyager pod, or by describing the ingress's ConfigMap?
@kfoozminus The ingress's ConfigMap values were updated. haproxy.cfg was fine too (yes, I exec'd in). I tailed the voyager pod logs and it would still hit backend pods that weren't alive anymore.
It feels like voyager is holding something in memory that never fully syncs (just a hunch :))
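For reference, the checks described above amount to something like the following (the pod, ConfigMap and label names are illustrative, borrowed from elsewhere in this thread):
# What is haproxy actually configured with?
kubectl exec -it voyager-k8s-ingress-66f7d879fd-2lfv9 -- grep server /etc/haproxy/haproxy.cfg
# What does the ingress's ConfigMap say?
kubectl get configmap voyager-k8s-ingress -o yaml | grep server
# Which backend pods actually exist?
kubectl get pods -l run=collection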
@mkozjak Does this problem go away after a certain period of time automatically? Or do you have to delete the controller pod every time?
I'm trying to reproduce the problem. Any details that can help me do that are much appreciated :)
@kfoozminus You mean, does it go away? No. And it starts happening at random intervals. Maybe @mcasperson has more info.
@mcasperson
I've experienced this problem a few times since, and deleting the Voyager pods always resolves the issue with no additional changes to the cluster.
Does the problem never go away on its own? Or does it just take too long to recover, so you delete the pod to avoid downtime?
@mkozjak
And it starts happening at random intervals.
So here's what I gathered:
/etc/haproxy/haproxy.cfg
inside the ingress pod shows the updated configuration, but your logs show traffic going to terminated pods. Is that right?
@mcasperson Is it the same for you?
@kfoozminus All good, except that the traffic also goes to the new ones. Sometimes it picks new pods and sometimes old ones (3rd bullet).
But the haproxy.cfg contains only the new pods?
Exactly!
Ok, thanks. I'm trying to reproduce this with load testing and restarting the backend pods over and over.
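A rough sketch of that reproduction loop (the deployment name and label come from the pod list above; run any HTTP load generator against the ingress in a second shell the whole time):
while true; do
  kubectl delete pod -l run=collection            # force fresh backend pods
  kubectl rollout status deployment/collection    # wait until replacements are ready
  sleep 30
done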
@mcasperson @mkozjak While we are trying to understand the problem - if it happens again - maybe we can arrange a virtual meeting to see what's going wrong.
Yes, that'd be great.
@mkozjak Are you using the default alpine-based haproxy image, or did you specify --haproxy-image-tag while installing voyager?
@kfoozminus
helm install appscode/voyager --name voyager-operator --version 9.0.0 \
--namespace kube-system --set cloudProvider=gke \
--set apiserver.enableValidatingWebhook=false
@kfoozminus Want me to upgrade to 10.x?
No, it's ok.
@kfoozminus Note that I'll soon need to update to 10.x, so maybe run your tests on that version as well.
@kfoozminus I've got the issue reproduced right now. Any way to do a live debugging session together? I'll need to restart everything in half an hour...
@mcasperson What did you do to tackle this one in the meantime?
@kfoozminus Upgraded to 10.x, I guess.
kubectl exec -it voyager-voyager-operator-76778cc5f9-4ctln -n kube-system voyager version
Version = 10.0.0
VersionStrategy = tag
Os = alpine
Arch = amd64
CommitHash = b866601fd012b57feb048e8eb3caa954ae5280af
GitBranch = release-10.0
GitTag = 10.0.0
CommitTimestamp = 2019-04-29T00:07:03
A really confusing part is:
kubectl describe pod voyager-k8s-ingress-66f7d879fd-2lfv9
Image: appscode/haproxy:1.9.2-9.0.0-alpine
Shouldn't it be appscode/haproxy:1.9.6-10.0.0-alpine?
@mkozjak sorry I wasn't online. The problem is gone by now, I guess? :(
Yup.
Hi Mario, Our offices are closed this week for Eid vacation. We will be back to regular schedule from next Monday (June 10). We are going to pick up this issue once we are back.
Regards, Tamal
Hey everyone, just wanted to mention that I'm having a very similar issue, if not the exact same one. Everything works great until large amounts of traffic move through the ingress; then the ingress keeps working properly for 10-15 minutes before it starts intermittently throwing 503s.
I'm using Azure Kubernetes Service 1.14.0 running Voyager 10, pointed at external backends (Azure app services). Nothing else has trouble reaching these services when we see this issue. Unfortunately, I'm unable to confirm whether the issue scales out in the same way, since I can only replicate it in prod. If there's anything I can do to help with this, I'm happy to.
Hey guys, just curious if you have any updates on this issue. Thanks!
We are currently assuming this happens because of a haproxy reload issue combined with pods being terminated without a grace period. I'm writing a detailed doc about it. But so far we've only been able to reproduce a variant that goes away after a while (basically, we see no 503s for new requests).
Are you facing this problem right now?
Yes, we're able to reproduce this issue any time we ramp up traffic to our ingress. We have it down for the time being but I can bring it back up if you need something.
Yes, that'd be great. Join our slack: https://appscode.slack.com/
Great, thanks! Just messaged you over there.
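As an aside on the "terminated without grace" half of that theory: one common mitigation is to keep backend pods alive for a few seconds after they are deregistered, so haproxy can finish reloading before the old endpoints vanish. A minimal sketch against the collection deployment from earlier (the 15-second value is an arbitrary example):
kubectl patch deployment collection --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/lifecycle",
   "value": {"preStop": {"exec": {"command": ["sleep", "15"]}}}}
]'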
@kfoozminus @lanserver Any updates on this one? In which slack channel did you do the debugging, if not in a direct message? Thanks in advance!
@mkozjak we used direct message. His issue seems different from this though.
@kfoozminus So it was something different from sudden 503s after certain pods die, and 200s after the ingress controller pods are restarted?
Yeah.
Well, ok. The next time I reproduce the issue I'll try sending the SIGUSR2 signal to the haproxy process so it reloads its configuration at runtime, and see if that solves the issue. Can you please continue your tests on this internally, @kfoozminus?
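One caveat with that plan: haproxy only reloads on SIGUSR2 when running in master-worker mode (-W), and the ps output further down in this thread shows voyager starting haproxy without -W, handing reloads to the controller via -sf instead. If you do try it, the pid file identifies the right process:
# inside the ingress pod; the -p flag in the ps output shows where the pid lives
kill -USR2 "$(cat /var/run/haproxy.pid)"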
@mkozjak Yes, of course! We are continuously testing and trying to reproduce this! Were you able to see how many haproxy processes were running at that moment?
@kfoozminus In one pod, or do you mean the number of pods? There were 4 pods running at the time.
Next time it happens, give us the output of ps aux | grep haproxy from inside the voyager pod. We want to see how many haproxy processes are running.
I hope to do a live session though.
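For completeness, that output can be grabbed without an interactive shell (pod name is illustrative; note the alpine image ships busybox ps, which may only accept a plain ps):
kubectl exec voyager-k8s-ingress-66f7d879fd-2lfv9 -- ps aux | grep haproxy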
kubectl describe pod voyager-k8s-ingress-66f7d879fd-2lfv9
Image: appscode/haproxy:1.9.2-9.0.0-alpine
Shouldn't it be appscode/haproxy:1.9.6-10.0.0-alpine?
@mkozjak yes, it should be. What exact command did you use to upgrade to 10.0.0?
@kfoozminus
helm upgrade voyager-operator appscode/voyager --version 10.0.0 --namespace kube-system --set cloudProvider=gke --set apiserver.enableValidatingWebhook=false
@kfoozminus: Maybe I did something wrong with the upgrade procedure?
I upgraded with the same command. Mine shows appscode/haproxy:1.9.6-10.0.0-alpine. Does yours still show appscode/haproxy:1.9.2-9.0.0-alpine?
It seems to have pulled the latest one in the end (appscode/haproxy:1.9.6-10.0.0-alpine), since I restart the ingress all the time.
After upgrading, it takes a few seconds to update the image (the time the new voyager operator needs to reach the Running state and update all ingress deployments).
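A quick way to confirm an ingress deployment picked up the new image after an operator upgrade (deployment name is illustrative):
kubectl get deployment voyager-k8s-ingress -o jsonpath='{.spec.template.spec.containers[0].image}'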
@kfoozminus: Any luck reproducing the issue?
Hi, it seems like the issue is still present in version 10.0.0. I got several 503 errors on frontend files today right after upgrading my stack, and killing the Voyager pod solved the issue.
@Simon3: next time you reproduce it, can you please run ps aux | grep haproxy and send SIGUSR2 to the process inside the pod instead of killing it, to see if it starts working again? Thanks!
Sure I will if it happens again.
Hey, I'm still on Voyager 10.0.0 and the bug keeps occurring, roughly twice a month. We do lots of helm upgrades though, around 20 per day. I tried the SIGUSR2 signal, but it didn't change anything. Here is my history in case I missed something:
bash-4.4# ps | grep haproxy
9 root 0:00 runsv haproxy-controller
12 root 43:03 voyager haproxy-controller --enable-analytics=true --burst=1000000 --cloud-provider=gke --ingress-api-version=voyager.appscode.com/v1beta1 --ingress-name=flowr-527-rc-latest-ingress --qps=1e+06 --logtostderr=false --alsologtostderr=false --v=3 --stderrthreshold=0
1002 root 0:10 haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid
1051 root 0:03 haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -x /var/run/haproxy.sock -sf 1048
1057 root 0:01 haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -x /var/run/haproxy.sock -sf 1054
bash-4.4#
bash-4.4# kill -SIGUSR2 12
It would be nice if we could get an update on this issue. Is there any point in upgrading to v11.0.1 for this? Are you still investigating it?
Thanks in advance!
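Worth noting from the listing above: PID 12 is the voyager haproxy-controller process, not haproxy itself, so that SIGUSR2 never reached haproxy. The haproxy processes are 1002, 1051 and 1057, and the active one is recorded in the pid file:
kill -USR2 "$(cat /var/run/haproxy.pid)"   # targets haproxy rather than the controller
Though, as noted earlier, haproxy started without -W ignores SIGUSR2 anyway.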
We have a cluster that has experienced an error where requests increasingly return 503 HTTP codes over time.
The image below shows the errors generated by requests to a K8S cluster with Voyager. This cluster was up for a number of days, and then from 11:00 onwards requests began returning 503 (ignore the peak around 9:00; that was an unrelated problem). At about 15:30 the Voyager ingress controller pods were deleted and restarted, and the errors went away.
We had 2 Voyager ingress controller pods, which would explain the zigzag pattern as each pod returns more (but not the same number of) 503 response codes.
I've experienced this problem a few times since, and deleting the Voyager pods always resolves the issue with no additional changes to the cluster.
The log files from the Voyager ingress controller pods don't show anything that indicates an error, other than listing the 503 error responses themselves.
I can set up a cron job to recycle the Voyager pods easily enough, which does seem to prevent the problem from happening.
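A minimal sketch of that workaround, assuming kubectl 1.14+ and a ServiceAccount permitted to delete pods; the schedule and label selector are illustrative and must match whatever labels your Voyager ingress pods carry:
kubectl create cronjob recycle-voyager \
  --image=bitnami/kubectl \
  --schedule="0 3 * * *" \
  -- kubectl delete pod -l origin=voyager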
So I have two questions:
We are using Voyager 7.40 in AKS, and routing TCP and HTTP traffic.