projectcontour / contour

Contour is a Kubernetes ingress controller using Envoy proxy.
https://projectcontour.io
Apache License 2.0

HTTPProxy multiple timeouts (daily) #3385

Closed cloudmaniac closed 9 months ago

cloudmaniac commented 3 years ago

What steps did you take and what happened: I have an environment with multiple HTTPProxy resources defined; they all run fine, except that 2 or 3 times a day all applications delivered through them become unavailable (timeout). It usually lasts for approx. 10 min and then everything returns to normal.

I checked for any corresponding events, but found nothing relevant so far. I'm looking for ideas on where to look, and can provide any logs if required.

Anything else you would like to add:

Environment:

skriss commented 3 years ago

@cloudmaniac just confirming that you're running Contour 1.4?

If so, is there any chance that you could upgrade to a newer release (v1.12.0 is the latest) and see if you have the same issue there?

cloudmaniac commented 3 years ago

Yes, version 1.4. I can upgrade tomorrow.

Can I go from 1.4 directly to 1.12 using the easy way to upgrade?

skriss commented 3 years ago

As long as you're already using HTTPProxy (not IngressRoute), and using a standard quickstart install, then yeah, I think that should be OK, since it's effectively an uninstall and reinstall.

cloudmaniac commented 3 years ago

I upgraded this morning from 1.4 to 1.12; I'll come back to you if we observe the issue again.

cloudmaniac commented 3 years ago

Well, I just had the exact same issue 15 min ago. Where should I start looking?

youngnick commented 3 years ago

As a quick check, can you look for restarts of the Contour and Envoy pods? If the Envoy pods were restarting, that could cause something like what you're describing.
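A quick way to do that (assuming the standard quickstart namespace projectcontour; adjust if your install differs):

# Restart counts show up in the RESTARTS column
kubectl -n projectcontour get pods

# Recent events in the namespace, sorted by time
kubectl -n projectcontour get events --sort-by=.metadata.creationTimestamp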

Aside from that though, how are you installing Contour? Is Envoy running behind a Service of type LoadBalancer in a public cloud?

cloudmaniac commented 3 years ago

Envoy pods have not restarted, and don't have any events.

I installed Contour using kubectl apply -f https://projectcontour.io/quickstart/v1.12.0/contour.yaml. These are on-prem vSphere virtual machines, and the IP used by Envoy is assigned by MetalLB.

youngnick commented 3 years ago

Okay, I'd flip the Envoys to debug logging mode and see if anything shows up at the time of an issue. The Contour logs will also tell you whether there were config updates at the time of the problem.
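For the Envoy debug-logging part, a rough sketch (this assumes the quickstart layout where the Envoy admin interface listens on 127.0.0.1:9001 inside the pod; the port may differ in your install, and <envoy-pod-name> is a placeholder for one of your Envoy pods):

# Forward the Envoy admin interface from one of the Envoy pods
kubectl -n projectcontour port-forward <envoy-pod-name> 9001:9001

# Switch all Envoy loggers to debug (the admin API requires a POST)
curl -X POST "http://localhost:9001/logging?level=debug"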

I haven't heard of anything like this before, so we just have to try and find where the problem is coming from; pulling the Contour and Envoy logs and lining the timestamps up is the next step.

cloudmaniac commented 3 years ago

I enabled debug logging for Envoy. Where/how should I look at the logs after an issue occurs? It's not mentioned in the documentation.

sunjayBhatia commented 3 years ago

Something like kubectl -n projectcontour logs daemonset/envoy -c envoy -f should work

cloudmaniac commented 3 years ago

I ran kubectl -n projectcontour logs daemonset/envoy -c envoy -f > debug.log and captured a full day during which several of those timeout events occurred. I can't find any errors in the logs.

youngnick commented 3 years ago

I've been thinking about this one for a few days, and I must admit I'm stumped. For this sort of periodic traffic drop, I'd expect something to be changing in either Envoy or Contour, and we've checked the obvious things (Envoy logs, no restarts). The other thing to check is the Contour logs: if something is funky with Contour, it might be restarting with cold caches, and the config could take a while to converge back to the correct state. The fact that a telnet to port 80 failed at the time also points in this direction, because Envoy will not respond on a port when there are no configured listeners. If this is the case, you will see Contour restarts, or a lot of entries in Contour's logs.
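A couple of checks along those lines (assuming the quickstart labels and container names; <contour-pod-name> is a placeholder for one of your Contour pods):

# Restart counts for the Contour pods
kubectl -n projectcontour get pods -l app=contour

# Logs from the previous container instance, if a Contour pod did restart
kubectl -n projectcontour logs <contour-pod-name> -c contour --previous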

Lastly, I'd also try to confirm that it's not related to MetalLB by exposing one of the services you currently put behind Contour directly via a Service of type LoadBalancer.
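One minimal way to set up that comparison (hypothetical names: my-app stands for whichever Deployment you currently route to through Contour, and 80 for its service port):

# Expose the backend directly through a MetalLB-assigned IP, bypassing Envoy entirely
kubectl expose deployment my-app --type=LoadBalancer --port=80 --name=my-app-direct

# Watch the external IP MetalLB assigns, then curl it during the next incident
kubectl get svc my-app-direct -w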

stevesloka commented 3 years ago

There are some metrics in Envoy we could look at if you've got Prometheus up and running. I'll dig up the specifics in the morning (on my phone now), but there should be some that indicate whether Envoy is connected to an xDS server (e.g. Contour). That would tell us whether that connection is dropping for some reason.
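Until Prometheus is in place, the relevant stat can also be read off the Envoy admin interface directly (a sketch, reusing the admin port-forward from above; control_plane.connected_state should be 1 while Envoy holds a connection to Contour's xDS server):

# With the admin port-forward still running:
curl -s "http://localhost:9001/stats?filter=control_plane"

# When scraping with Prometheus, the same gauge is exported as envoy_control_plane_connected_state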

cloudmaniac commented 3 years ago

FYI, we migrated production to a fresh Kubernetes cluster with the exact same config and kept pre-prod on the previous cluster: production is running fine, while pre-prod continues to have regular timeouts.

I'll check the Contour logs later, and see if I can deploy Prometheus (I've never done it, so I have to read some docs first).

youngnick commented 3 years ago

Thanks for the update @cloudmaniac. I wonder what it is about the pre-prod cluster that's different? Looking forward to hearing more.

cloudmaniac commented 3 years ago

Nothing different technically; I used the same VM templates, the same component versions, and the same Kubespray config to set up Kubernetes. The only difference is the IPs used.

youngnick commented 3 years ago

That's interesting. Is there any chance those IPs have been double-allocated? I've seen something like that before, where there was a routing tug-of-war over ownership of some IPs, and when traffic showed up at the wrong place, it was dropped by firewalls and so disappeared.

cloudmaniac commented 3 years ago

No, I made sure it was not an IP conflict.

youngnick commented 3 years ago

Thanks @cloudmaniac.

github-actions[bot] commented 10 months ago

The Contour project currently lacks enough contributors to adequately respond to all Issues.

Please send feedback to the #contour channel in the Kubernetes Slack.

cloudmaniac commented 9 months ago

Closing, we dropped usage of Contour because of the recurring issues.