Closed — cloudmaniac closed this issue 9 months ago
@cloudmaniac just confirming that you're running Contour 1.4?
If so, is there any chance that you could upgrade to a newer release (v1.12.0 is the latest) and see if you have the same issue there?
Yes, version 1.4. I can upgrade tomorrow.
Can I go from 1.4 directly to 1.12 using the easy way to upgrade?
As long as you're already using HTTPProxy (not IngressRoute), and using a standard quickstart install, then yeah, I think that should be OK, since it's effectively an uninstall and reinstall.
I upgraded this morning from 1.4 to 1.12; I'll come back to you if we observe the issue again.
Well, I just had the exact same issue 15 min ago. Where should I start looking?
In terms of a quick check, can you check for restarts of the Contour and Envoy pods? If the Envoy pods were restarting, that could cause something like what you're describing.
Aside from that though, how are you installing Contour? Is Envoy running behind a Service of type LoadBalancer in a public cloud?
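For example, a quick restart check might look like this (a sketch; `projectcontour` is the quickstart's default namespace, and the column positions assume the default `kubectl get pods` output):

```shell
# Show all pods; the RESTARTS column should stay at 0 for both the
# contour deployment and the envoy daemonset.
kubectl -n projectcontour get pods

# Or filter: with --no-headers, RESTARTS is the 4th column, so this
# prints only pods that have restarted (no output means all stable).
kubectl -n projectcontour get pods --no-headers \
  | awk '$4 != "0" {print $1, $4}'
```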
Envoy pods have not restarted, and don't have any events.
I installed Contour using kubectl apply -f https://projectcontour.io/quickstart/v1.12.0/contour.yaml.
These are vSphere virtual machines on-prem; the IP used by Envoy is assigned by MetalLB.
Okay, I'd flip the Envoys to debug logging mode, and see if you can see anything at the time when there is an issue. Contour logs will also tell you if there were updates at the time that there was a problem.
I haven't heard of anything like this before, so we just have to try and find where the problem is coming from; Contour and Envoy logs and lining times up is the next step.
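To flip the log level at runtime you can hit Envoy's admin interface; a sketch, assuming the admin listener from the quickstart bootstrap is on port 9001 inside the pod (verify against your deployment):

```shell
# Forward the Envoy admin port from one pod to your machine
# (9001 is assumed here; check your Envoy bootstrap config).
kubectl -n projectcontour port-forward ds/envoy 9001:9001 &

# POST to the /logging admin endpoint to raise verbosity; the same
# call with level=info puts it back afterwards.
curl -s -X POST 'http://localhost:9001/logging?level=debug'
```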
I activated the debug log for Envoy. Where/how should I look at the logs after an issue occurs? It's not mentioned in the documentation.
Something like kubectl -n projectcontour logs daemonset/envoy -c envoy -f should work.
I did a kubectl -n projectcontour logs daemonset/envoy -c envoy -f > debug.log and logged for a day during which multiple of those timeout events occurred. I can't find any errors in the logs.
I've been thinking about this one for a few days, and I must admit I'm stumped. For this sort of periodic traffic drop, I'd expect something about either Envoy or Contour to be changing, and we've checked the obvious things (Envoy logs, no restarts). The only other thing I can think to check is the Contour logs: if something is funky with Contour, it might be restarting with cold caches, and then the config takes a while to converge back to the correct state. The fact that a telnet to port 80 failed at this time also points that way (because Envoy will not respond on a port when there are no configured listeners). If this is the case, you will see Contour restarts, or a lot of log entries in Contour's logs.
Lastly, I'd also try to confirm it's not related to MetalLB, by putting a direct Service of type LoadBalancer in front of the service you're currently putting behind Contour.
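To rule MetalLB in or out, one sketch is to expose the same backend directly with its own LoadBalancer IP and see whether that path also times out during the next outage window (the selector and ports below are placeholders for your actual workload):

```shell
# Minimal Service of type LoadBalancer pointing straight at the app's
# pods, bypassing Envoy entirely; MetalLB should assign it a fresh IP.
manifest='apiVersion: v1
kind: Service
metadata:
  name: my-app-direct
  namespace: default
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080'

printf '%s\n' "$manifest" | kubectl apply -f -

# Note the EXTERNAL-IP MetalLB assigns, then hit it when the
# Contour-fronted apps next time out.
kubectl -n default get svc my-app-direct
```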
There are some metrics in Envoy we could look at if you've got Prometheus up and running. I'll dig up the specifics in the morning (on my phone now), but there should be one for whether Envoy is connected to an xDS server (e.g. Contour). That would tell us if that connection is dropping for some reason.
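The stat to watch is likely control_plane.connected_state, a gauge Envoy reports as 1 while its xDS stream to the management server (Contour, here) is up. A sketch without Prometheus, straight from the admin interface (admin port 9001 is an assumption from the quickstart bootstrap):

```shell
# Forward the admin port and scrape the raw stats; while the xDS
# connection to Contour is healthy, connected_state should read 1.
kubectl -n projectcontour port-forward ds/envoy 9001:9001 &
curl -s http://localhost:9001/stats | grep control_plane.connected_state
```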
FYI, we migrated production to a fresh k8s cluster with exact same config and kept the pre-prod on the previous cluster: production is running fine, and pre-prod continues to have regular timeouts.
I'll check about Contour logs later, and see if I can deploy Prometheus (but I never did it, so I have to read some doc first).
Thanks for the update @cloudmaniac. I wonder what it is about the pre-prod cluster that's different? Looking forward to hearing more.
Nothing different technically; I used the same VM templates, the same component versions, and the same Kubespray config to set up Kubernetes. The only difference is the IPs used.
That's interesting, is there any chance those IPs have been double-allocated? I've seen something like that before, where there was a routing tug-of-war over ownership of some IPs, and when traffic showed up at the wrong place, it was dropped by firewalls and so disappeared.
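For anyone checking this later, one way to sketch that duplicate-IP test is arping's duplicate address detection mode (the interface name and address below are placeholders; run it from a machine on the same L2 segment while the Service IP is assigned):

```shell
# -D: duplicate address detection. iputils arping exits 0 when no
# other host answers for the address, non-zero when one does.
sudo arping -D -I eth0 -c 3 192.0.2.10 && echo "no duplicate seen"
```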
No, I made sure it was not an IP conflict.
Thanks @cloudmaniac.
Closing, we dropped usage of Contour because of the recurring issues.
What steps did you take and what happened: I have an environment with multiple httpproxy resources defined; they are all running fine, except 2 or 3 times a day when all applications delivered via those resources become unavailable (timeout). It usually lasts approx. 10 min and then everything comes back to normal. I checked if I could find any corresponding events, but nothing relevant so far. I'm looking for ideas on where to look. I can submit any logs if required.
Anything else you would like to add: telnet <ip> 80 timed out as well.
Environment:
Kubernetes version (use kubectl version): 1.16.0
OS (e.g. from /etc/os-release): Ubuntu 18.04.3 LTS