Hi Tobias
I don't have experience with the ARO service, but perhaps this information can help you.
If only the me_ingress check is failing and no path_ or me_service checks, I would investigate the load balancer or the ingress controller, as the networking inside the cluster seems to work.
Therefore I would check those components first.
One other option to "debug" this issue would be to start a Pod with a curl -v job that frequently fetches the /alwayshappy endpoint.
Currently there is no debug level in kubenurse, but I could imagine that using httptrace directly or go-httpstat could be an improvement to get error and latency details.
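To illustrate the idea, here is a minimal, untested sketch (not kubenurse code) of what using httptrace directly could look like for a single check request; the URL is only a placeholder:

// Sketch only, not kubenurse code: use net/http/httptrace to surface DNS and
// connect errors plus latencies for a single check request.
package main

import (
    "context"
    "log"
    "net/http"
    "net/http/httptrace"
    "time"
)

func tracedGet(url string) error {
    start := time.Now()

    trace := &httptrace.ClientTrace{
        DNSDone: func(info httptrace.DNSDoneInfo) {
            if info.Err != nil {
                log.Printf("dns_done failed after %v: %v", time.Since(start), info.Err)
            }
        },
        ConnectDone: func(network, addr string, err error) {
            if err != nil {
                log.Printf("connect_done failed for %s after %v: %v", addr, time.Since(start), err)
            }
        },
        GotFirstResponseByte: func() {
            log.Printf("first response byte after %v", time.Since(start))
        },
    }

    ctx := httptrace.WithClientTrace(context.Background(), trace)
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return err
    }

    resp, err := (&http.Client{Timeout: 5 * time.Second}).Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    return nil
}

func main() {
    // Placeholder URL, analogous to the me_ingress check.
    if err := tracedGet("https://kubenurse.example.com/alwayshappy"); err != nil {
        log.Printf("request failed: %v", err)
    }
}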
Best regards Boris
Hello Boris,
Thanks for your reply.
Only me_ingress isn't working; everything else works fine.
> Is it possible that you have high latencies on the ingress or DNS path?
When the issue occurs, I have tried a curl -v in the terminal of the same kubenurse pod and it works without any problem. But I can set up a second debug pod with curl -v.
It would be great if this could be implemented :)
Regards Tobias
Hi Tobias
kubenurse isn't retrying DNS or HTTP requests by itself, so it could be that you are just hitting failing DNS resolutions.
The httptrace hooks look promising for getting additional error details, and a basic extension shouldn't be that hard to implement. I will try it when I have time, but I will open a proposal right now.
Best regards, Boris
Hi Tobias
I hacked something together for #45, which adds usable httptrace metrics and also logs lower-level errors. It looks like this:
# Connection issue
2022/07/26 15:45:04 httptrace: failed connect_done for path_kind-control-plane with dial tcp 10.244.0.8:8080: connect: connection refused
2022/07/26 15:45:04 failed request for path_kind-control-plane with Get "http://10.244.0.8:8080/alwayshappy": dial tcp 10.244.0.8:8080: connect: connection refused
# DNS Issue (dns_done)
2022/07/26 15:46:32 httptrace: failed dns_done for api_server_dns with lookup kubernetes.default.svc.cluster.local: operation was canceled
2022/07/26 15:46:32 failed request for api_server_dns with Get "https://kubernetes.default.svc.cluster.local:443/version": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
And metrics are similar to this:
kubenurse_httpclient_trace_request_duration_seconds_bucket{event="wrote_request",type="api_server_dns",le="0.0005"} 0
kubenurse_httpclient_trace_request_duration_seconds_bucket{event="wrote_request",type="api_server_dns",le="0.005"} 0
kubenurse_httpclient_trace_request_duration_seconds_bucket{event="wrote_request",type="api_server_dns",le="0.01"} 0
kubenurse_httpclient_trace_request_duration_seconds_bucket{event="wrote_request",type="api_server_dns",le="0.025"} 11
kubenurse_httpclient_trace_request_duration_seconds_bucket{event="wrote_request",type="api_server_dns",le="0.05"} 13
kubenurse_httpclient_trace_request_duration_seconds_bucket{event="wrote_request",type="api_server_dns",le="0.1"} 13
kubenurse_httpclient_trace_request_duration_seconds_bucket{event="wrote_request",type="api_server_dns",le="0.25"} 13
kubenurse_httpclient_trace_request_duration_seconds_bucket{event="wrote_request",type="api_server_dns",le="0.5"} 13
kubenurse_httpclient_trace_request_duration_seconds_bucket{event="wrote_request",type="api_server_dns",le="1"} 13
kubenurse_httpclient_trace_request_duration_seconds_bucket{event="wrote_request",type="api_server_dns",le="+Inf"} 55
kubenurse_httpclient_trace_request_duration_seconds_sum{event="wrote_request",type="api_server_dns"} 103.054209003
kubenurse_httpclient_trace_request_duration_seconds_count{event="wrote_request",type="api_server_dns"} 55
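For reference, a rough sketch of how such trace events could be recorded into a Prometheus histogram with event and type labels using prometheus/client_golang; this is only an assumed shape, not the actual #45 implementation:

// Sketch only, assumed shape and not the actual #45 code: an httptrace event
// duration histogram with "event" and "type" labels.
package main

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

var traceDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "kubenurse_httpclient_trace_request_duration_seconds",
        Help:    "Latency of httptrace events per check type.",
        Buckets: []float64{0.0005, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1},
    },
    []string{"event", "type"},
)

// observeEvent would be called from the httptrace hooks (e.g. WroteRequest or
// DNSDone) with the time elapsed since the request was started.
func observeEvent(event, checkType string, elapsed time.Duration) {
    traceDuration.WithLabelValues(event, checkType).Observe(elapsed.Seconds())
}

func main() {
    prometheus.MustRegister(traceDuration)

    // Example observation, as a hook would record after "wrote_request".
    observeEvent("wrote_request", "api_server_dns", 18*time.Millisecond)
}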
Are you interested in trying this work in progress out? You can build it from the branch 45-proposal-use-httptrace-to-get-advanced-metrics-and-error-messages, or let me know if I should temporarily publish an image.
Hello Boris,
Thanks for implementing this feature. If it is possible for you, it would be great if you could publish an image. But no hurry, I am not in the company until the end of August.
If it is not possible for you to publish an image, I will build it from scratch when I'm back.
Thanks & Regards Tobias
Hi @djboris9 and @burkhat
The failed request might also have been due to a "race condition" between the time when a kubenurse pod is stopped and the time it takes for the Kubernetes endpoint change to propagate to the ingress controller.
There are more details in this commit, but basically, as the shutdown following a SIGTERM happened almost instantaneously, your ingress controller was sometimes sending a request to a pod IP which had already terminated. The above commit should help prevent such cases (I've set a 5s wind-down duration by default, configurable via an environment variable).
In our environments at least, it proved quite successful, and thanks to it we don't have me_ingress failures during e.g. a rollout restart of the daemonset, something we had before.
Maybe this fix also applies to the issue you described here.
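For context, a minimal sketch of such a wind-down, assuming a configurable delay between SIGTERM and the actual server shutdown; the environment variable name below is hypothetical:

// Sketch only: keep the pod serving for a wind-down period after SIGTERM so
// the ingress controller has time to remove its endpoint before we stop.
package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    srv := &http.Server{Addr: ":8080"}

    go func() {
        if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            log.Fatal(err)
        }
    }()

    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGTERM, os.Interrupt)
    <-sigs

    // 5s default; the environment variable name is hypothetical, the real
    // kubenurse setting may be named differently.
    windDown := 5 * time.Second
    if v := os.Getenv("KUBENURSE_SHUTDOWN_DURATION"); v != "" {
        if d, err := time.ParseDuration(v); err == nil {
            windDown = d
        }
    }

    log.Printf("received SIGTERM, waiting %v before shutting down", windDown)
    time.Sleep(windDown)

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    _ = srv.Shutdown(ctx)
}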
@djboris9 I will continue the work you did in #45
Hello everyone,
We're using kubenurse in our OpenShift environments on-premises and on Azure. On-premises everything works as expected, but on Azure Red Hat OpenShift (ARO) we see strange behaviour.
Sometimes the request for "me_ingress" fails with the following error message:
2022/07/10 03:03:25 failed request for me_ingress with Get "https://kubenurse.cloud.rohde-schwarz.com/alwayshappy": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
But if we execute a curl command from the terminal at the same time, the command is successful.
Sometimes only a restart of the complete Node solves the issue.
We don't have any idea why this error message appears. Is it possible to set a debug level to get more information?
Regards Tobias