postfinance / kubenurse

Kubernetes network monitoring

Failed request for me_ingress #44

Closed burkhat closed 9 months ago

burkhat commented 2 years ago

Hello everyone,

We're using kubenurse in our OpenShift environments, both on-premises and on Azure. On-premises everything works as expected, but on Azure Red Hat OpenShift (ARO) we see strange behaviour.

Sometimes the request for "me_ingress" fails with the following error message:
2022/07/10 03:03:25 failed request for me_ingress with Get "https://kubenurse.cloud.rohde-schwarz.com/alwayshappy": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

But if we execute a curl command from the terminal at the same time, the command succeeds.

~ $ curl https://kubenurse.cloud.rohde-schwarz.com/alwayshappy -v
*   Trying 51.105.220.245:443...
* Connected to kubenurse.cloud.rohde-schwarz.com (51.105.220.245) port 443 (#0)
* ALPN: offers h2
* ALPN: offers http/1.1
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN: server accepted h2
* Server certificate:
*  subject: C=DE; ST=Bayern; L=München; O=Rohde & Schwarz GmbH & Co. KG; CN=*.cloud.rohde-schwarz.com
*  start date: Jun 21 00:00:00 2022 GMT
*  expire date: Jun 29 23:59:59 2023 GMT
*  subjectAltName: host "kubenurse.cloud.rohde-schwarz.com" matched cert's "*.cloud.rohde-schwarz.com"
*  issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=Thawte RSA CA 2018
*  SSL certificate verify ok.
* Using HTTP2, server supports multiplexing
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* h2h3 [:method: GET]
* h2h3 [:path: /alwayshappy]
* h2h3 [:scheme: https]
* h2h3 [:authority: kubenurse.cloud.rohde-schwarz.com]
* h2h3 [user-agent: curl/7.83.1]
* h2h3 [accept: */*]
* Using Stream ID: 1 (easy handle 0x7f37d3bd51b0)
> GET /alwayshappy HTTP/2
> Host: kubenurse.cloud.rohde-schwarz.com
> user-agent: curl/7.83.1
> accept: */*
> 
< HTTP/2 200 
< date: Mon, 11 Jul 2022 11:34:03 GMT
< set-cookie: ca2642a40ba64b7363cb7d68301b753a=4d110bca2cf93cb742c45b40f9c7d239; path=/; HttpOnly; Secure; SameSite=None
< content-length: 0
< cache-control: private
< strict-transport-security: max-age=31536000; includesubdomains; preload
< 
* Connection #0 to host kubenurse.cloud.rohde-schwarz.com left intact

Sometimes only restarting the whole node resolves the issue.

We have no idea why this error message appears. Is it possible to set a debug level to get more information?

Regards Tobias

djboris9 commented 2 years ago

Hi Tobias

I don't have experience with the ARO service, but perhaps the following information will help you.

If only the me_ingress check is failing and none of the path_ or me_service checks, I would investigate the load balancer or the ingress controller, as the networking inside the cluster seems to work.

Therefore I would check the following:

One other option to "debug" this issue would be to start a Pod with a curl -v job that frequently fetches the /alwayshappy endpoint.
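For illustration, a minimal Go poller along those lines could look like the sketch below (just an alternative to a curl loop; the URL is the ingress endpoint from this issue, and the interval and timeout values are arbitrary):

```go
// Sketch of a debug poller that could run in a Pod instead of a curl loop.
// URL, interval and timeout are assumptions taken from / chosen for this issue.
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	for {
		start := time.Now()
		resp, err := client.Get("https://kubenurse.cloud.rohde-schwarz.com/alwayshappy")
		if err != nil {
			log.Printf("request failed after %v: %v", time.Since(start), err)
		} else {
			resp.Body.Close()
			log.Printf("status %d in %v", resp.StatusCode, time.Since(start))
		}
		time.Sleep(10 * time.Second)
	}
}
```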

Currently there is no debug level in kubenurse, but I could imagine that using httptrace directly or go-httpstat would be an improvement to get error and latency details.
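For example, a rough sketch of what hooking net/http/httptrace into a request could look like (an illustration of the idea only, not the kubenurse implementation):

```go
// Sketch only: uses net/http/httptrace hooks to log DNS and connect errors
// and latencies for a single request.
package main

import (
	"context"
	"log"
	"net/http"
	"net/http/httptrace"
	"time"
)

func main() {
	start := time.Now()
	trace := &httptrace.ClientTrace{
		DNSDone: func(info httptrace.DNSDoneInfo) {
			log.Printf("dns_done after %v, err=%v", time.Since(start), info.Err)
		},
		ConnectDone: func(network, addr string, err error) {
			log.Printf("connect_done to %s after %v, err=%v", addr, time.Since(start), err)
		},
		GotFirstResponseByte: func() {
			log.Printf("got first response byte after %v", time.Since(start))
		},
	}

	ctx := httptrace.WithClientTrace(context.Background(), trace)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		"https://kubenurse.cloud.rohde-schwarz.com/alwayshappy", nil)
	if err != nil {
		log.Fatal(err)
	}

	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		log.Fatalf("request failed: %v", err)
	}
	defer resp.Body.Close()
	log.Printf("status %d after %v", resp.StatusCode, time.Since(start))
}
```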

Best regards Boris

burkhat commented 2 years ago

Hello Boris,

Thanks for your reply.

Only me_ingress isn't working; everything else works fine.

Is it possible that you have high latencies on the ingress or DNS path?

When the issue occurs I try a curl -v in the terminal of the same kubenurse pod, and this works without any problem. But I can set up a second debug pod with curl -v.

It would be great if this could be implemented :)

Regards Tobias

djboris9 commented 2 years ago

Hi Tobias

kubenurse doesn't retry DNS or HTTP requests by itself, so it could be that you are simply hitting failing DNS resolutions.

The httptrace hooks look promising for getting additional error details, and a basic extension shouldn't be hard to implement. I will try it when I have time, but I will open a proposal right now.

Best regards, Boris

djboris9 commented 2 years ago

Hi Tobias

I hacked something together for #45, which adds usable httptrace metrics and also logs lower-level errors. It looks like this:

# Connection issue
2022/07/26 15:45:04 httptrace: failed connect_done for path_kind-control-plane with dial tcp 10.244.0.8:8080: connect: connection refused
2022/07/26 15:45:04 failed request for path_kind-control-plane with Get "http://10.244.0.8:8080/alwayshappy": dial tcp 10.244.0.8:8080: connect: connection refused

# DNS Issue (dns_done)
2022/07/26 15:46:32 httptrace: failed dns_done for api_server_dns with lookup kubernetes.default.svc.cluster.local: operation was canceled
2022/07/26 15:46:32 failed request for api_server_dns with Get "https://kubernetes.default.svc.cluster.local:443/version": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

And metrics are similar to this:

kubenurse_httpclient_trace_request_duration_seconds_bucket{event="wrote_request",type="api_server_dns",le="0.0005"} 0
kubenurse_httpclient_trace_request_duration_seconds_bucket{event="wrote_request",type="api_server_dns",le="0.005"} 0
kubenurse_httpclient_trace_request_duration_seconds_bucket{event="wrote_request",type="api_server_dns",le="0.01"} 0
kubenurse_httpclient_trace_request_duration_seconds_bucket{event="wrote_request",type="api_server_dns",le="0.025"} 11
kubenurse_httpclient_trace_request_duration_seconds_bucket{event="wrote_request",type="api_server_dns",le="0.05"} 13
kubenurse_httpclient_trace_request_duration_seconds_bucket{event="wrote_request",type="api_server_dns",le="0.1"} 13
kubenurse_httpclient_trace_request_duration_seconds_bucket{event="wrote_request",type="api_server_dns",le="0.25"} 13
kubenurse_httpclient_trace_request_duration_seconds_bucket{event="wrote_request",type="api_server_dns",le="0.5"} 13
kubenurse_httpclient_trace_request_duration_seconds_bucket{event="wrote_request",type="api_server_dns",le="1"} 13
kubenurse_httpclient_trace_request_duration_seconds_bucket{event="wrote_request",type="api_server_dns",le="+Inf"} 55
kubenurse_httpclient_trace_request_duration_seconds_sum{event="wrote_request",type="api_server_dns"} 103.054209003
kubenurse_httpclient_trace_request_duration_seconds_count{event="wrote_request",type="api_server_dns"} 55
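
Such a histogram with event and type labels could be declared with the Prometheus Go client roughly like this (a sketch based on the output above; the actual code in #45 may differ):

```go
// Sketch: a histogram vector keyed by trace event and check type, matching
// the metric name and bucket boundaries shown in the sample output above.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var traceDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "kubenurse_httpclient_trace_request_duration_seconds",
		Help:    "Latency of httptrace events per check type.",
		Buckets: []float64{0.0005, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1},
	},
	[]string{"event", "type"},
)

func init() {
	prometheus.MustRegister(traceDuration)
}

// ObserveEvent records the elapsed time for one trace event of a given check type.
func ObserveEvent(event, checkType string, seconds float64) {
	traceDuration.WithLabelValues(event, checkType).Observe(seconds)
}
```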

Are you interested in trying this work-in-progress out? You can build it from the branch 45-proposal-use-httptrace-to-get-advanced-metrics-and-error-messages, or let me know if I should temporarily publish an image.

burkhat commented 2 years ago

Hello Boris,

Thanks for implementing this feature. If it is possible for you, it would be great if you could publish an image. But no hurry, I am not in the company until the end of August.

If it is not possible for you to publish an image, I will build it from scratch when I'm back.

Thanks & Regards Tobias

clementnuss commented 9 months ago

Hi @djboris9 and @burkhat

The failed requests might also have been due to a "race condition" between the moment a kubenurse pod is stopped and the time it takes for the Kubernetes endpoint change to propagate to the ingress controller.

There are more details in this commit, but basically: since the shutdown following a SIGTERM happened almost instantaneously, your ingress controller was sometimes sending requests to a pod IP that had already terminated. The above commit should help prevent such cases (I've set a 5 s wind-down duration by default, configurable via an environment variable).
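
Roughly sketched, the idea looks like this (not the actual kubenurse code; SHUTDOWN_DURATION is a hypothetical variable name used here only for illustration):

```go
// Sketch of a SIGTERM wind-down delay before shutting the HTTP server down.
// SHUTDOWN_DURATION is a made-up variable name for this illustration.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Keep answering requests while the endpoint removal propagates
	// to the ingress controller / load balancer.
	delay := 5 * time.Second
	if v := os.Getenv("SHUTDOWN_DURATION"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			delay = d
		}
	}
	log.Printf("SIGTERM received, waiting %v before shutdown", delay)
	time.Sleep(delay)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	_ = srv.Shutdown(ctx)
}
```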

In our environments at least it has proved quite successful: thanks to it we no longer see me_ingress failures during e.g. a rollout restart of the daemonset, something we saw before.

Maybe this fix also applies to the issue you described here.

@djboris9 I will continue the work you did in #45