sensu / sensu-k8s-quick-start


etcd context deadline exceeded - sensu backend not connecting to etcd #9

Open dcharleston opened 3 years ago

dcharleston commented 3 years ago

I'm following the readme and using all default settings. Running locally on minikube. sensu-backend pod repeatedly fails because the readiness check for the backend's /health endpoint never passes. It returns:

{
    "Alarms": null,
    "ClusterHealth": [
        {
            "MemberID": 10276657743932975437,
            "MemberIDHex": "8e9e05c52164694d",
            "Name": "sensu-etcd-0",
            "Err": "context deadline exceeded",
            "Healthy": false
        }
    ],
    "Header": {
        "cluster_id": 14841639068965178418,
        "member_id": 10276657743932975437,
        "raft_term": 2
    }
}

etcd cluster health comes back as healthy from both the etcd and the sensu-backend containers:

/ # ETCDCTL_API=3 etcdctl --endpoints "http://sensu-etcd-0.sensu-etcd.sensu-example.svc.cluster.local:2379" endpoint health
http://sensu-etcd-0.sensu-etcd.sensu-example.svc.cluster.local:2379 is healthy: successfully committed proposal: took = 1.554986ms

The following errors appear in the sensu-backend logs:

{"level":"warn","ts":"2021-05-11T23:18:27.057Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-69922e1f-6460-409b-99ae-ede3c1ab2c80/sensu-etcd-0:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
....
....
....
{"component":"store","level":"info","members":[{"ID":10276657743932975437,"name":"sensu-etcd-0","peerURLs":["http://localhost:2380"],"clientURLs":["http://sensu-etcd-0:2379"]}],"msg":"retrieved cluster members","time":"2021-05-11T23:27:28Z"}
{"cache_version":"v1","component":"cache","level":"debug","msg":"rebuilding the cache for resource type *v2.Silenced","time":"2021-05-11T23:27:29Z"}
{"cache_version":"v1","component":"cache","level":"debug","msg":"rebuilding the cache for resource type *v2.Namespace","time":"2021-05-11T23:27:29Z"}
{"cache_version":"v2","component":"cache","level":"debug","msg":"rebuilding the cache for resource type *v3.EntityConfig","time":"2021-05-11T23:27:29Z"}
{"cache_version":"v2","component":"cache","level":"debug","msg":"rebuilding the cache for resource type *v3.EntityConfig","time":"2021-05-11T23:27:29Z"}
{"backend_id":"1d9a5640-eba9-4ee9-89a9-a8c25ff09831","component":"metricsd","level":"debug","msg":"metricsd heartbeat","name":"entity_metrics","time":"2021-05-11T23:27:29Z"}
{"backend_id":"1d9a5640-eba9-4ee9-89a9-a8c25ff09831","component":"metricsd","level":"debug","msg":"metricsd heartbeat","name":"cluster_metrics","time":"2021-05-11T23:27:30Z"}
{"component":"metricsd","level":"info","msg":"refreshing metrics suite on this backend","name":"entity_metrics","time":"2021-05-11T23:27:31Z"}
{"level":"warn","ts":"2021-05-11T23:27:31.672Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-df513939-a8df-41d5-a64b-ae6bb7bee084/sensu-etcd-0:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"component":"store","health":{"MemberID":10276657743932975437,"MemberIDHex":"8e9e05c52164694d","Name":"sensu-etcd-0","Err":"context deadline exceeded","Healthy":false},"level":"info","msg":"cluster member health","time":"2021-05-11T23:27:31Z"}
tattwei46 commented 2 years ago

Hi, is anyone taking a look at the above issue? I'm having the same problem.

rivlinpereira commented 2 years ago

same issue

jspaleta commented 2 years ago

Okay, here's the underlying issue as I see it in my minikube environment running on my Fedora Linux system. The sensu-backend readinessProbe is failing in a weird way because of not-quite-supported IPv6 in minikube.

It looks like minikube is letting the sensu-backend bind its TCP API to IPv6 localhost port 8080 instead of IPv4 port 8080, and there doesn't seem to be an obvious way to prevent minikube from allowing this to happen. Here's what it looks like from inside the sensu-backend-0 container running under my minikube:

$ netstat -tlpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.1:6060          0.0.0.0:*               LISTEN      1/sensu-backend
tcp        0      0 127.0.0.1:3030          0.0.0.0:*               LISTEN      -
tcp        0      0 127.0.0.1:3031          0.0.0.0:*               LISTEN      -
tcp        0      0 :::8081                 :::*                    LISTEN      1/sensu-backend
tcp        0      0 :::8080                 :::*                    LISTEN      1/sensu-backend
tcp        0      0 :::3000                 :::*                    LISTEN      1/sensu-backend

Those last three services are listening on IPv6, and that's definitely not good.

The k8s configurations provided in this repo assume IPv4 will be used in the pods. The sensu-backend readinessProbe uses the busybox-provided wget in an Alpine container, which is not IPv6 compatible.

We need to either figure out a way to configure minikube so it doesn't let that happen, or figure out a way to tell the sensu-backend to explicitly bind on IPv4 localhost.
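For what it's worth, the Sensu Go backend reference documents an `api-listen-address` setting (default `[::]:8080`), so one way to force an IPv4 bind would be to pass it explicitly. Below is a sketch of what that might look like in the sensu-backend StatefulSet container spec; the surrounding manifest structure is assumed, not copied from this repo:

```yaml
# Hypothetical excerpt of the sensu-backend container spec: bind the API
# explicitly on IPv4 so the busybox wget readinessProbe (which can't
# speak IPv6) can reach it. Field layout here is illustrative.
containers:
  - name: sensu-backend
    command:
      - sensu-backend
      - start
      - --api-listen-address=0.0.0.0:8080   # default [::]:8080 binds IPv6
```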

jspaleta commented 2 years ago

Turns out this is a problem with the sensu-backend readinessProbe settings. The settings were too aggressive for default minikube resource provisioning, and probes were being started faster than they were timing out, causing a problem.

Please test PR #10 and comment there on the potential fix

mvthul commented 2 years ago

Did anyone get this working?

jspaleta commented 2 years ago

@mvthul I believe I have a fix for this, and I have an open PR for it, see previous comment. I just need someone experiencing the problem to test my proposed fix and make sure it works for them.

mvthul commented 2 years ago

I applied the changes of yours that I could see. Everything is green and running, but the context deadline error still appears in the logs. When I log in to Sensu there is a red bar popping up, and if I click details I see "context deadline exceeded" under etcd. I've tried so many things to fix this, and tried so many other Helm charts and scripts. Nothing seems to work with version 6+.

jspaleta commented 2 years ago

The specific changes needed to solve the problem may require system-specific changes to the configuration... let me explain.

There are timeouts configured for the readiness probes, and if the system running minikube is resource-poor, those configurations will be too aggressive and the readiness probes will fall over because the underlying service didn't get enough CPU cycles to complete the startup process.

The PR I put together changes these settings enough that it works on my laptop running minikube. But the nature of the problem is such that even though it works for me, it might fail for someone else with tighter system resources.

There might not be a one-size-fits-all solution here, because we definitely still want the readiness probes to give up at a reasonable point. For something like Google's or Amazon's managed service, that reasonable point of failure is much sooner than for any local minikube deployment, because of available resources.

If, as a minikube user, you're still having this specific problem, you may need to further adjust the readinessProbe settings to give your minikube deployment more time to provision everything.
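As a sketch of that kind of tuning (the probe command and values here are illustrative, not the ones in this repo or its PR), the key ideas are a longer initial delay and a probe period longer than the timeout, so probes can't pile up faster than they time out:

```yaml
# Hypothetical readinessProbe tuning for a resource-constrained minikube.
readinessProbe:
  exec:
    command: ["wget", "-q", "-O-", "http://127.0.0.1:8080/health"]
  initialDelaySeconds: 45   # let sensu-backend finish connecting to etcd
  periodSeconds: 20         # keep period > timeout so probes never overlap
  timeoutSeconds: 10
  failureThreshold: 6       # tolerate roughly two minutes of slow startup
```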

mvthul commented 2 years ago

I tried in Azure AKS and locally with MicroK8s; both have the same issue 😭



jspaleta commented 2 years ago

Okay, well, this isn't confined to minikube. This needs to be reinvestigated.

Azure AKS isn't a service I've tested against yet, but I'll look into it.

jspaleta commented 2 years ago

@mvthul Okay, so for me on minikube, the context deadline exceeded error is most likely due to slow disk access to the virtualized volumes. etcd is sensitive to slow disk performance for its backing store.
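Not from this thread, but for context: etcd's upstream tuning guidance is that fsync latency on the data volume is the critical number, and it can be measured with fio. The directory below is an assumption; point it at wherever the etcd data volume is mounted in your pod.

```shell
# Sketch: measure fsync latency the way etcd's WAL exercises the disk.
# /var/lib/etcd is an assumed mount path for the sensu-etcd data volume.
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd --size=22m --bs=2300 --name=etcd-fsync-check
```

etcd's docs suggest the 99th-percentile fdatasync latency should stay under roughly 10ms for reliable operation.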

For me, the context deadline exceeded messages are intermittent and aren't causing a problem for the intended purpose of kicking the tires in minikube; everything spins up and I'm able to use the Sensu dashboard.

For Azure AKS, you might need to change the storage class associated with the sensu-etcd persistent volume. I don't know what storageClass options AKS has out of the gate, but you'll want a dedicated SSD for the sensu-etcd volume.
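As a sketch under those assumptions, the change would land in the etcd StatefulSet's volumeClaimTemplate. `managed-premium` is a Premium-SSD-backed storage class AKS typically ships with, but verify with `kubectl get storageclass` on your cluster; the claim name and size below are illustrative:

```yaml
# Hypothetical volumeClaimTemplate for the sensu-etcd StatefulSet on AKS:
# request Premium SSD-backed storage so etcd's WAL fsyncs stay fast.
volumeClaimTemplates:
  - metadata:
      name: sensu-etcd-data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: managed-premium   # check available classes first
      resources:
        requests:
          storage: 10Gi
```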

WladyX commented 1 year ago

I was experiencing the same issue on Tanzu Kubernetes. The PR seems to work as expected; I think you should merge it.

sensu-discourse commented 1 year ago

This issue has been mentioned on Sensu Community. There might be relevant details there:

https://discourse.sensu.io/t/issues-installing-sensu-6-10-on-eks/3137/2