kubewharf / kelemetry

Global control plane tracing for Kubernetes
Apache License 2.0

etcd CrashLoopBackOff and health check failed #134

Closed calvinxu closed 1 year ago

calvinxu commented 1 year ago

Steps to reproduce

  1. Check out the 0.2.2 tag from the kelemetry repo
  2. helm install kelemetry oci://ghcr.io/kubewharf/kelemetry-chart --values values.yaml

Expected behavior

All etcd pods run normally and pass their health checks.

Actual behavior

kelemetry-etcd-0   0/1   CrashLoopBackOff   18 (83s ago)   74m
kelemetry-etcd-1   1/1   Running            0              74m
kelemetry-etcd-2   1/1   Running            0              74m

# kubectl logs kelemetry-etcd-0
2023-07-20 08:54:32.056926 I | etcdmain: etcd Version: 3.3.13
2023-07-20 08:54:32.056972 I | etcdmain: Git SHA: 98d3084
2023-07-20 08:54:32.056976 I | etcdmain: Go Version: go1.10.8
2023-07-20 08:54:32.056979 I | etcdmain: Go OS/Arch: linux/amd64
2023-07-20 08:54:32.056982 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2023-07-20 08:54:32.057030 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2023-07-20 08:54:32.057261 I | embed: listening for peers on http://0.0.0.0:2380
2023-07-20 08:54:32.057293 I | embed: listening for client requests on 0.0.0.0:2379
2023-07-20 08:54:32.057826 I | pkg/netutil: resolving kelemetry-etcd-0:2380 to 192.168.253.80:2380
2023-07-20 08:54:32.060864 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:33.065013 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:34.069684 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:35.073983 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:36.079512 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:37.083088 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:38.102341 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:39.107733 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:40.112295 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:41.116753 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:42.121587 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:43.126384 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:44.131495 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:45.136237 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:46.141086 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:47.145136 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:48.150200 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:49.157245 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:50.162159 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:51.166931 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:52.172077 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:53.177795 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:54.183211 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:54:55.187278 I | pkg/netutil: resolving kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 to 192.168.253.80:2380
2023-07-20 08:54:55.197378 C | etcdmain: member 6fa7a00416c5d67d has already been bootstrapped
# kubectl logs kelemetry-etcd-1
...
2023-07-20 08:58:06.740587 W | etcdserver: cannot get the version of member 6fa7a00416c5d67d (Get http://kelemetry-etcd-0.kelemetry-etcd.default.svc:2380/version: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host)
2023-07-20 08:58:08.670076 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2023-07-20 08:58:08.670482 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_SNAPSHOT")
2023-07-20 08:58:10.745502 W | etcdserver: failed to reach the peerURL(http://kelemetry-etcd-0.kelemetry-etcd.default.svc:2380) of member 6fa7a00416c5d67d (Get http://kelemetry-etcd-0.kelemetry-etcd.default.svc:2380/version: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host)
2023-07-20 08:58:10.745557 W | etcdserver: cannot get the version of member 6fa7a00416c5d67d (Get http://kelemetry-etcd-0.kelemetry-etcd.default.svc:2380/version: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host)
2023-07-20 08:58:13.670395 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2023-07-20 08:58:13.670855 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_SNAPSHOT")
2023-07-20 08:58:14.750999 W | etcdserver: failed to reach the peerURL(http://kelemetry-etcd-0.kelemetry-etcd.default.svc:2380) of member 6fa7a00416c5d67d (Get http://kelemetry-etcd-0.kelemetry-etcd.default.svc:2380/version: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host)
2023-07-20 08:58:14.751047 W | etcdserver: cannot get the version of member 6fa7a00416c5d67d (Get http://kelemetry-etcd-0.kelemetry-etcd.default.svc:2380/version: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host)
# kubectl logs kelemetry-etcd-2
...
2023-07-20 08:58:53.156209 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2023-07-20 08:58:53.156809 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_SNAPSHOT")
2023-07-20 08:58:58.156996 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2023-07-20 08:58:58.157404 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_SNAPSHOT")
2023-07-20 08:59:03.157340 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2023-07-20 08:59:03.157662 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_SNAPSHOT")
2023-07-20 08:59:08.157610 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2023-07-20 08:59:08.157900 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_SNAPSHOT")
2023-07-20 08:59:13.158076 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2023-07-20 08:59:13.158478 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_SNAPSHOT")

Kelemetry version

0.2.2

Environment

k8s:1.23.17 jaeger:1.4.2

SOF3 commented 1 year ago

Is there a problem with your CoreDNS? The logs seem to say that the CoreDNS query failed.
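
One way to check this reading of the logs is to count which hostnames fail to resolve and against which DNS server. The following is a minimal sketch, not part of Kelemetry or etcd: the function name `summarize_dns_failures` and the regular expression are illustrative assumptions, and the sample lines are copied from the logs above.

```python
import re
from collections import Counter

# Matches the resolver error embedded in the etcd warnings above, e.g.
# "lookup <host> on <dns-server>: no such host".
# The pattern is an assumption based only on the log lines in this issue.
LOOKUP_FAILURE = re.compile(r"lookup (\S+) on (\S+): no such host")


def summarize_dns_failures(log_text):
    """Count log lines reporting a failed lookup, per (hostname, DNS server)."""
    failures = Counter()
    for line in log_text.splitlines():
        # Some lines repeat the same error in a nested message,
        # so each (host, server) pair is counted once per line.
        for pair in set(LOOKUP_FAILURE.findall(line)):
            failures[pair] += 1
    return failures


SAMPLE = """\
2023-07-20 08:54:32.060864 W | pkg/netutil: failed resolving host kelemetry-etcd-0.kelemetry-etcd.default.svc:2380 (lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host); retrying in 1s
2023-07-20 08:58:08.670076 W | rafthttp: health check for peer 6fa7a00416c5d67d could not connect: dial tcp: lookup kelemetry-etcd-0.kelemetry-etcd.default.svc on 10.96.0.10:53: no such host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
"""

if __name__ == "__main__":
    for (host, server), count in summarize_dns_failures(SAMPLE).items():
        print(f"{host} via {server}: {count} failed lookups")
```

If every failure points at the same hostname and the same DNS server, as in the logs posted here, that is consistent with a DNS record that was missing at the time rather than a problem inside etcd itself.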

calvinxu commented 1 year ago

Yes, it might be related to CoreDNS. After I re-deployed, it seems to run well now. However, I found the warnings below in the log of one pod; they do not appear in the logs of the other two pods.

# kubectl logs kelemetry-etcd-0
...
2023-07-21 01:23:19.283550 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 20.010227ms, to 8359b3cd6960003a)
2023-07-21 01:23:19.283607 W | etcdserver: server is likely overloaded
2023-07-21 01:23:19.283623 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 20.095774ms, to 48518a52c6de43e2)
2023-07-21 01:23:19.283631 W | etcdserver: server is likely overloaded

SOF3 commented 1 year ago

As the warning says, this seems to be caused by the etcd server getting overloaded. It does not seem to be an issue with Kelemetry, and I cannot reproduce it, so I am closing this issue. Feel free to post here if you have further updates.
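
To judge how severe the overload is, the heartbeat warnings can be grouped by peer. This is a minimal sketch under stated assumptions: the function name `heartbeat_overruns` is hypothetical, the pattern is derived only from the two warnings quoted in this issue, and it handles durations printed in milliseconds only (lines that use other units are skipped).

```python
import re

# Derived from the warnings quoted in this issue:
# "exceeded the 100ms timeout for 20.010227ms, to 8359b3cd6960003a".
# Assumption: both durations are printed in milliseconds.
HEARTBEAT = re.compile(
    r"exceeded the (\d+(?:\.\d+)?)ms timeout for (\d+(?:\.\d+)?)ms, to ([0-9a-f]+)"
)


def heartbeat_overruns(log_text):
    """Return {peer_id: [overrun_ms, ...]} for late heartbeats in the log."""
    overruns = {}
    for line in log_text.splitlines():
        match = HEARTBEAT.search(line)
        if match is None:
            continue  # not a heartbeat warning, or a unit other than ms
        _timeout_ms, overrun_ms, peer_id = match.groups()
        overruns.setdefault(peer_id, []).append(float(overrun_ms))
    return overruns


SAMPLE = """\
2023-07-21 01:23:19.283550 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 20.010227ms, to 8359b3cd6960003a)
2023-07-21 01:23:19.283607 W | etcdserver: server is likely overloaded
2023-07-21 01:23:19.283623 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 20.095774ms, to 48518a52c6de43e2)
"""

if __name__ == "__main__":
    for peer, values in heartbeat_overruns(SAMPLE).items():
        print(f"peer {peer}: {len(values)} late heartbeats, worst {max(values)} ms")
```

Occasional small overruns like the ones quoted here usually point at slow disk or CPU contention on the node that sends the heartbeats; frequent or growing overruns would be worth investigating further.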