kcp-dev / kcp

Kubernetes-like control planes for form-factors and use-cases beyond Kubernetes and container workloads.
https://kcp.io
Apache License 2.0
2.35k stars 381 forks source link

Flake: embedded etcd falls over #1425

Closed ncdc closed 7 months ago

ncdc commented 2 years ago

From e.g. https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kcp-dev_kcp/1423/pull-ci-kcp-dev-kcp-main-e2e-multiple-runs/1542910917962895360

=== FAIL: test/e2e/reconciler/cluster TestClusterController (21.33s)
    util.go:109: Saving test artifacts for test "TestClusterController" under "/logs/artifacts/TestClusterController/1966731418".
    fixture.go:157: Starting kcp servers...
    kcp.go:235: running: /go/src/github.com/kcp-dev/kcp/bin/kcp start --root-directory /logs/artifacts/TestClusterController/1966731418/data/kcp/shared --secure-port=33131 --embedded-etcd-client-port=44651 --embedded-etcd-peer-port=46809 --embedded-etcd-wal-size-bytes=5000 --kubeconfig-path=/logs/artifacts/TestClusterController/1966731418/data/kcp/shared/admin.kubeconfig --discovery-poll-interval=5s -v=4 --token-auth-file /logs/artifacts/TestClusterController/1966731418/data/auth-tokens.csv
I0701 16:57:09.788695   38634 request.go:717] Waited for 2.25908432s due to client-side throttling, not priority and fairness, request: GET:https://10.130.116.30:35083/clusters/root:e2e-org-2qhn7:e2e-workspace-s6pbp/apis/flowcontrol.apiserver.k8s.io/v1beta1?timeout=32s
E0701 16:57:10.190526   38634 apiimporter.go:173] error pulling CRDs: Get "https://10.130.116.30:35083/clusters/root:e2e-org-2qhn7:e2e-workspace-s6pbp/api?timeout=32s": dial tcp 10.130.116.30:35083: connect: connection refused
    kcp.go:488: success contacting https://10.130.116.30:33131/livez
    kcp.go:488: success contacting https://10.130.116.30:33131/readyz
    fixture.go:184: Started kcp servers after 20.591787851s
    fixture.go:266: Created organization workspace root:e2e-org-r8cql
    kcp.go:510: error contacting /readyz: an error on the server ("[+]ping ok\n[+]log ok\n[-]etcd failed: reason withheld

Seeing this in a variety of test cases / not sure it's specific to any particular one.

ncdc commented 2 years ago

Excerpt from kcp.log

I0701 16:57:40.802803   39127 httplog.go:129] "HTTP" verb="POST" URI="/clusters/root:e2e-org-r8cql:e2e-workspace-v7t6r/apis/apiresource.kcp.dev/v1alpha1/negotiatedapiresources" latency="1.907054527s" userAgent="kcp/v1.23.5+kcp (linux/amd64) kubernetes/65a10a2/kcp-api-resource-controller" audit-ID="76d2997a-7a29-4b37-a595-a600740c1ea0" srcIP="[::1]:52980" apf_pl="exempt" apf_fs="exempt" apf_fd="" resp=201
I0701 16:57:40.803621   39127 httplog.go:129] "HTTP" verb="PATCH" URI="/clusters/root:e2e-org-r8cql:e2e-workspace-v7t6r/apis/apis.kcp.dev/v1alpha1/apiexports/kubernetes" latency="1.907893791s" userAgent="kcp/v1.23.5+kcp (linux/amd64) kubernetes/65a10a2/kcp-apiexport-controller" audit-ID="2de1f6f4-ab08-43a1-9867-c6bbafe8f63d" srcIP="[::1]:52980" apf_pl="exempt" apf_fs="exempt" apf_fd="" resp=200
I0701 16:57:40.811039   39127 trace.go:205] Trace[1761944324]: "GuaranteedUpdate etcd3" type:*unstructured.Unstructured (01-Jul-2022 16:57:38.928) (total time: 1882ms):
Trace[1761944324]: ---"initial value restored" 4ms (16:57:38.933)
Trace[1761944324]: ---"Transaction prepared" 12ms (16:57:38.945)
Trace[1761944324]: ---"Transaction committed" 1858ms (16:57:40.803)
Trace[1761944324]: [1.882086085s] [1.882086085s] END
I0701 16:57:40.824094   39127 trace.go:205] Trace[1717036715]: "Create" url:/apis/apis.kcp.dev/v1alpha1/apiresourceschemas,user-agent:kcp/v1.23.5+kcp (linux/amd64) kubernetes/65a10a2/kcp-workloads-apiexport-controller,audit-id:c26c9d63-ca9c-4d59-98ec-a71118b9b2b3,client:::1,accept:application/json, */*,protocol:HTTP/2.0 (01-Jul-2022 16:57:38.931) (total time: 1892ms):
Trace[1717036715]: ---"About to convert to expected version" 0ms (16:57:38.932)
Trace[1717036715]: ---"Conversion done" 8ms (16:57:38.940)
Trace[1717036715]: ---"About to store object in database" 0ms (16:57:38.940)
Trace[1717036715]: ---"Object stored in database" 1867ms (16:57:40.808)
Trace[1717036715]: [1.892691606s] [1.892691606s] END
I0701 16:57:40.824248   39127 httplog.go:129] "HTTP" verb="POST" URI="/clusters/root:e2e-org-r8cql:e2e-workspace-v7t6r/apis/apis.kcp.dev/v1alpha1/apiresourceschemas" latency="1.893057582s" userAgent="kcp/v1.23.5+kcp (linux/amd64) kubernetes/65a10a2/kcp-workloads-apiexport-controller" audit-ID="c26c9d63-ca9c-4d59-98ec-a71118b9b2b3" srcIP="[::1]:52980" apf_pl="exempt" apf_fs="exempt" apf_fd="" resp=201
I0701 16:57:40.830600   39127 trace.go:205] Trace[2043337700]: "Update" url:/apis/apiresource.kcp.dev/v1alpha1/apiresourceimports/deployments.pcluster-01.v1.apps/status,user-agent:kcp/v1.23.5+kcp (linux/amd64) kubernetes/65a10a2/kcp-api-resource-controller,audit-id:f93d3b68-a726-418b-a0e2-99b31093a810,client:::1,accept:application/json, */*,protocol:HTTP/2.0 (01-Jul-2022 16:57:38.913) (total time: 1916ms):
Trace[2043337700]: ---"About to convert to expected version" 0ms (16:57:38.914)
Trace[2043337700]: ---"Conversion done" 13ms (16:57:38.927)
Trace[2043337700]: ---"About to store object in database" 0ms (16:57:38.927)
Trace[2043337700]: ---"Object stored in database" 1883ms (16:57:40.811)
Trace[2043337700]: [1.916939578s] [1.916939578s] END
I0701 16:57:40.831167   39127 httplog.go:129] "HTTP" verb="PUT" URI="/clusters/root:e2e-org-r8cql:e2e-workspace-v7t6r/apis/apiresource.kcp.dev/v1alpha1/apiresourceimports/deployments.pcluster-01.v1.apps/status" latency="1.917767698s" userAgent="kcp/v1.23.5+kcp (linux/amd64) kubernetes/65a10a2/kcp-api-resource-controller" audit-ID="f93d3b68-a726-418b-a0e2-99b31093a810" srcIP="[::1]:52980" apf_pl="exempt" apf_fs="exempt" apf_fd="" resp=200
I0701 16:57:40.853858   39127 apiexport_apireconciler_controller.go:144] Queueing APIExport root:e2e-org-r8cql:e2e-workspace-v7t6r|kubernetes for APIResourceSchema rev-444.deployments.apps
I0701 16:57:40.853923   39127 apiexport_apireconciler_reconcile.go:44] No APIs found for API domain key root:e2e-org-r8cql:e2e-workspace-v7t6r/kubernetes
I0701 16:57:40.853967   39127 apibinding_controller.go:273] Mapping APIResourceSchema "root:e2e-org-r8cql:e2e-workspace-v7t6r#$#rev-444.deployments.apps"
I0701 16:57:40.853994   39127 syncer_apireconciler_controller.go:206] Queueing APIExport root:e2e-org-r8cql:e2e-workspace-v7t6r|kubernetes for APIResourceSchema rev-444.deployments.apps
I0701 16:57:40.854034   39127 syncer_apireconciler_reconcile.go:46] No APIs found for API domain key root:e2e-org-r8cql:e2e-workspace-v7t6r#$#pcluster-01
I0701 16:57:40.940471   39127 httplog.go:129] "HTTP" verb="POST" URI="/clusters/root:e2e-org-r8cql:e2e-workspace-v7t6r/apis/authorization.k8s.io/v1/subjectaccessreviews" latency="1.125739ms" userAgent="kcp/v1.23.5+kcp (linux/amd64) kubernetes/65a10a2/syncer-virtual-workspace" audit-ID="57bcc8c4-6d55-4c59-91a7-3babcd3caa02" srcIP="[::1]:52980" apf_pl="exempt" apf_fs="exempt" apf_fd="" resp=201
I0701 16:57:40.941235   39127 httplog.go:129] "HTTP" verb="GET" URI="/clusters/root:e2e-org-r8cql:e2e-workspace-v7t6r/services/syncer/root:e2e-org-r8cql:e2e-workspace-v7t6r/pcluster-01/clusters/%2A/api?timeout=32s" latency="2.800989ms" userAgent="kcp#spec-syncer/v0.0.0-master+$Format:%H$" audit-ID="5d724cf5-55bb-42bb-afdd-f00754b60f9b" srcIP="10.130.116.30:41150" resp=200
I0701 16:57:40.944529   39127 httplog.go:129] "HTTP" verb="POST" URI="/clusters/root:e2e-org-r8cql:e2e-workspace-v7t6r/apis/authorization.k8s.io/v1/subjectaccessreviews" latency="809.468µs" userAgent="kcp/v1.23.5+kcp (linux/amd64) kubernetes/65a10a2/syncer-virtual-workspace" audit-ID="ac5a66a9-4925-41b4-8467-7a91d219622b" srcIP="[::1]:52980" apf_pl="exempt" apf_fs="exempt" apf_fd="" resp=201
I0701 16:57:40.944968   39127 httplog.go:129] "HTTP" verb="GET" URI="/clusters/root:e2e-org-r8cql:e2e-workspace-v7t6r/services/syncer/root:e2e-org-r8cql:e2e-workspace-v7t6r/pcluster-01/clusters/%2A/apis?timeout=32s" latency="1.75211ms" userAgent="kcp#spec-syncer/v0.0.0-master+$Format:%H$" audit-ID="87d1af87-8cdf-481e-9c43-ae184d9f9e37" srcIP="10.130.116.30:41150" resp=404
I0701 16:57:40.948143   39127 httplog.go:129] "HTTP" verb="POST" URI="/clusters/root:e2e-org-r8cql:e2e-workspace-v7t6r/apis/authorization.k8s.io/v1/subjectaccessreviews" latency="777.372µs" userAgent="kcp/v1.23.5+kcp (linux/amd64) kubernetes/65a10a2/syncer-virtual-workspace" audit-ID="81eb6d1c-3fe3-4575-abb6-7d73010ff221" srcIP="[::1]:52980" apf_pl="exempt" apf_fs="exempt" apf_fd="" resp=201
I0701 16:57:40.948509   39127 httplog.go:129] "HTTP" verb="GET" URI="/clusters/root:e2e-org-r8cql:e2e-workspace-v7t6r/services/syncer/root:e2e-org-r8cql:e2e-workspace-v7t6r/pcluster-01/clusters/%2A/api/v1?timeout=32s" latency="1.559602ms" userAgent="kcp#spec-syncer/v0.0.0-master+$Format:%H$" audit-ID="e4e2b787-ee21-4255-8050-c5def713b9b7" srcIP="10.130.116.30:41150" resp=404
{"level":"warn","ts":"2022-07-01T16:57:41.299Z","caller":"etcdserver/v3_server.go:815","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":5280331726964144124,"retry-timeout":"500ms"}
{"level":"warn","ts":"2022-07-01T16:57:41.800Z","caller":"etcdserver/v3_server.go:815","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":5280331726964144124,"retry-timeout":"500ms"}
{"level":"warn","ts":"2022-07-01T16:57:41.910Z","logger":"etcd-client","caller":"v3@v3.5.0/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0024d6540/#initially=[https://localhost:44651]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
I0701 16:57:41.910762   39127 healthz.go:257] etcd check failed: livez
[-]etcd failed: error getting data from etcd: context deadline exceeded
{"level":"warn","ts":"2022-07-01T16:57:41.910Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"1.999879071s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/health\" ","response":"","error":"context deadline exceeded"}
{"level":"warn","ts":"2022-07-01T16:57:41.910Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2022-07-01T16:57:39.910Z","time spent":"1.999997464s","remote":"127.0.0.1:58212","response type":"/etcdserverpb.KV/Range","request count":0,"request size":18,"response count":0,"response size":0,"request content":"key:\"/registry/health\" "}
W0701 16:57:41.911009   39127 server.go:1310] [core] grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
mjudeikis commented 7 months ago

/close reopen if still needed

kcp-ci-bot commented 7 months ago

@mjudeikis: Closing this issue.

In response to [this](https://github.com/kcp-dev/kcp/issues/1425#issuecomment-2018478142): >/close >reopen if still needed Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.