Closed: ncdc closed this issue 7 months ago
We need to amend `framework.Eventually` to spit out the conditions/reasons and/or add the ClusterWorkspace object to the artifacts; otherwise this is not something that can be debugged post-hoc.
One specific (new?) reason this happens: https://github.com/kcp-dev/kcp/pull/2440#issuecomment-1371202704
Saw this happen again in https://github.com/kcp-dev/kcp/actions/runs/3968949742/jobs/6802786983. Did some digging into the logs, and it looks like the `kcp-workspacetypes-bootstrap-root:universal` and `kcp-apibinder-initializer` reconcilers fight with each other for a bit over patching the `LogicalCluster`. Eventually the bootstrapper bit completes (removing its initializer). Then the apibinder initializer tries multiple times to remove its initializer. I see this over and over:
```
I0120 15:45:04.846704 38354 committer.go:97] "patching *v1alpha1.LogicalCluster" reconciler="kcp-apibinder-initializer" key="ukwxjtfpgmf4w4bi|cluster" logicalcluster.workspace="ukwxjtfpgmf4w4bi" logicalcluster.namespace="" logicalcluster.name="cluster" logicalcluster.apiVersion="" patch="{\"metadata\":{\"resourceVersion\":\"3861\",\"uid\":\"9a8ffd73-d99b-4d16-b621-7a710a11e26d\"},\"status\":{\"initializers\":[\"root:universal\"]}}"
I0120 15:45:05.253906 38441 httplog.go:131] "HTTP" verb="PATCH" URI="/services/initializingworkspaces/system:apibindings/clusters/ukwxjtfpgmf4w4bi/apis/core.kcp.io/v1alpha1/logicalclusters/cluster/status" latency="404.302777ms" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-apibinder-initializer" audit-ID="87f7af9a-3718-4ff1-90d2-130ed357acee" srcIP="10.1.0.36:56398" resp=409 addedInfo=<
```
It's trying to patch at resourceVersion 3861 to remove the `system:apibindings` initializer, leaving the `root:universal` initializer remaining. Except `root:universal` removed itself previously - that was what bumped the RV to 3861.
This is the timing of the attempts from the apibinder controller to remove its initializer, each resulting in a conflict:
It looks like it's gone into backoff. I am not clear why the reconciler continues to try to reconcile from RV 3861 instead of seeing a newer version in the cache.
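The usual way out of this loop is to re-read the object from the live API on each conflict instead of reusing the stale cached copy — the pattern client-go's `retry.RetryOnConflict` is built around. A self-contained sketch of that pattern, with hypothetical `fakeServer`/`obj` stand-ins for the API server and the LogicalCluster (not kcp's actual committer code):

```go
package main

import (
	"errors"
	"fmt"
)

// conflictErr stands in for an HTTP 409 from the API server.
var conflictErr = errors.New("409 Conflict")

// obj is a minimal stand-in for a LogicalCluster: a resourceVersion and a
// list of initializers.
type obj struct {
	rv           int
	initializers []string
}

// fakeServer simulates the live API: its RV has already moved past what the
// controller's informer cache saw.
type fakeServer struct{ current obj }

func (s *fakeServer) get() obj { return s.current }

func (s *fakeServer) patch(rv int, initializers []string) error {
	if rv != s.current.rv {
		return conflictErr // precondition failed: stale resourceVersion
	}
	s.current.rv++
	s.current.initializers = initializers
	return nil
}

// removeInitializer retries on conflict, but crucially re-reads the object
// from the live server on each attempt instead of reusing a stale cached
// copy, so a single stale RV cannot pin the controller in a conflict loop.
func removeInitializer(s *fakeServer, name string, maxAttempts int) error {
	for i := 0; i < maxAttempts; i++ {
		live := s.get() // fresh read, not the informer cache
		var keep []string
		for _, init := range live.initializers {
			if init != name {
				keep = append(keep, init)
			}
		}
		if err := s.patch(live.rv, keep); err == nil {
			return nil
		} else if !errors.Is(err, conflictErr) {
			return err
		}
	}
	return fmt.Errorf("still conflicting after %d attempts", maxAttempts)
}

func main() {
	// Cache saw RV 3861; the live object is already at 3862.
	s := &fakeServer{current: obj{rv: 3862, initializers: []string{"system:apibindings"}}}
	fmt.Println(removeInitializer(s, "system:apibindings", 5))
}
```

The symptom in the logs — the same RV 3861 on every retry — suggests the retry path here is not getting a fresher object, whether because of the cache or the virtual workspace in front of it.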
The last entry (and another one that followed) came after the test had failed, cleanup had occurred, and the LogicalCluster no longer existed. Note that it's still trying RV 3861. I'm a little confused why it's showing that both the `root:universal` and `system:apibindings` initializers still exist.
```
I0120 15:45:47.243815 38354 committer.go:97] "patching *v1alpha1.LogicalCluster" reconciler="kcp-apibinder-initializer" key="ukwxjtfpgmf4w4bi|cluster" logicalcluster.workspace="ukwxjtfpgmf4w4bi" logicalcluster.namespace="" logicalcluster.name="cluster" logicalcluster.apiVersion="" patch="{\"metadata\":{\"resourceVersion\":\"3861\",\"uid\":\"9a8ffd73-d99b-4d16-b621-7a710a11e26d\"},\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2023-01-20T15:45:47Z\",\"message\":\"encountered errors: an error on the server (\\\"error getting logicalcluster ukwxjtfpgmf4w4bi|cluster: LogicalCluster.core.kcp.io \\\\\\\"cluster\\\\\\\" not found\\\") has prevented the request from succeeding (post apibindings.apis.kcp.io)\",\"reason\":\"APIBindingErrors\",\"severity\":\"Error\",\"status\":\"False\",\"type\":\"APIBindingsInitialized\"},{\"lastTransitionTime\":\"2023-01-20T15:45:03Z\",\"message\":\"Initializers still exist: [root:universal system:apibindings]\",\"reason\":\"InitializerExists\",\"severity\":\"Info\",\"status\":\"False\",\"type\":\"WorkspaceInitialized\"}]}}"
```
The `kcp-logicalcluster-deletion` reconciler is able to see the newer RV (it gets a conflict on 3925 and then quickly retries with 3927 and succeeds):
```
I0120 15:45:37.385918 38354 committer.go:97] "patching *v1alpha1.LogicalCluster" component="kcp" reconciler="kcp-logicalcluster-deletion" key="ukwxjtfpgmf4w4bi|cluster" logicalcluster.workspace="ukwxjtfpgmf4w4bi" logicalcluster.namespace="" logicalcluster.name="cluster" logicalcluster.apiVersion="" patch="{\"metadata\":{\"resourceVersion\":\"3925\",\"uid\":\"9a8ffd73-d99b-4d16-b621-7a710a11e26d\"},\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2023-01-20T15:45:37Z\",\"message\":\"Some resources are remaining: apibindings.apis.kcp.io has 5 resource instances, namespaces. has 1 resource instances; Some content in the logical cluster has finalizers remaining: apis.kcp.io/apibinding-finalizer in 5 resource instances\",\"reason\":\"SomeResourcesRemain\",\"severity\":\"Info\",\"status\":\"False\",\"type\":\"WorkspaceContentDeleted\"},{\"lastTransitionTime\":\"2023-01-20T15:45:04Z\",\"message\":\"Initializers still exist: [system:apibindings]\",\"reason\":\"InitializerExists\",\"severity\":\"Info\",\"status\":\"False\",\"type\":\"WorkspaceInitialized\"}]}}"
I0120 15:45:37.407808 38354 httplog.go:131] "HTTP" verb="PATCH" URI="/clusters/ukwxjtfpgmf4w4bi/apis/core.kcp.io/v1alpha1/logicalclusters/cluster/status" latency="19.772867ms" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-logicalcluster-deletion" audit-ID="ff2e5460-2780-42ec-8bfd-1c4b9ad2c1e9" srcIP="[::1]:45884" resp=409
I0120 15:45:38.578864 38354 committer.go:97] "patching *v1alpha1.LogicalCluster" component="kcp" reconciler="kcp-logicalcluster-deletion" key="ukwxjtfpgmf4w4bi|cluster" logicalcluster.workspace="ukwxjtfpgmf4w4bi" logicalcluster.namespace="" logicalcluster.name="cluster" logicalcluster.apiVersion="" patch="{\"metadata\":{\"resourceVersion\":\"3927\",\"uid\":\"9a8ffd73-d99b-4d16-b621-7a710a11e26d\"},\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2023-01-20T15:45:38Z\",\"message\":\"Some resources are remaining: namespaces. has 1 resource instances\",\"reason\":\"SomeResourcesRemain\",\"severity\":\"Info\",\"status\":\"False\",\"type\":\"WorkspaceContentDeleted\"},{\"lastTransitionTime\":\"2023-01-20T15:45:04Z\",\"message\":\"Initializers still exist: [system:apibindings]\",\"reason\":\"InitializerExists\",\"severity\":\"Info\",\"status\":\"False\",\"type\":\"WorkspaceInitialized\"}]}}"
I0120 15:45:38.604307 38354 httplog.go:131] "HTTP" verb="PATCH" URI="/clusters/ukwxjtfpgmf4w4bi/apis/core.kcp.io/v1alpha1/logicalclusters/cluster/status" latency="16.559642ms" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-logicalcluster-deletion" audit-ID="72c39d0b-2609-43a4-ad30-be7154b5cf1d" srcIP="[::1]:45884" resp=200
```
Is there any chance that the forwarding storage (pkg/virtual/framework/forwardingregistry) is causing this staleness?
Seeing stuff like this:
```
I0120 15:45:04.086465 38354 panic.go:884] "HTTP" verb="WATCH" URI="/services/initializingworkspaces/system:apibindings/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=816&timeout=8m8s&timeoutSeconds=488&watch=true" latency="1m0.009056546s" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-apibinder-initializer" audit-ID="b0e0a207-e7a4-4ac7-a4d6-2ed117f57688" srcIP="10.1.0.36:37458" resp=200
E0120 15:45:04.086556 38354 wrap.go:53] timeout or abort while handling: method=GET URI="/services/initializingworkspaces/system:apibindings/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=816&timeout=8m8s&timeoutSeconds=488&watch=true" audit-ID="b0e0a207-e7a4-4ac7-a4d6-2ed117f57688"
2023/01/20 15:45:04 httputil: ReverseProxy read error during body copy: context deadline exceeded
E0120 15:45:04.087098 38354 timeout.go:141] post-timeout activity - time-elapsed: 2.3µs, GET "/services/initializingworkspaces/system:apibindings/clusters/*/apis/core.kcp.io/v1alpha1/logicalclusters" result: net/http: abort Handler
```
Here's a filtered set of logs showing reflector activity plus httplog for list/watch of logicalclusters for the time period in question:
```
I0120 15:44:56.984445 38457 reflector.go:536] k8s.io/client-go@v0.0.0-20230113194502-8f8fb5fa4eea/tools/cache/reflector.go:167: Watch close - *v1alpha1.LogicalCluster total 8 items received
I0120 15:44:56.984169 38420 httplog.go:131] "HTTP" verb="WATCH" URI="/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=254&timeout=7m53s&timeoutSeconds=473&watch=true" latency="7m53.001359805s" userAgent="virtual-workspaces/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5" audit-ID="64c30d7b-3cd6-44f8-9e5f-7766ff3676c8" srcIP="[::1]:43984" resp=200
I0120 15:45:03.852115 38420 reflector.go:255] Listing and watching *v1alpha1.LogicalCluster from k8s.io/client-go@v0.0.0-20230113194502-8f8fb5fa4eea/tools/cache/reflector.go:167
I0120 15:45:03.906024 38420 httplog.go:131] "HTTP" verb="LIST" URI="/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?labelSelector=initializer.internal.kcp.io%2Fdd58c810629ccc3e49c7d8225b9bcfaa42a%3Ddd58c810629ccc3e49c7d8225b9bcfaa42a4a2ecbb7e4a9be76614e3%2Ctenancy.kcp.io%2Fphase%3DInitializing&resourceVersion=279" latency="610.324µs" userAgent="virtual-workspaces/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/initializingworkspaces-virtual-workspace" audit-ID="c9dad497-7386-44fe-9770-5d3f3c9eae61" srcIP="[::1]:43984" resp=200
I0120 15:45:03.907749 38457 httplog.go:131] "HTTP" verb="LIST" URI="/services/initializingworkspaces/system:apibindings/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?resourceVersion=279" latency="51.544429ms" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-apibinder-initializer" audit-ID="bc920e77-6217-438f-bd06-d8e204819a37" srcIP="10.1.0.36:50728" resp=200 addedInfo=<
	&{kcp-shard-1 [system:masters system:authenticated] map[]} is acting as &{system:apiserver f88b80c1-0901-4cdf-bff5-0dffa3fd694c [system:masters system:authenticated] map[]}
 >
I0120 15:45:04.086922 38441 httplog.go:131] "HTTP" verb="WATCH" URI="/services/initializingworkspaces/system:apibindings/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=816&timeout=8m8s&timeoutSeconds=488&watch=true" latency="1m0.009058245s" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-apibinder-initializer" audit-ID="716a097c-9b87-4c8d-8e38-427b91d990b8" srcIP="10.1.0.36:56398" resp=200 addedInfo=<
	&{kcp-shard-0 [system:masters system:authenticated] map[]} is acting as &{system:apiserver 17fe5dc1-5ca5-45d3-be87-f20346c02209 [system:masters system:authenticated] map[]}
 >
W0120 15:45:04.105390 38354 reflector.go:442] k8s.io/client-go@v0.0.0-20230113194502-8f8fb5fa4eea/tools/cache/reflector.go:167: watch of *v1alpha1.LogicalCluster ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 147; INTERNAL_ERROR; received from peer") has prevented the request from succeeding
I0120 15:45:04.117330 38354 httplog.go:131] "HTTP" verb="WATCH" URI="/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&labelSelector=initializer.internal.kcp.io%2Fdd58c810629ccc3e49c7d8225b9bcfaa42a%3Ddd58c810629ccc3e49c7d8225b9bcfaa42a4a2ecbb7e4a9be76614e3%2Ctenancy.kcp.io%2Fphase%3DInitializing&resourceVersion=816&timeoutSeconds=488&watch=true" latency="1m0.026760041s" userAgent="virtual-workspaces/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/initializingworkspaces-virtual-workspace" audit-ID="7b2f45c6-2231-4bf3-94de-6adc3c0bdd66" srcIP="[::1]:57604" resp=200
I0120 15:45:27.691653 38674 reflector.go:536] k8s.io/client-go@v0.0.0-20230113194502-8f8fb5fa4eea/tools/cache/reflector.go:167: Watch close - *v1alpha1.LogicalCluster total 6 items received
I0120 15:45:27.691300 38420 httplog.go:131] "HTTP" verb="WATCH" URI="/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=278&timeout=8m18s&timeoutSeconds=498&watch=true" latency="8m18.001507714s" userAgent="kcp-front-proxy/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5" audit-ID="2bd6fd88-6b73-49eb-9cbd-c354550e2eea" srcIP="10.1.0.36:53304" resp=200
I0120 15:45:36.672804 38354 reflector.go:255] Listing and watching *v1alpha1.LogicalCluster from k8s.io/client-go@v0.0.0-20230113194502-8f8fb5fa4eea/tools/cache/reflector.go:167
I0120 15:45:36.803375 38354 httplog.go:131] "HTTP" verb="LIST" URI="/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?labelSelector=initializer.internal.kcp.io%2Fdd58c810629ccc3e49c7d8225b9bcfaa42a%3Ddd58c810629ccc3e49c7d8225b9bcfaa42a4a2ecbb7e4a9be76614e3%2Ctenancy.kcp.io%2Fphase%3DInitializing&resourceVersion=3861" latency="520.92µs" userAgent="virtual-workspaces/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/initializingworkspaces-virtual-workspace" audit-ID="4bae6306-9f90-4a28-b9d3-2a7c70391de2" srcIP="[::1]:57604" resp=200
I0120 15:45:36.811418 38441 httplog.go:131] "HTTP" verb="LIST" URI="/services/initializingworkspaces/system:apibindings/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?resourceVersion=3861" latency="109.449805ms" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-apibinder-initializer" audit-ID="ef7cd986-08ab-4b5e-8d0b-864e3b26364e" srcIP="10.1.0.36:56398" resp=200 addedInfo=<
	&{kcp-shard-0 [system:masters system:authenticated] map[]} is acting as &{system:apiserver 17fe5dc1-5ca5-45d3-be87-f20346c02209 [system:masters system:authenticated] map[]}
 >
I0120 15:45:51.694521 38674 reflector.go:536] k8s.io/client-go@v0.0.0-20230113194502-8f8fb5fa4eea/tools/cache/reflector.go:167: Watch close - *v1alpha1.LogicalCluster total 699 items received
I0120 15:45:51.694323 38354 httplog.go:131] "HTTP" verb="WATCH" URI="/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=494&timeout=8m42s&timeoutSeconds=522&watch=true" latency="8m42.000832803s" userAgent="kcp-front-proxy/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5" audit-ID="4831d362-1427-4954-acb7-de8ceebdd06a" srcIP="10.1.0.36:36384" resp=200
I0120 15:45:59.483599 38354 httplog.go:131] "HTTP" verb="WATCH" URI="/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=412&timeoutSeconds=544&watch=true" latency="9m4.000518017s" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-partial-metadata-informers" audit-ID="ef068ff6-9ed1-449f-b7b0-a120c9a5bb60" srcIP="[::1]:45884" resp=200
I0120 15:46:03.911728 38457 httplog.go:131] "HTTP" verb="WATCH" URI="/services/initializingworkspaces/system:apibindings/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=279&timeout=9m7s&timeoutSeconds=547&watch=true" latency="1m0.002806565s" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-apibinder-initializer" audit-ID="a32a850e-13e2-47f9-8707-de168d253fcb" srcIP="10.1.0.36:50728" resp=200 addedInfo=<
```
Can repro the 1-minute timeout easily locally with cmd/sharded-test-server. Looking into why some reverse proxy somewhere is terminating watches after 1 minute.
Fun fact: if you use `http.DefaultTransport` and the request's `Context` doesn't have a `Done()` channel assigned (i.e. it's not cancelable), the code assigns a 1-minute timeout to avoid leaking goroutines...
Hmm, that's not it. Still digging
Aha, so: it's a watch request for `/services/initializingworkspaces/...`. The RequestInfoFactory doesn't know how to parse this URL, so it returns a `RequestInfo` with only a handful of fields filled in. Because of this, the `WithRequestDeadline` portion of the handler chain doesn't see this as a long-running request, and installs a 60-second deadline.
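To sketch why the unparsed RequestInfo matters: the real long-running check (`BasicLongRunningRequestCheck` in k8s.io/apiserver) exempts watch/proxy verbs and certain subresources from the deadline. The following is a simplified stand-alone sketch of that predicate, with a trimmed-down hypothetical `requestInfo` type, not the actual apiserver code:

```go
package main

import "fmt"

// requestInfo is a trimmed-down stand-in for k8s.io/apiserver's RequestInfo.
// When the RequestInfoFactory fails to parse a URL (as with the
// /services/initializingworkspaces/... virtual-workspace path), fields like
// Verb and Subresource end up empty.
type requestInfo struct {
	Verb        string
	Subresource string
}

// isLongRunning is a simplified sketch of the BasicLongRunningRequestCheck
// predicate: watch/proxy verbs and certain subresources are treated as
// long-running and exempted from the request deadline.
func isLongRunning(ri requestInfo) bool {
	longVerbs := map[string]bool{"watch": true, "proxy": true}
	longSubresources := map[string]bool{
		"attach": true, "exec": true, "log": true, "portforward": true, "proxy": true,
	}
	return longVerbs[ri.Verb] || longSubresources[ri.Subresource]
}

func main() {
	// A correctly parsed watch request is long-running...
	fmt.Println(isLongRunning(requestInfo{Verb: "watch"})) // true
	// ...but a request the factory couldn't parse has no verb, so the
	// WithRequestDeadline filter installs its default deadline instead.
	fmt.Println(isLongRunning(requestInfo{})) // false
}
```

This matches the observed behavior: the watch on the virtual-workspace path falls through the check and gets cut off at the 60-second deadline.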
**Describe the bug**

We've lately been seeing random test failures that look like this:

**Steps To Reproduce**

**Expected Behaviour**

No failure

**Additional Context**

No response