kcp-dev / kcp

Kubernetes-like control planes for form-factors and use-cases beyond Kubernetes and container workloads.
https://kcp.io
Apache License 2.0

bug: flake: workspace(s) not initialized in time #2479

Closed: ncdc closed this issue 7 months ago

ncdc commented 1 year ago

Describe the bug

We've lately been seeing random test failures that look like this:

=== FAIL: test/e2e/apibinding TestProtectedAPI (30.83s)
    kcp.go:127: shared kcp server will target configuration "/go/src/github.com/kcp-dev/kcp/.kcp/admin.kubeconfig"
    assertions.go:1691: Waiting for condition, but got: workspace phase is Initializing, not Ready
    workspaces.go:182: 
            Error Trace:    util.go:297
                                        workspaces.go:182
                                        apibinding_protected_test.go:50
            Error:          Condition never satisfied
            Test:           TestProtectedAPI
            Messages:       failed to wait for organization workspace e2e-org-xgk49 to become ready

Steps To Reproduce

  1. Run e2e
  2. Sometimes it fails like this

Expected Behaviour

No failure

Additional Context

No response

ncdc commented 1 year ago

Example: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kcp-dev_kcp/2475/pull-ci-kcp-dev-kcp-main-e2e-sharded/1602363246315048960

stevekuznetsov commented 1 year ago

We need to amend that framework.Eventually to spit out the conditions/reasons and/or add the ClusterWorkspace object to the artifacts, or this is not something that can be debugged post-hoc.
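
A rough sketch of the idea (not the actual fix), assuming framework.Eventually's condition func returns (bool, string); wsClient and wsName are hypothetical placeholders and a dynamic client stands in for the real typed clientset. The point is just that the returned reason should carry the phase and conditions so a timeout leaves something debuggable in the test output:

```go
package e2esketch

import (
	"context"
	"encoding/json"
	"fmt"
	"testing"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/dynamic"

	"github.com/kcp-dev/kcp/test/e2e/framework"
)

// waitForWorkspaceReady is a hypothetical helper: instead of only reporting
// "workspace phase is Initializing, not Ready", it serializes the workspace's
// conditions into the Eventually reason so a flake is debuggable post-hoc.
func waitForWorkspaceReady(ctx context.Context, t *testing.T, wsClient dynamic.ResourceInterface, wsName string) {
	framework.Eventually(t, func() (bool, string) {
		ws, err := wsClient.Get(ctx, wsName, metav1.GetOptions{})
		if err != nil {
			return false, fmt.Sprintf("error getting workspace %s: %v", wsName, err)
		}
		phase, _, _ := unstructured.NestedString(ws.Object, "status", "phase")
		if phase == "Ready" {
			return true, ""
		}
		// Dump the conditions so "Condition never satisfied" is no longer the only clue.
		conditions, _, _ := unstructured.NestedSlice(ws.Object, "status", "conditions")
		raw, _ := json.Marshal(conditions)
		return false, fmt.Sprintf("workspace phase is %q, not Ready; conditions: %s", phase, raw)
	}, wait.ForeverTestTimeout, 100*time.Millisecond, "failed to wait for workspace %s to become ready", wsName)
}
```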

ncdc commented 1 year ago

One specific (new?) reason this happens: https://github.com/kcp-dev/kcp/pull/2440#issuecomment-1371202704

ncdc commented 1 year ago

Saw this happen again in https://github.com/kcp-dev/kcp/actions/runs/3968949742/jobs/6802786983. Did some digging into the logs, and it looks like the kcp-workspacetypes-bootstrap-root:universal and kcp-apibinder-initializer reconcilers fight with each other for a bit over patching the LogicalCluster. Eventually the bootstrapper bit completes (removing its initializer).

Then the apibinder initializer tries multiple times to remove its initializer. I see this over and over:

I0120 15:45:04.846704   38354 committer.go:97] "patching *v1alpha1.LogicalCluster" reconciler="kcp-apibinder-initializer" key="ukwxjtfpgmf4w4bi|cluster" logicalcluster.workspace="ukwxjtfpgmf4w4bi" logicalcluster.namespace="" logicalcluster.name="cluster" logicalcluster.apiVersion="" patch="{\"metadata\":{\"resourceVersion\":\"3861\",\"uid\":\"9a8ffd73-d99b-4d16-b621-7a710a11e26d\"},\"status\":{\"initializers\":[\"root:universal\"]}}"
I0120 15:45:05.253906   38441 httplog.go:131] "HTTP" verb="PATCH" URI="/services/initializingworkspaces/system:apibindings/clusters/ukwxjtfpgmf4w4bi/apis/core.kcp.io/v1alpha1/logicalclusters/cluster/status" latency="404.302777ms" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-apibinder-initializer" audit-ID="87f7af9a-3718-4ff1-90d2-130ed357acee" srcIP="10.1.0.36:56398" resp=409 addedInfo=<

It's trying to patch at resourceVersion 3861 to remove the system:apibindings initializer, leaving only the root:universal initializer. Except root:universal already removed itself; that removal is what bumped the RV to 3861.
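
For illustration, here is a minimal standalone sketch of the shape of that patch. The wiring is hypothetical (a plain dynamic client and kubeconfig path, whereas the real controller goes through the initializingworkspaces virtual workspace), but it shows why the write conflicts: pinning metadata.resourceVersion and uid in a merge patch turns it into an optimistic-concurrency write, so a stale RV is rejected with 409 Conflict once the object has moved on:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

// buildInitializerPatch mirrors the shape of the patch in the log above:
// the remaining initializers plus a pinned resourceVersion/uid.
func buildInitializerPatch(rv, uid string, remaining []string) ([]byte, error) {
	return json.Marshal(map[string]any{
		"metadata": map[string]any{
			"resourceVersion": rv,
			"uid":             uid,
		},
		"status": map[string]any{
			"initializers": remaining,
		},
	})
}

func main() {
	// Hypothetical wiring; the real controller patches via the
	// initializingworkspaces virtual workspace, not a plain kubeconfig.
	cfg, err := clientcmd.BuildConfigFromFlags("", ".kcp/admin.kubeconfig")
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	gvr := schema.GroupVersionResource{Group: "core.kcp.io", Version: "v1alpha1", Resource: "logicalclusters"}
	patch, _ := buildInitializerPatch("3861", "9a8ffd73-d99b-4d16-b621-7a710a11e26d", []string{"root:universal"})

	// If the LogicalCluster has already moved past RV 3861 (e.g. because
	// root:universal removed itself), the API server answers 409 Conflict.
	_, err = client.Resource(gvr).Patch(context.TODO(), "cluster",
		types.MergePatchType, patch, metav1.PatchOptions{}, "status")
	fmt.Println("patch result:", err)
}
```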

This is the timing of the attempts from the apibinder controller to remove its initializer, each resulting in a conflict:

It looks like it has gone into backoff. It's not clear to me why the reconciler keeps trying to reconcile from RV 3861 instead of seeing a newer version in the cache.

The last entry (and another one that followed it) came after the test had failed, cleanup had occurred, and the LogicalCluster no longer existed. Note that it's still trying RV 3861. I'm also a little confused as to why it shows that both the root:universal and system:apibindings initializers still exist.

I0120 15:45:47.243815   38354 committer.go:97] "patching *v1alpha1.LogicalCluster" reconciler="kcp-apibinder-initializer" key="ukwxjtfpgmf4w4bi|cluster" logicalcluster.workspace="ukwxjtfpgmf4w4bi" logicalcluster.namespace="" logicalcluster.name="cluster" logicalcluster.apiVersion="" patch="{\"metadata\":{\"resourceVersion\":\"3861\",\"uid\":\"9a8ffd73-d99b-4d16-b621-7a710a11e26d\"},\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2023-01-20T15:45:47Z\",\"message\":\"encountered errors: an error on the server (\\\"error getting logicalcluster ukwxjtfpgmf4w4bi|cluster: LogicalCluster.core.kcp.io \\\\\\\"cluster\\\\\\\" not found\\\") has prevented the request from succeeding (post apibindings.apis.kcp.io)\",\"reason\":\"APIBindingErrors\",\"severity\":\"Error\",\"status\":\"False\",\"type\":\"APIBindingsInitialized\"},{\"lastTransitionTime\":\"2023-01-20T15:45:03Z\",\"message\":\"Initializers still exist: [root:universal system:apibindings]\",\"reason\":\"InitializerExists\",\"severity\":\"Info\",\"status\":\"False\",\"type\":\"WorkspaceInitialized\"}]}}"

The kcp-logicalcluster-deletion reconciler is able to see the newer RV (it gets a conflict on 3925 and then quickly retries with 3927 and succeeds):


I0120 15:45:37.385918   38354 committer.go:97] "patching *v1alpha1.LogicalCluster" component="kcp" reconciler="kcp-logicalcluster-deletion" key="ukwxjtfpgmf4w4bi|cluster" logicalcluster.workspace="ukwxjtfpgmf4w4bi" logicalcluster.namespace="" logicalcluster.name="cluster" logicalcluster.apiVersion="" patch="{\"metadata\":{\"resourceVersion\":\"3925\",\"uid\":\"9a8ffd73-d99b-4d16-b621-7a710a11e26d\"},\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2023-01-20T15:45:37Z\",\"message\":\"Some resources are remaining: apibindings.apis.kcp.io has 5 resource instances, namespaces. has 1 resource instances; Some content in the logical cluster has finalizers remaining: apis.kcp.io/apibinding-finalizer in 5 resource instances\",\"reason\":\"SomeResourcesRemain\",\"severity\":\"Info\",\"status\":\"False\",\"type\":\"WorkspaceContentDeleted\"},{\"lastTransitionTime\":\"2023-01-20T15:45:04Z\",\"message\":\"Initializers still exist: [system:apibindings]\",\"reason\":\"InitializerExists\",\"severity\":\"Info\",\"status\":\"False\",\"type\":\"WorkspaceInitialized\"}]}}"
I0120 15:45:37.407808   38354 httplog.go:131] "HTTP" verb="PATCH" URI="/clusters/ukwxjtfpgmf4w4bi/apis/core.kcp.io/v1alpha1/logicalclusters/cluster/status" latency="19.772867ms" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-logicalcluster-deletion" audit-ID="ff2e5460-2780-42ec-8bfd-1c4b9ad2c1e9" srcIP="[::1]:45884" resp=409
I0120 15:45:38.578864   38354 committer.go:97] "patching *v1alpha1.LogicalCluster" component="kcp" reconciler="kcp-logicalcluster-deletion" key="ukwxjtfpgmf4w4bi|cluster" logicalcluster.workspace="ukwxjtfpgmf4w4bi" logicalcluster.namespace="" logicalcluster.name="cluster" logicalcluster.apiVersion="" patch="{\"metadata\":{\"resourceVersion\":\"3927\",\"uid\":\"9a8ffd73-d99b-4d16-b621-7a710a11e26d\"},\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2023-01-20T15:45:38Z\",\"message\":\"Some resources are remaining: namespaces. has 1 resource instances\",\"reason\":\"SomeResourcesRemain\",\"severity\":\"Info\",\"status\":\"False\",\"type\":\"WorkspaceContentDeleted\"},{\"lastTransitionTime\":\"2023-01-20T15:45:04Z\",\"message\":\"Initializers still exist: [system:apibindings]\",\"reason\":\"InitializerExists\",\"severity\":\"Info\",\"status\":\"False\",\"type\":\"WorkspaceInitialized\"}]}}"
I0120 15:45:38.604307   38354 httplog.go:131] "HTTP" verb="PATCH" URI="/clusters/ukwxjtfpgmf4w4bi/apis/core.kcp.io/v1alpha1/logicalclusters/cluster/status" latency="16.559642ms" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-logicalcluster-deletion" audit-ID="72c39d0b-2609-43a4-ad30-be7154b5cf1d" srcIP="[::1]:45884" resp=200
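
The deletion reconciler's quick recovery (conflict at 3925, retry and success at 3927) is what you'd expect from the usual client-go retry-on-conflict pattern, where each attempt re-reads the object from the API server instead of reusing a cached copy. A minimal sketch of that pattern (not kcp's actual committer helper; kubeconfig path and object names are placeholders):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/retry"
)

func main() {
	// Hypothetical wiring, as in the earlier sketch.
	cfg, err := clientcmd.BuildConfigFromFlags("", ".kcp/admin.kubeconfig")
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)
	gvr := schema.GroupVersionResource{Group: "core.kcp.io", Version: "v1alpha1", Resource: "logicalclusters"}

	// retry.RetryOnConflict retries only on 409s; because each attempt does a
	// live GET, the next patch carries the current resourceVersion rather than
	// whatever a stale informer cache last saw.
	err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
		current, err := client.Resource(gvr).Get(context.TODO(), "cluster", metav1.GetOptions{})
		if err != nil {
			return err
		}
		patch := fmt.Sprintf(
			`{"metadata":{"resourceVersion":%q},"status":{"initializers":[]}}`,
			current.GetResourceVersion())
		_, err = client.Resource(gvr).Patch(context.TODO(), "cluster",
			types.MergePatchType, []byte(patch), metav1.PatchOptions{}, "status")
		return err
	})
	fmt.Println("final result:", err)
}
```
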
ncdc commented 1 year ago

Is there any chance that the forwarding storage (pkg/virtual/framework/forwardingregistry) is causing this staleness?

ncdc commented 1 year ago

Seeing stuff like this:

I0120 15:45:04.086465   38354 panic.go:884] "HTTP" verb="WATCH" URI="/services/initializingworkspaces/system:apibindings/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=816&timeout=8m8s&timeoutSeconds=488&watch=true" latency="1m0.009056546s" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-apibinder-initializer" audit-ID="b0e0a207-e7a4-4ac7-a4d6-2ed117f57688" srcIP="10.1.0.36:37458" resp=200
E0120 15:45:04.086556   38354 wrap.go:53] timeout or abort while handling: method=GET URI="/services/initializingworkspaces/system:apibindings/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=816&timeout=8m8s&timeoutSeconds=488&watch=true" audit-ID="b0e0a207-e7a4-4ac7-a4d6-2ed117f57688"
2023/01/20 15:45:04 httputil: ReverseProxy read error during body copy: context deadline exceeded
E0120 15:45:04.087098   38354 timeout.go:141] post-timeout activity - time-elapsed: 2.3µs, GET "/services/initializingworkspaces/system:apibindings/clusters/*/apis/core.kcp.io/v1alpha1/logicalclusters" result: net/http: abort Handler
ncdc commented 1 year ago

Here's a filtered set of logs showing reflector activity plus httplog for list/watch of logicalclusters for the time period in question:

I0120 15:44:56.984445   38457 reflector.go:536] k8s.io/client-go@v0.0.0-20230113194502-8f8fb5fa4eea/tools/cache/reflector.go:167: Watch close - *v1alpha1.LogicalCluster total 8 items received
I0120 15:44:56.984169   38420 httplog.go:131] "HTTP" verb="WATCH" URI="/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=254&timeout=7m53s&timeoutSeconds=473&watch=true" latency="7m53.001359805s" userAgent="virtual-workspaces/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5" audit-ID="64c30d7b-3cd6-44f8-9e5f-7766ff3676c8" srcIP="[::1]:43984" resp=200
I0120 15:45:03.852115   38420 reflector.go:255] Listing and watching *v1alpha1.LogicalCluster from k8s.io/client-go@v0.0.0-20230113194502-8f8fb5fa4eea/tools/cache/reflector.go:167
I0120 15:45:03.906024   38420 httplog.go:131] "HTTP" verb="LIST" URI="/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?labelSelector=initializer.internal.kcp.io%2Fdd58c810629ccc3e49c7d8225b9bcfaa42a%3Ddd58c810629ccc3e49c7d8225b9bcfaa42a4a2ecbb7e4a9be76614e3%2Ctenancy.kcp.io%2Fphase%3DInitializing&resourceVersion=279" latency="610.324µs" userAgent="virtual-workspaces/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/initializingworkspaces-virtual-workspace" audit-ID="c9dad497-7386-44fe-9770-5d3f3c9eae61" srcIP="[::1]:43984" resp=200
I0120 15:45:03.907749   38457 httplog.go:131] "HTTP" verb="LIST" URI="/services/initializingworkspaces/system:apibindings/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?resourceVersion=279" latency="51.544429ms" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-apibinder-initializer" audit-ID="bc920e77-6217-438f-bd06-d8e204819a37" srcIP="10.1.0.36:50728" resp=200 addedInfo=<

        &{kcp-shard-1  [system:masters system:authenticated] map[]} is acting as &{system:apiserver f88b80c1-0901-4cdf-bff5-0dffa3fd694c [system:masters system:authenticated] map[]}
 >
I0120 15:45:04.086922   38441 httplog.go:131] "HTTP" verb="WATCH" URI="/services/initializingworkspaces/system:apibindings/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=816&timeout=8m8s&timeoutSeconds=488&watch=true" latency="1m0.009058245s" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-apibinder-initializer" audit-ID="716a097c-9b87-4c8d-8e38-427b91d990b8" srcIP="10.1.0.36:56398" resp=200 addedInfo=<

        &{kcp-shard-0  [system:masters system:authenticated] map[]} is acting as &{system:apiserver 17fe5dc1-5ca5-45d3-be87-f20346c02209 [system:masters system:authenticated] map[]}
 >
W0120 15:45:04.105390   38354 reflector.go:442] k8s.io/client-go@v0.0.0-20230113194502-8f8fb5fa4eea/tools/cache/reflector.go:167: watch of *v1alpha1.LogicalCluster ended with: an error on the server ("unable to decode an event from the watch stream: stream error: stream ID 147; INTERNAL_ERROR; received from peer") has prevented the request from succeeding
I0120 15:45:04.117330   38354 httplog.go:131] "HTTP" verb="WATCH" URI="/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&labelSelector=initializer.internal.kcp.io%2Fdd58c810629ccc3e49c7d8225b9bcfaa42a%3Ddd58c810629ccc3e49c7d8225b9bcfaa42a4a2ecbb7e4a9be76614e3%2Ctenancy.kcp.io%2Fphase%3DInitializing&resourceVersion=816&timeoutSeconds=488&watch=true" latency="1m0.026760041s" userAgent="virtual-workspaces/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/initializingworkspaces-virtual-workspace" audit-ID="7b2f45c6-2231-4bf3-94de-6adc3c0bdd66" srcIP="[::1]:57604" resp=200
I0120 15:45:27.691653   38674 reflector.go:536] k8s.io/client-go@v0.0.0-20230113194502-8f8fb5fa4eea/tools/cache/reflector.go:167: Watch close - *v1alpha1.LogicalCluster total 6 items received
I0120 15:45:27.691300   38420 httplog.go:131] "HTTP" verb="WATCH" URI="/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=278&timeout=8m18s&timeoutSeconds=498&watch=true" latency="8m18.001507714s" userAgent="kcp-front-proxy/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5" audit-ID="2bd6fd88-6b73-49eb-9cbd-c354550e2eea" srcIP="10.1.0.36:53304" resp=200
I0120 15:45:36.672804   38354 reflector.go:255] Listing and watching *v1alpha1.LogicalCluster from k8s.io/client-go@v0.0.0-20230113194502-8f8fb5fa4eea/tools/cache/reflector.go:167
I0120 15:45:36.803375   38354 httplog.go:131] "HTTP" verb="LIST" URI="/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?labelSelector=initializer.internal.kcp.io%2Fdd58c810629ccc3e49c7d8225b9bcfaa42a%3Ddd58c810629ccc3e49c7d8225b9bcfaa42a4a2ecbb7e4a9be76614e3%2Ctenancy.kcp.io%2Fphase%3DInitializing&resourceVersion=3861" latency="520.92µs" userAgent="virtual-workspaces/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/initializingworkspaces-virtual-workspace" audit-ID="4bae6306-9f90-4a28-b9d3-2a7c70391de2" srcIP="[::1]:57604" resp=200
I0120 15:45:36.811418   38441 httplog.go:131] "HTTP" verb="LIST" URI="/services/initializingworkspaces/system:apibindings/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?resourceVersion=3861" latency="109.449805ms" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-apibinder-initializer" audit-ID="ef7cd986-08ab-4b5e-8d0b-864e3b26364e" srcIP="10.1.0.36:56398" resp=200 addedInfo=<

        &{kcp-shard-0  [system:masters system:authenticated] map[]} is acting as &{system:apiserver 17fe5dc1-5ca5-45d3-be87-f20346c02209 [system:masters system:authenticated] map[]}
 >
I0120 15:45:51.694521   38674 reflector.go:536] k8s.io/client-go@v0.0.0-20230113194502-8f8fb5fa4eea/tools/cache/reflector.go:167: Watch close - *v1alpha1.LogicalCluster total 699 items received
I0120 15:45:51.694323   38354 httplog.go:131] "HTTP" verb="WATCH" URI="/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=494&timeout=8m42s&timeoutSeconds=522&watch=true" latency="8m42.000832803s" userAgent="kcp-front-proxy/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5" audit-ID="4831d362-1427-4954-acb7-de8ceebdd06a" srcIP="10.1.0.36:36384" resp=200
I0120 15:45:59.483599   38354 httplog.go:131] "HTTP" verb="WATCH" URI="/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=412&timeoutSeconds=544&watch=true" latency="9m4.000518017s" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-partial-metadata-informers" audit-ID="ef068ff6-9ed1-449f-b7b0-a120c9a5bb60" srcIP="[::1]:45884" resp=200
I0120 15:46:03.911728   38457 httplog.go:131] "HTTP" verb="WATCH" URI="/services/initializingworkspaces/system:apibindings/clusters/%2A/apis/core.kcp.io/v1alpha1/logicalclusters?allowWatchBookmarks=true&resourceVersion=279&timeout=9m7s&timeoutSeconds=547&watch=true" latency="1m0.002806565s" userAgent="kcp/v1.24.3+kcp (linux/amd64) kubernetes/1f918f5/kcp-apibinder-initializer" audit-ID="a32a850e-13e2-47f9-8707-de168d253fcb" srcIP="10.1.0.36:50728" resp=200 addedInfo=<
ncdc commented 1 year ago

Can repro the 1-minute timeout easily locally with cmd/sharded-test-server. Looking into why some reverse proxy somewhere is terminating watches after 1 minute.

ncdc commented 1 year ago

Fun fact: if you use http.DefaultTransport and the request's Context doesn't have a Done() channel assigned (it's not cancelable), the code assigns a 1-minute timeout to avoid leaking goroutines...

https://github.com/kcp-dev/kcp/blob/ab2757a1b3521f1a4eee4abeb4343c5c10942431/pkg/server/config.go#L353

ncdc commented 1 year ago

Hmm, that's not it. Still digging.

ncdc commented 1 year ago

Aha, so:

  1. kcp receives a watch request for /services/initializingworkspaces/...
  2. The k8s RequestInfoFactory doesn't know how to parse this URL, so it returns a RequestInfo that only has these fields filled in: IsResourceRequest (false), Path, and Verb (the lowercased HTTP method, i.e. get rather than watch).

Because of this, the WithRequestDeadline portion of the handler chain doesn't see this as a long-running request, and installs a 60-second deadline.
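
For illustration, here's a small standalone program (using the upstream apiserver libraries, not kcp's actual handler-chain wiring, and with the default long-running verb/subresource sets assumed) showing why the stock parsing and long-running check behave this way for the virtual workspace URL:

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"

	"k8s.io/apimachinery/pkg/util/sets"
	"k8s.io/apiserver/pkg/endpoints/request"
	"k8s.io/apiserver/pkg/server/filters"
)

func main() {
	// The stock RequestInfoFactory only understands the /api and /apis prefixes.
	factory := &request.RequestInfoFactory{
		APIPrefixes:          sets.NewString("api", "apis"),
		GrouplessAPIPrefixes: sets.NewString("api"),
	}

	// A watch request against the initializingworkspaces virtual workspace.
	req := &http.Request{
		Method: http.MethodGet,
		URL: &url.URL{
			Path:     "/services/initializingworkspaces/system:apibindings/clusters/*/apis/core.kcp.io/v1alpha1/logicalclusters",
			RawQuery: "watch=true",
		},
	}

	info, err := factory.NewRequestInfo(req)
	if err != nil {
		panic(err)
	}
	// The prefix is unrecognized, so this is treated as a non-resource request:
	// IsResourceRequest=false and Verb is just the lowercased HTTP method ("get").
	fmt.Printf("isResourceRequest=%v verb=%q path=%q\n", info.IsResourceRequest, info.Verb, info.Path)

	// The default long-running check keys off Verb ("watch"/"proxy") or a
	// long-running subresource, so it returns false here, and the request
	// gets the default (60s) deadline from WithRequestDeadline.
	longRunning := filters.BasicLongRunningRequestCheck(
		sets.NewString("watch", "proxy"),
		sets.NewString("attach", "exec", "proxy", "log", "portforward"),
	)
	fmt.Println("long-running:", longRunning(req, info))
}
```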