kcp-dev / kcp

Kubernetes-like control planes for form-factors and use-cases beyond Kubernetes and container workloads.
https://kcp.io
Apache License 2.0
2.32k stars 375 forks source link

bug: flake: TestSyncerLifecycle #2278

Closed stevekuznetsov closed 1 year ago

stevekuznetsov commented 1 year ago

Describe the bug

I1027 17:33:59.650858   30405 spec_controller.go:267] "processing key" syncTarget.workspace="root:e2e-org-zfx2r:e2e-workspace-549q2" syncTarget.name="psyncer-01" syncTarget.key="7qFQgWTy6tlERvRELtSVSH1TZp0wWhDAhtd9gW" reconciler="kcp-workload-syncer-spec" key="root:e2e-org-zfx2r:e2e-workspace-549q2|test-syncer/syncer-test" gvr="apps/v1, Resource=deployments"
E1027 17:33:59.668414   30405 spec_process.go:483] "Error upserting upstream resource to downstream" err="Deployment.apps \"syncer-test\" is invalid: [spec.selector: Invalid value: \"null\": spec.selector in body must be of type object: \"null\", spec.template.spec.containers: Invalid value: \"null\": spec.template.spec.containers in body must be of type array: \"null\"]" syncTarget.workspace="root:e2e-org-zfx2r:e2e-workspace-549q2" syncTarget.name="psyncer-01" syncTarget.key="7qFQgWTy6tlERvRELtSVSH1TZp0wWhDAhtd9gW" reconciler="kcp-workload-syncer-spec" key="root:e2e-org-zfx2r:e2e-workspace-549q2|test-syncer/syncer-test" gvr="apps/v1, Resource=deployments" downstream.name="syncer-test"
E1027 17:33:59.668532   30405 spec_controller.go:274] kcp-workload-syncer-spec failed to sync {{"apps" "v1" "deployments"} "root:e2e-org-zfx2r:e2e-workspace-549q2|test-syncer/syncer-test"}, err: Deployment.apps "syncer-test" is invalid: [spec.selector: Invalid value: "null": spec.selector in body must be of type object: "null", spec.template.spec.containers: Invalid value: "null": spec.template.spec.containers in body must be of type array: "null"]
I1027 17:33:59.730760   30405 controller.go:162] "queueing SyncTarget" syncTarget.workspace="root:e2e-org-vzv8j:e2e-workspace-94zk5" syncTarget.name="psyncer-01" syncTarget.key="bPAk5kxcvoyZ4pglMchwP4fGjuS8v3APUhKUYJ" reconciler="kcp-syncer-resourcesync-controller" key="root:e2e-org-vzv8j:e2e-workspace-94zk5|psyncer-01"
I1027 17:33:59.731011   30405 controller.go:217] "processing key" syncTarget.workspace="root:e2e-org-vzv8j:e2e-workspace-94zk5" syncTarget.name="psyncer-01" syncTarget.key="bPAk5kxcvoyZ4pglMchwP4fGjuS8v3APUhKUYJ" reconciler="kcp-syncer-resourcesync-controller" key="root:e2e-org-vzv8j:e2e-workspace-94zk5|psyncer-01"
I1027 17:33:59.734181   30405 controller.go:392] "Informer is started already" syncTarget.workspace="root:e2e-org-vzv8j:e2e-workspace-94zk5" syncTarget.name="psyncer-01" syncTarget.key="bPAk5kxcvoyZ4pglMchwP4fGjuS8v3APUhKUYJ" reconciler="kcp-syncer-resourcesync-controller" key="root:e2e-org-vzv8j:e2e-workspace-94zk5|psyncer-01" gvr="/v1, Resource=configmaps"
=== CONT  TestSyncerLifecycle
    syncer_test.go:204: 
            Error Trace:    syncer_test.go:204
            Error:          Condition never satisfied
            Test:           TestSyncerLifecycle
            Messages:       downstream deployment kcp-1ruuu6pavo99/syncer-test was not synced

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kcp-dev_kcp/2275/pull-ci-kcp-dev-kcp-main-e2e-shared/1585683386830163968

Steps To Reproduce

Run the CI :)

Expected Behaviour

No failure

Additional Context

Likely similar to https://github.com/kcp-dev/contrib-tmc/issues/104

stevekuznetsov commented 1 year ago

/cc @davidfestal /assign @jmprusi

stevekuznetsov commented 1 year ago

Another: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kcp-dev_kcp/2275/pull-ci-kcp-dev-kcp-main-e2e-shared/1585690577364586496

stevekuznetsov commented 1 year ago

@ncdc we see this in the audit logs:

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "RequestResponse",
  "auditID": "e4e88650-404e-49cb-9c4a-e622aae5a0ac",
  "stage": "ResponseComplete",
  "requestURI": "/apis/apps/v1/namespaces/test-syncer/deployments",
  "verb": "create", 
  "userAgent": "syncer.test/v0.0.0 (linux/amd64) kubernetes/$Format",
  "objectRef": {
    "resource": "deployments",
    "namespace": "test-syncer",
    "name": "syncer-test",
    "apiGroup": "apps",
    "apiVersion": "v1"
  },
  "responseStatus": {
    "metadata": {},
    "code": 201
  },
  "requestObject": {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {
      "creationTimestamp": null,
      "name": "syncer-test"
    }
  },
  "responseObject": {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {
      "annotations": {
        "kcp.dev/cluster": "root:e2e-org-9ps42:e2e-workspace-88h9q"
      },
      "creationTimestamp": "2022-10-27T23:29:49Z",
      "generation": 1,
      "name": "syncer-test",
      "namespace": "test-syncer",
      "resourceVersion": "13791",
      "uid": "807b2835-4f7d-49cd-83e9-afc4bd110e8b"
    }
  },
}

This is leading us to believe that the *rest.Storage created for this CRD in customresource_handler.go at least temporarily is missing structural schemas or something. We're almost certain that the client is actually passing spec & status, the server is just ignoring them. We've added a logging PR kcp-dev/kcp#2279 to confirm that.

jmprusi commented 1 year ago

After merging: https://github.com/kcp-dev/kcp/pull/2279

Seems that the client is sending the correct information, and something is removing the spec:

=== CONT  TestSyncerLifecycle
    syncer_test.go:186: difference between what we sent and what we got:   &v1.Deployment{
            TypeMeta: v1.TypeMeta{
        -       Kind:       "Deployment",
        +       Kind:       "",
        -       APIVersion: "apps/v1",
        +       APIVersion: "",
            },
            ObjectMeta: v1.ObjectMeta{
                Name:                       "syncer-test",
                GenerateName:               "",
        -       Namespace:                  "",
        +       Namespace:                  "test-syncer",
                SelfLink:                   "",
        -       UID:                        "",
        +       UID:                        "b9a85a25-99fb-4f62-8802-1558d970abc1",
        -       ResourceVersion:            "",
        +       ResourceVersion:            "12586",
        -       Generation:                 0,
        +       Generation:                 1,
        -       CreationTimestamp:          v1.Time{},
        +       CreationTimestamp:          v1.Time{Time: s"2022-10-28 13:39:37 +0000 UTC"},
                DeletionTimestamp:          nil,
                DeletionGracePeriodSeconds: nil,
                Labels:                     nil,
        -       Annotations:                nil,
        +       Annotations:                map[string]string{"kcp.dev/cluster": "root:e2e-org-mt5c4:e2e-workspace-rxljt"},
                OwnerReferences:            nil,
                Finalizers:                 nil,
                ... // 2 identical fields
            },
            Spec: v1.DeploymentSpec{
        -       Replicas: &1,
        +       Replicas: nil,
        -       Selector: s"&LabelSelector{MatchLabels:map[string]string{app: nginx,},MatchExpressions:[]LabelSelectorRequirement{},}",
        +       Selector: nil,
                Template: v1.PodTemplateSpec{
                    ObjectMeta: v1.ObjectMeta{
                        ... // 8 identical fields
                        DeletionTimestamp:          nil,
                        DeletionGracePeriodSeconds: nil,
        -               Labels:                     map[string]string{"app": "nginx"},
        +               Labels:                     nil,
                        Annotations:                nil,
                        OwnerReferences:            nil,
                        ... // 3 identical fields
                    },
                    Spec: v1.PodSpec{
                        Volumes:        nil,
                        InitContainers: nil,
        -               Containers: []v1.Container{
        -                   {
        -                       Name:    "busybox",
        -                       Image:   "ghcr.io/distroless/busybox:latest",
        -                       Command: []string{"/bin/sh", "-ec", "echo \"Going to sleep\"\ntail -f /d"...},
        -                   },
        -               },
        +               Containers:          nil,
                        EphemeralContainers: nil,
                        RestartPolicy:       "",
                        ... // 31 identical fields
                    },
                },
                Strategy:        {},
                MinReadySeconds: 0,
                ... // 3 identical fields
            },
            Status: {},
          }
    syncer_test.go:190: Waiting for upstream deployment test-syncer/syncer-test to get the syncer finalizer

flaky test run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kcp-dev_kcp/2180/pull-ci-kcp-dev-kcp-main-e2e-shared/1585987621723574272

ncdc commented 1 year ago

Suggest checking pruning mechanics maybe?

ncdc commented 1 year ago

Parallelization is maybe an issue (load on server)

ncdc commented 1 year ago

Maybe server side apply

ncdc commented 1 year ago

@stevekuznetsov are you still looking into this?

stevekuznetsov commented 1 year ago

I wanted you and I to dig into this - just need debugging in the CRD handlers to figure out the race.

lionelvillard commented 1 year ago

happened here: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/kcp-dev_kcp/2420/pull-ci-kcp-dev-kcp-main-e2e-shared/1598656084334088192/build-log.txt

ncdc commented 1 year ago

Fixed by kcp-dev/kcp#2505