kudobuilder / kudo

Kubernetes Universal Declarative Operator (KUDO)
https://kudo.dev
Apache License 2.0

upgrade-test is flaky in CI, pods are not scheduled in the faulty runs #1783

Open asekretenko opened 3 years ago

asekretenko commented 3 years ago

What happened: Numerous failures in CI accompanied by errors like

    case.go:230: failed in step 0-install-cert-manager-0-16-0
    case.go:232: --- Deployment:cert-manager/cert-manager-webhook
        +++ Deployment:cert-manager/cert-manager-webhook
        ....
        -  readyReplicas: 1
        +  conditions:
        +  - lastTransitionTime: "2021-04-06T18:33:50Z"
        +    lastUpdateTime: "2021-04-06T18:33:50Z"
        +    message: Deployment does not have minimum availability.
        +    reason: MinimumReplicasUnavailable
        +    status: "False"
        +    type: Available
        +  - lastTransitionTime: "2021-04-06T18:33:49Z"
        +    lastUpdateTime: "2021-04-06T18:33:50Z"
        +    message: ReplicaSet "cert-manager-webhook-86c4dcd4b5" is progressing.
        +    reason: ReplicaSetUpdated
        +    status: "True"
        +    type: Progressing
        +  observedGeneration: 1
        +  replicas: 1
        +  unavailableReplicas: 1
        +  updatedReplicas: 1
...
logger.go:42: 18:44:16 | upgrade-to-current/1-install-operator | test step failed 1-install-operator
    case.go:230: failed in step 1-install-operator
    case.go:232: --- Instance:kuttl-test-still-wildcat/simple-op
        +++ Instance:kuttl-test-still-wildcat/simple-op
        ...
                -status:
        -  planStatus:
        -    deploy:
        -      status: COMPLETE
        +  planExecution: {}

https://app.circleci.com/pipelines/github/kudobuilder/kudo/5632/workflows/19c70868-710c-4fda-8d2e-34e04f9cfd8b/jobs/17186
https://app.circleci.com/pipelines/github/kudobuilder/kudo/5610/workflows/8f2031a0-9b99-4c5b-813b-2afbd775210c/jobs/17084
https://app.circleci.com/pipelines/github/kudobuilder/kudo/5625/workflows/47210bae-fe4a-4118-96de-cf8f6d93c11b/jobs/17151
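To confirm that a faulty run really has unscheduled pods (rather than pods that are scheduled but slow to start), something like the following could be run against the output of `kubectl get pods --all-namespaces -o json`. This is only a minimal sketch: the function name `unscheduled_pods` is hypothetical, and it assumes that an unscheduled pod shows up with `status.phase` equal to `Pending` and no `spec.nodeName`.

```python
import json


def unscheduled_pods(pod_list_json):
    """Return "namespace/name" for pods that are Pending and have no node assigned.

    pod_list_json is assumed to be the text printed by
    `kubectl get pods --all-namespaces -o json`.
    """
    pods = json.loads(pod_list_json).get("items", [])
    return [
        f'{p["metadata"]["namespace"]}/{p["metadata"]["name"]}'
        for p in pods
        # Pending alone is not enough: a pod pulling images is also Pending,
        # but it already has spec.nodeName set by the scheduler.
        if p.get("status", {}).get("phase") == "Pending"
        and not p.get("spec", {}).get("nodeName")
    ]
```

A non-empty result in a failed run would point at the scheduler rather than at image pulls or readiness probes.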

The scheduler log looks like this (note: this is the COMPLETE log; nothing happens after the lease is acquired):

2021-04-06T18:33:11.130542759Z stderr F I0406 18:33:11.117937       1 registry.go:173] Registering SelectorSpread plugin
2021-04-06T18:33:11.130575993Z stderr F I0406 18:33:11.118024       1 registry.go:173] Registering SelectorSpread plugin
2021-04-06T18:33:14.01837032Z stderr F I0406 18:33:14.018236       1 serving.go:331] Generated self-signed cert in-memory
2021-04-06T18:33:19.332429214Z stderr F W0406 18:33:19.332329       1 requestheader_controller.go:193] Unable to get configmap/extension-apiserver-authentication in kube-system.  Usually fixed by 'kubectl create rolebinding -n kube-system ROLEBINDING_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
2021-04-06T18:33:19.332483729Z stderr F W0406 18:33:19.332430       1 authentication.go:294] Error looking up in-cluster authentication configuration: configmaps "extension-apiserver-authentication" is forbidden: User "system:kube-scheduler" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
2021-04-06T18:33:19.332498453Z stderr F W0406 18:33:19.332465       1 authentication.go:295] Continuing without authentication configuration. This may treat all requests as anonymous.
2021-04-06T18:33:19.332559995Z stderr F W0406 18:33:19.332521       1 authentication.go:296] To require authentication configuration lookup to succeed, set --authentication-tolerate-lookup-failure=false
2021-04-06T18:33:19.362443671Z stderr F I0406 18:33:19.362356       1 registry.go:173] Registering SelectorSpread plugin
2021-04-06T18:33:19.368980885Z stderr F I0406 18:33:19.368179       1 registry.go:173] Registering SelectorSpread plugin
2021-04-06T18:33:19.37460045Z stderr F I0406 18:33:19.374512       1 secure_serving.go:197] Serving securely on 127.0.0.1:10259
2021-04-06T18:33:19.393569958Z stderr F E0406 18:33:19.393435       1 reflector.go:127] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:188: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:kube-scheduler" cannot list resource "pods" in API group "" at the cluster scope
2021-04-06T18:33:19.397579158Z stderr F I0406 18:33:19.397500       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file
2021-04-06T18:33:19.397717655Z stderr F I0406 18:33:19.397649       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
2021-04-06T18:33:19.397868057Z stderr F I0406 18:33:19.397801       1 tlsconfig.go:240] Starting DynamicServingCertificateController
2021-04-06T18:33:19.398917934Z stderr F E0406 18:33:19.398832       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.PersistentVolumeClaim: failed to list *v1.PersistentVolumeClaim: persistentvolumeclaims is forbidden: User "system:kube-scheduler" cannot list resource "persistentvolumeclaims" in API group "" at the cluster scope
2021-04-06T18:33:19.399300139Z stderr F E0406 18:33:19.399216       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.StatefulSet: failed to list *v1.StatefulSet: statefulsets.apps is forbidden: User "system:kube-scheduler" cannot list resource "statefulsets" in API group "apps" at the cluster scope
2021-04-06T18:33:19.40767858Z stderr F E0406 18:33:19.407575       1 reflector.go:127] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps "extension-apiserver-authentication" is forbidden: User "system:kube-scheduler" cannot list resource "configmaps" in API group "" in the namespace "kube-system"
2021-04-06T18:33:19.408205283Z stderr F E0406 18:33:19.408130       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.ReplicationController: failed to list *v1.ReplicationController: replicationcontrollers is forbidden: User "system:kube-scheduler" cannot list resource "replicationcontrollers" in API group "" at the cluster scope
2021-04-06T18:33:19.408459296Z stderr F E0406 18:33:19.408389       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.PersistentVolume: failed to list *v1.PersistentVolume: persistentvolumes is forbidden: User "system:kube-scheduler" cannot list resource "persistentvolumes" in API group "" at the cluster scope
2021-04-06T18:33:19.408713732Z stderr F E0406 18:33:19.408647       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1beta1.PodDisruptionBudget: failed to list *v1beta1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:kube-scheduler" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
2021-04-06T18:33:19.409091453Z stderr F E0406 18:33:19.409023       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.ReplicaSet: failed to list *v1.ReplicaSet: replicasets.apps is forbidden: User "system:kube-scheduler" cannot list resource "replicasets" in API group "apps" at the cluster scope
2021-04-06T18:33:19.409348483Z stderr F E0406 18:33:19.409274       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: failed to list *v1.Service: services is forbidden: User "system:kube-scheduler" cannot list resource "services" in API group "" at the cluster scope
2021-04-06T18:33:19.409632135Z stderr F E0406 18:33:19.409573       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.StorageClass: failed to list *v1.StorageClass: storageclasses.storage.k8s.io is forbidden: User "system:kube-scheduler" cannot list resource "storageclasses" in API group "storage.k8s.io" at the cluster scope
2021-04-06T18:33:19.409869723Z stderr F E0406 18:33:19.409801       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User "system:kube-scheduler" cannot list resource "nodes" in API group "" at the cluster scope
2021-04-06T18:33:19.410109209Z stderr F E0406 18:33:19.410052       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.CSINode: failed to list *v1.CSINode: csinodes.storage.k8s.io is forbidden: User "system:kube-scheduler" cannot list resource "csinodes" in API group "storage.k8s.io" at the cluster scope
2021-04-06T18:33:19.410244792Z stderr F E0406 18:33:19.410196       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:kube-scheduler" cannot list resource "pods" in API group "" at the cluster scope
2021-04-06T18:33:20.217911583Z stderr F E0406 18:33:20.217763       1 reflector.go:127] k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go:206: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps "extension-apiserver-authentication" is forbidden: User "system:kube-scheduler" cannot list resource "configmaps" in API group "" in the namespace "kube-system"
2021-04-06T18:33:20.33541209Z stderr F E0406 18:33:20.335273       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User "system:kube-scheduler" cannot list resource "nodes" in API group "" at the cluster scope
2021-04-06T18:33:20.426911957Z stderr F E0406 18:33:20.426712       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.ReplicationController: failed to list *v1.ReplicationController: replicationcontrollers is forbidden: User "system:kube-scheduler" cannot list resource "replicationcontrollers" in API group "" at the cluster scope
2021-04-06T18:33:20.492015045Z stderr F E0406 18:33:20.491886       1 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.StorageClass: failed to list *v1.StorageClass: storageclasses.storage.k8s.io is forbidden: User "system:kube-scheduler" cannot list resource "storageclasses" in API group "storage.k8s.io" at the cluster scope
2021-04-06T18:33:22.59812414Z stderr F I0406 18:33:22.597963       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file 
2021-04-06T18:33:23.275812829Z stderr F I0406 18:33:23.275299       1 leaderelection.go:243] attempting to acquire leader lease  kube-system/kube-scheduler...
2021-04-06T18:33:23.300250453Z stderr F I0406 18:33:23.300142       1 leaderelection.go:253] successfully acquired lease kube-system/kube-scheduler
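To compare faulty and healthy runs, a log like the one above could be reduced to a list of denied resources plus a count of lines after the lease acquisition. The sketch below is an assumption about how one might do that, not part of the test suite; the names `FORBIDDEN`, `LEASE_ACQUIRED` and `summarize` are hypothetical, and the regular expression is written against the message format seen in this particular log.

```python
import re
from collections import Counter

# Matches the RBAC denial text seen in the reflector errors, e.g.
#   User "system:kube-scheduler" cannot list resource "pods" in API group "" ...
FORBIDDEN = re.compile(
    r'User "(?P<user>[^"]+)" cannot (?P<verb>\w+) resource "(?P<resource>[^"]+)" '
    r'in API group "(?P<group>[^"]*)"'
)
LEASE_ACQUIRED = "successfully acquired lease"


def summarize(log_text):
    """Count denied (verb, resource, group) triples and report how many
    log lines follow the last lease acquisition (None if there is none)."""
    denied = Counter()
    lines = log_text.splitlines()
    lease_index = None
    for i, line in enumerate(lines):
        m = FORBIDDEN.search(line)
        if m:
            denied[(m["verb"], m["resource"], m["group"])] += 1
        if LEASE_ACQUIRED in line:
            lease_index = i
    lines_after_lease = None if lease_index is None else len(lines) - lease_index - 1
    return denied, lines_after_lease
```

For the log in this report, `lines_after_lease` would be 0, which matches the observation that the scheduler goes silent once it holds the lease.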

What you expected to happen: The scheduler should always authenticate correctly, set up its watches, and schedule pods, so that the tests do not fail.
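One way to check the authorization half of this expectation on a live cluster might be `kubectl auth can-i` with impersonation. The sketch below only builds the command lines for the resources that were denied in the scheduler log; `SCHEDULER_RESOURCES` and `can_i_command` are hypothetical names, and running the commands assumes that the caller is allowed to impersonate `system:kube-scheduler`.

```python
# Resources the scheduler failed to list in the log, as (resource, API group).
# The core API group is the empty string.
SCHEDULER_RESOURCES = [
    ("pods", ""), ("nodes", ""), ("services", ""),
    ("persistentvolumes", ""), ("persistentvolumeclaims", ""),
    ("replicationcontrollers", ""),
    ("replicasets", "apps"), ("statefulsets", "apps"),
    ("poddisruptionbudgets", "policy"),
    ("storageclasses", "storage.k8s.io"), ("csinodes", "storage.k8s.io"),
]


def can_i_command(resource, group, verb="list", user="system:kube-scheduler"):
    """Build the argv for `kubectl auth can-i`, impersonating the given user."""
    # kubectl accepts grouped resources in the form resource.group.
    target = f"{resource}.{group}" if group else resource
    return ["kubectl", "auth", "can-i", verb, target, "--as", user, "--all-namespaces"]
```

Each command could then be passed to `subprocess.run`. If every answer is `yes` shortly after the failure, that would suggest the denials are a startup race (RBAC bootstrap roles not yet in place when the scheduler first lists) rather than a permanently missing binding, but this is a guess that would need to be verified.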

How to reproduce it (as minimally and precisely as possible): No idea yet. I would give a good breakfast to know that.

Anything else we need to know?: I would not be surprised if this is in fact a more general kuttl bug (but why do other test suites not suffer then? or maybe they also do?) or an even more general cluster bootstrapping bug.

There is another flake in upgrade-tests (https://github.com/kudobuilder/kudo/issues/1736); most likely the two are not related.

asekretenko commented 3 years ago

Apparently this flake is rather old, and dates back to some point before the 0.18.0 release.

The earliest flake of this kind that I could extract from Circle CI: https://app.circleci.com/pipelines/github/kudobuilder/kudo/5544/workflows/8743fa9f-aaee-46dd-9b8c-d489a2bcc672/jobs/16762

This doesn't mean there are no older flakes of this kind; it is just that viewing them manually is becoming more and more difficult due to Circle CI limitations.