kudobuilder / kudo

Kubernetes Universal Declarative Operator (KUDO)
https://kudo.dev
Apache License 2.0
1.18k stars 103 forks source link

e2e tests fail due to resource starvation #1182

Open porridge opened 4 years ago

porridge commented 4 years ago

What happened:

That test failed twice in a row:

The failure was:

case.go:155: resource Instance:kudo-test-fond-cod/zk: .status.aggregatedStatus.status: value mismatch, expected: COMPLETE != actual: IN_PROGRESS
case.go:155: resource StatefulSet:kudo-test-fond-cod/zk-zookeeper: .status.readyReplicas: value mismatch, expected: 3 != actual: 2

Event dump:

            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:48:02 +0000 UTC Normal  PlanStarted Execution of plan deploy started
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:48:11 +0000 UTC Normal  WaitForFirstConsumer    waiting for first consumer to be created before binding
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:48:11 +0000 UTC Normal  WaitForFirstConsumer    waiting for first consumer to be created before binding
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:48:11 +0000 UTC Normal  WaitForFirstConsumer    waiting for first consumer to be created before binding
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:48:11 +0000 UTC Warning FailedScheduling    0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:48:11 +0000 UTC Warning FailedScheduling    0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:48:11 +0000 UTC Normal  SuccessfulCreate    create Claim zk-datadir-zk-zookeeper-0 Pod zk-zookeeper-0 in StatefulSet zk-zookeeper success
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:48:11 +0000 UTC Normal  SuccessfulCreate    create Pod zk-zookeeper-0 in StatefulSet zk-zookeeper successful
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:48:11 +0000 UTC Normal  SuccessfulCreate    create Claim zk-datadir-zk-zookeeper-1 Pod zk-zookeeper-1 in StatefulSet zk-zookeeper success
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:48:11 +0000 UTC Normal  SuccessfulCreate    create Pod zk-zookeeper-1 in StatefulSet zk-zookeeper successful
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:48:11 +0000 UTC Normal  SuccessfulCreate    create Claim zk-datadir-zk-zookeeper-2 Pod zk-zookeeper-2 in StatefulSet zk-zookeeper success
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:48:12 +0000 UTC Warning FailedScheduling    0/1 nodes are available: 1 node(s) had taints that the pod didn't tolerate.
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:48:12 +0000 UTC Normal  SuccessfulCreate    create Pod zk-zookeeper-2 in StatefulSet zk-zookeeper successful
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:48:54 +0000 UTC Normal  ExternalProvisioning    waiting for a volume to be created, either by external provisioner "rancher.io/local-path" or manually created by system administrator
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:48:54 +0000 UTC Normal  ExternalProvisioning    waiting for a volume to be created, either by external provisioner "rancher.io/local-path" or manually created by system administrator
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:48:54 +0000 UTC Normal  ExternalProvisioning    waiting for a volume to be created, either by external provisioner "rancher.io/local-path" or manually created by system administrator
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:48:58 +0000 UTC Normal  Provisioning    External provisioner is provisioning volume for claim "kudo-test-fond-cod/zk-datadir-zk-zookeeper-0"
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:48:59 +0000 UTC Normal  Provisioning    External provisioner is provisioning volume for claim "kudo-test-fond-cod/zk-datadir-zk-zookeeper-1"
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:00 +0000 UTC Normal  Provisioning    External provisioner is provisioning volume for claim "kudo-test-fond-cod/zk-datadir-zk-zookeeper-2"
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:02 +0000 UTC Normal  ProvisioningSucceeded   Successfully provisioned volume pvc-928b5f78-c695-450b-8db7-bbf5b09232e3
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:02 +0000 UTC Normal  Scheduled   Successfully assigned kudo-test-fond-cod/zk-zookeeper-0 to kind-control-plane
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:03 +0000 UTC Normal  ProvisioningSucceeded   Successfully provisioned volume pvc-7cb63aa0-9066-4452-a972-737230d038b5
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:03 +0000 UTC Normal  Scheduled   Successfully assigned kudo-test-fond-cod/zk-zookeeper-1 to kind-control-plane
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:04 +0000 UTC Normal  ProvisioningSucceeded   Successfully provisioned volume pvc-3e6d54cb-d725-4ccc-8f79-d590897d1691
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:04 +0000 UTC Normal  Pulling Pulling image "zookeeper:3.4.14"
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:04 +0000 UTC Warning OutOfcpu    Node didn't have enough resource: cpu, requested: 300, used: 1750, capacity: 2000
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:04 +0000 UTC Normal  Scheduled   Successfully assigned kudo-test-fond-cod/zk-zookeeper-2 to kind-control-plane
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:06 +0000 UTC Normal  Pulling Pulling image "zookeeper:3.4.14"
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:06 +0000 UTC Normal  Scheduled   Successfully assigned kudo-test-fond-cod/zk-zookeeper-2 to kind-control-plane
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:06 +0000 UTC Warning OutOfcpu    Node didn't have enough resource: cpu, requested: 300, used: 1750, capacity: 2000
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:06 +0000 UTC Warning RecreatingFailedPod StatefulSet kudo-test-fond-cod/zk-zookeeper is recreating failed Pod zk-zookeeper-2
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:06 +0000 UTC Normal  SuccessfulDelete    delete Pod zk-zookeeper-2 in StatefulSet zk-zookeeper successful
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:07 +0000 UTC Warning OutOfcpu    Node didn't have enough resource: cpu, requested: 300, used: 1750, capacity: 2000
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:07 +0000 UTC Normal  Scheduled   Successfully assigned kudo-test-fond-cod/zk-zookeeper-2 to kind-control-plane
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:08 +0000 UTC Normal  Scheduled   Successfully assigned kudo-test-fond-cod/zk-zookeeper-2 to kind-control-plane
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:08 +0000 UTC Warning OutOfcpu    Node didn't have enough resource: cpu, requested: 300, used: 1750, capacity: 2000
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:09 +0000 UTC Warning FailedScheduling    0/1 nodes are available: 1 Insufficient cpu.
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:12 +0000 UTC Normal  Pulled  Successfully pulled image "zookeeper:3.4.14"
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:12 +0000 UTC Normal  Created Created container kubernetes-zookeeper
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:12 +0000 UTC Normal  Started Started container kubernetes-zookeeper
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:12 +0000 UTC Normal  Pulled  Successfully pulled image "zookeeper:3.4.14"
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:12 +0000 UTC Normal  Created Created container kubernetes-zookeeper
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | 2019-12-16 09:49:13 +0000 UTC Normal  Started Started container kubernetes-zookeeper
            logger.go:37: 09:53:07 | zookeeper-upgrade-test | Deleting namespace: kudo-test-fond-cod

What you expected to happen:

Ideally, not be flaky.

When it fails, we need more information to debug this. The events alone are not enough. It seems like there might be some resource shortage based on the Node didn't have enough resource messages, but it is not clear whether it's a permanent or transient issue.

In case of failure, we also need:

How to reproduce it (as minimally and precisely as possible):

:man_shrugging:

Anything else we need to know?:

Perhaps this is related to https://github.com/kudobuilder/kudo/issues/1114

porridge commented 4 years ago

FTR this was worked around by decreasing the resource requirements of ZK in https://github.com/kudobuilder/operators/pull/162

porridge commented 4 years ago

We think that this might be an interaction with another (kafka) test which is running in parallel, consuming precious resources.