E2E Failures due to Karpenter node consolidation

rifelpet commented 9 months ago

/kind flake

The Karpenter prow jobs are flaky because of Karpenter's aggressive node churn.

This run failed while waiting for the cluster to pass validation. According to the Karpenter logs, nodes were terminated during validation:

2023-11-29T08:18:07.067Z    INFO    controller.termination  cordoned node   {"commit": "637a642", "node": "i-097c54da93cda5b1a"}
2023-11-29T08:18:07.500Z    INFO    controller.termination  deleted node    {"commit": "637a642", "node": "i-097c54da93cda5b1a"}
2023-11-29T08:18:07.873Z    INFO    controller.machine.termination  deleted machine {"commit": "637a642", "machine": "nodes-2mgkk", "provisioner": "nodes", "node": "i-097c54da93cda5b1a", "provider-id": "aws:///ap-southeast-1b/i-097c54da93cda5b1a"}

This run passed validation but had flaky e2e tests that timed out while waiting for pods to be scheduled and running. The Karpenter logs show multiple terminations and launches during the e2e suite:

2023-11-27T16:26:33.865Z    INFO    controller.machine.lifecycle    launched machine    {"commit": "637a642", "machine": "nodes-q26rn", "provisioner": "nodes", "provider-id": "aws:///ca-central-1d/i-09133cb122f1bdfc4", "instance-type": "t4g.small", "zone": "ca-central-1d", "capacity-type": "spot", "allocatable": {"cpu":"1930m","ephemeral-storage":"17Gi","memory":"1359Mi","pods":"11"}}
2023-11-27T16:28:23.000Z    INFO    controller.machine.lifecycle    launched machine    {"commit": "637a642", "machine": "nodes-ts47b", "provisioner": "nodes", "provider-id": "aws:///ca-central-1d/i-070ecd4067d8245a5", "instance-type": "t4g.small", "zone": "ca-central-1d", "capacity-type": "spot", "allocatable": {"cpu":"1930m","ephemeral-storage":"17Gi","memory":"1359Mi","pods":"11"}}
2023-11-27T16:33:35.500Z    INFO    controller.machine.termination  deleted machine {"commit": "637a642", "machine": "nodes-ts47b", "provisioner": "nodes", "node": "i-070ecd4067d8245a5", "provider-id": "aws:///ca-central-1d/i-070ecd4067d8245a5"}

2023-11-27T16:35:12 e2e tests started

2023-11-27T16:41:30.026Z    INFO    controller.machine.lifecycle    launched machine    {"commit": "637a642", "machine": "nodes-gk7q9", "provisioner": "nodes", "provider-id": "aws:///ca-central-1d/i-039cf7c143e5e967b", "instance-type": "t4g.medium", "zone": "ca-central-1d", "capacity-type": "spot", "allocatable": {"cpu":"1930m","ephemeral-storage":"17Gi","memory":"3187Mi","pods":"17"}}
2023-11-27T16:42:58.639Z    INFO    controller.machine.termination  deleted machine {"commit": "637a642", "machine": "nodes-q26rn", "provisioner": "nodes", "node": "i-09133cb122f1bdfc4", "provider-id": "aws:///ca-central-1d/i-09133cb122f1bdfc4"}
2023-11-27T16:43:04.284Z    INFO    controller.machine.lifecycle    launched machine    {"commit": "637a642", "machine": "nodes-2jnxv", "provisioner": "nodes", "provider-id": "aws:///ca-central-1d/i-046145be80afd6a9b", "instance-type": "t4g.small", "zone": "ca-central-1d", "capacity-type": "spot", "allocatable": {"cpu":"1930m","ephemeral-storage":"17Gi","memory":"1359Mi","pods":"11"}}
2023-11-27T16:43:14.226Z    INFO    controller.machine.lifecycle    launched machine    {"commit": "637a642", "machine": "nodes-zgt9g", "provisioner": "nodes", "provider-id": "aws:///ca-central-1d/i-0a8f2ffefe5eca0a2", "instance-type": "t4g.medium", "zone": "ca-central-1d", "capacity-type": "spot", "allocatable": {"cpu":"1930m","ephemeral-storage":"17Gi","memory":"3187Mi","pods":"17"}}
2023-11-27T16:46:11.915Z    INFO    controller.machine.termination  deleted machine {"commit": "637a642", "machine": "nodes-zgt9g", "provisioner": "nodes", "node": "i-0a8f2ffefe5eca0a2", "provider-id": "aws:///ca-central-1d/i-0a8f2ffefe5eca0a2"}
2023-11-27T16:50:11.603Z    INFO    controller.machine.termination  deleted machine {"commit": "637a642", "machine": "nodes-2jnxv", "provisioner": "nodes", "node": "i-046145be80afd6a9b", "provider-id": "aws:///ca-central-1d/i-046145be80afd6a9b"}

2023-11-27T16:52:25 e2e tests finished
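
To quantify the churn in a run like this, the launches and deletions that fall inside the e2e window can be tallied from the Karpenter log. The following is a minimal Python sketch, not part of kops or Karpenter: the function names and the regular expression are illustrative, and it assumes the e2e start/finish times are UTC like the log timestamps. The sample lines are the ones quoted above with their JSON payloads trimmed to the machine name.

```python
import json
import re
from collections import Counter
from datetime import datetime, timezone

# timestamp, level, logger, free-text message, then a JSON payload.
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+(\S+)\s+(.*?)\s*(\{.*\})$")


def parse_line(line):
    """Return (timestamp, message, fields), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts, _level, _logger, message, payload = m.groups()
    when = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    return when, message, json.loads(payload)


def churn_in_window(lines, start, end):
    """Count machine launches and deletions logged between start and end."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        when, message, _fields = parsed
        if start <= when <= end and message in ("launched machine", "deleted machine"):
            counts[message] += 1
    return counts


# Assumption: the e2e start/finish times share the log's clock (UTC).
E2E_START = datetime(2023, 11, 27, 16, 35, 12, tzinfo=timezone.utc)
E2E_END = datetime(2023, 11, 27, 16, 52, 25, tzinfo=timezone.utc)

# Log lines from this issue, payloads trimmed for brevity.
SAMPLE = [
    '2023-11-27T16:26:33.865Z    INFO    controller.machine.lifecycle    launched machine    {"machine": "nodes-q26rn"}',
    '2023-11-27T16:28:23.000Z    INFO    controller.machine.lifecycle    launched machine    {"machine": "nodes-ts47b"}',
    '2023-11-27T16:33:35.500Z    INFO    controller.machine.termination  deleted machine {"machine": "nodes-ts47b"}',
    '2023-11-27T16:41:30.026Z    INFO    controller.machine.lifecycle    launched machine    {"machine": "nodes-gk7q9"}',
    '2023-11-27T16:42:58.639Z    INFO    controller.machine.termination  deleted machine {"machine": "nodes-q26rn"}',
    '2023-11-27T16:43:04.284Z    INFO    controller.machine.lifecycle    launched machine    {"machine": "nodes-2jnxv"}',
    '2023-11-27T16:43:14.226Z    INFO    controller.machine.lifecycle    launched machine    {"machine": "nodes-zgt9g"}',
    '2023-11-27T16:46:11.915Z    INFO    controller.machine.termination  deleted machine {"machine": "nodes-zgt9g"}',
    '2023-11-27T16:50:11.603Z    INFO    controller.machine.termination  deleted machine {"machine": "nodes-2jnxv"}',
]

if __name__ == "__main__":
    print(dict(churn_in_window(SAMPLE, E2E_START, E2E_END)))
```

For the run above this reports three launches and three deletions between the start and finish of the e2e suite.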

Example test failure caused by a pod stuck in Pending:

  [FAILED] Timed out after 300.001s.
  Expected
      <v1.PodPhase>: Pending
  to equal
      <v1.PodPhase>: Succeeded
  In [It] at: test/e2e/common/node/runtime.go:157 @ 11/27/23 16:40:23.477
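
The failure above can be lined up against the log timeline by subtracting the 300.001s timeout from the failure timestamp. A small Python sketch of that arithmetic follows. It assumes the `@` timestamp is in UTC like the Karpenter logs and marks the moment the timeout fired; neither is confirmed by the test output itself.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

# Values copied from the failure message and the Karpenter log in this issue.
failed_at = datetime(2023, 11, 27, 16, 40, 23, 477000, tzinfo=UTC)
timeout = timedelta(seconds=300, milliseconds=1)  # "Timed out after 300.001s"
e2e_started = datetime(2023, 11, 27, 16, 35, 12, tzinfo=UTC)
next_launch = datetime(2023, 11, 27, 16, 41, 30, 26000, tzinfo=UTC)  # nodes-gk7q9

# Approximate time the test started waiting for the pod.
wait_began = failed_at - timeout

print("wait began:", wait_began.isoformat())
print("seconds after e2e start:", (wait_began - e2e_started).total_seconds())
print("next launch after failure (s):", (next_launch - failed_at).total_seconds())
```

Under those assumptions the wait began about 11 seconds after the e2e tests started, and the next machine launch in the log came about 67 seconds after the timeout. That is consistent with the pod having nowhere to schedule for the whole wait, although the logs alone do not prove it.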

The k/k e2e suite also inspects the state of the cluster at the beginning to determine the number of nodes, zone spread, etc., so later node churn can leave those values out of date.

We should aim for node stability during e2e tests. We currently enable Karpenter's consolidation, but should probably disable it for e2e:

https://github.com/kubernetes/kops/blob/03779878069ec0362e33fe308d3dd789e8d516d5/upup/models/cloudup/resources/addons/karpenter.sh/k8s-1.19.yaml.template#L1815-L1817

Note that newer Karpenter versions introduced a v1beta1 API with significant changes, so any new InstanceGroup API fields should probably reflect the v1beta1 API's terminology and/or schema.

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

rifelpet commented 6 months ago

/remove-lifecycle stale

k8s-triage-robot commented 3 months ago


/lifecycle stale

rifelpet commented 2 months ago

/remove-lifecycle stale

kubernetes/kops issue #16142