kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

GCE cluster deletion fails when unable to list InstanceGroupManagers #16594

Open learnitall opened 1 month ago

learnitall commented 1 month ago

/kind bug

1. What kops version are you running? The command kops version will display this information.

Client version: 1.29.0-alpha.2 (git-v1.29.0-alpha.2)

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

1.29.3

3. What cloud provider are you using?

GCE

4. What commands did you run? What is the simplest way to reproduce this issue?

All commands were run as part of a GitHub Actions Workflow, see https://github.com/cilium/cilium/actions/runs/9308220473/workflow for the workflow file and https://github.com/cilium/cilium/actions/runs/9308220473/job/25621148334 for the failed run.

5. What happened after the commands executed?

The cluster was created successfully, but the delete command failed after the GCE API returned a request quota error while listing InstanceGroupManagers.
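
For reference, a minimal sketch of the kind of GCE API call involved, using google.golang.org/api/compute/v1. This is not kOps' actual deletion code, and the project/zone values are placeholders; it only shows where a quota error from the API would surface.

```go
// Minimal sketch (not kOps' code): listing InstanceGroupManagers in a zone.
// A request-quota error from the GCE API comes back as the error from Do().
package main

import (
	"context"
	"fmt"
	"log"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()

	// Uses Application Default Credentials.
	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder project/zone; the real values come from the cluster spec.
	igms, err := svc.InstanceGroupManagers.List("my-project", "us-west1-a").Do()
	if err != nil {
		// This is where a quota / rate-limit error (*googleapi.Error) shows up.
		log.Fatalf("listing InstanceGroupManagers failed: %v", err)
	}

	for _, igm := range igms.Items {
		fmt.Println(igm.Name)
	}
}
```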

6. What did you expect to happen?

kOps would detect that the error was a request quota error and keep retrying until the request succeeded.
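
To illustrate the expected behavior, here is a hedged sketch of how such an error could be recognized and retried with backoff. isQuotaError and withQuotaRetry are hypothetical helpers, not existing kOps functions, and the exact set of reason strings to treat as retryable is an assumption.

```go
// Hedged sketch (not kOps' implementation): recognizing a GCE request-quota
// error and retrying the operation with exponential backoff.
package retryquota

import (
	"errors"
	"log"
	"time"

	"google.golang.org/api/googleapi"
)

// isQuotaError reports whether err looks like a retryable quota / rate-limit
// error from the GCE API. The reason strings checked here are an assumption.
func isQuotaError(err error) bool {
	var gerr *googleapi.Error
	if !errors.As(err, &gerr) {
		return false
	}
	if gerr.Code == 429 {
		return true
	}
	for _, item := range gerr.Errors {
		if item.Reason == "rateLimitExceeded" || item.Reason == "quotaExceeded" {
			return true
		}
	}
	return false
}

// withQuotaRetry is a hypothetical helper that keeps retrying fn with
// exponential backoff for as long as it fails with a quota error.
func withQuotaRetry(maxAttempts int, fn func() error) error {
	delay := 2 * time.Second
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = fn()
		if err == nil || !isQuotaError(err) {
			return err
		}
		log.Printf("quota error on attempt %d, retrying in %s: %v", attempt, delay, err)
		time.Sleep(delay)
		delay *= 2
	}
	return err
}
```

With something like this wrapping the InstanceGroupManagers list call, the delete would keep retrying until the quota window resets instead of aborting.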

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2024-05-30T19:42:15Z"
  name: scale-100-9308220473-1.k8s.local
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudConfig: {}
  cloudProvider: gce
  configBase: ***/scale-100-9308220473-1.k8s.local
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: control-plane-us-west1-a-1
      name: etcd-1
    - instanceGroup: control-plane-us-west1-a-2
      name: etcd-2
    - instanceGroup: control-plane-us-west1-a-3
      name: etcd-3
    - instanceGroup: control-plane-us-west1-a-4
      name: etcd-4
    - instanceGroup: control-plane-us-west1-a-5
      name: etcd-5
    manager:
      backupRetentionDays: 90
      listenMetricsURLs:
      - http://localhost:2382
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: control-plane-us-west1-a-1
      name: etcd-1
    - instanceGroup: control-plane-us-west1-a-2
      name: etcd-2
    - instanceGroup: control-plane-us-west1-a-3
      name: etcd-3
    - instanceGroup: control-plane-us-west1-a-4
      name: etcd-4
    - instanceGroup: control-plane-us-west1-a-5
      name: etcd-5
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    anonymousAuth: true
    enableContentionProfiling: true
    enableProfiling: true
  kubeControllerManager:
    enableContentionProfiling: true
    enableProfiling: true
  kubeScheduler:
    authorizationAlwaysAllowPaths:
    - /metrics
    - /healthz
    enableContentionProfiling: true
    enableProfiling: true
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  - ::/0
  kubernetesVersion: 1.29.3
  networking:
    cni: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  project: ***
  sshAccess:
  - 0.0.0.0/0
  - ::/0
  subnets:
  - cidr: 10.16.0.0/16
    name: us-west1
    region: us-west1
    type: Public
  topology:
    dns:
      type: None

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

Working on this now.

9. Anything else we need to know?

Automated retries on quota limits are super helpful for using kOps in CI workflows, since we don't have to wrap it in logic that retries operations when such limits are hit. Thanks!