kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

kube-flannel breaks in Kubernetes upgrade to 1.25 due to namespace mismatch #15204

Closed: mkoepke-xion closed this issue 6 months ago

mkoepke-xion commented 1 year ago

/kind bug

1. What kops version are you running? The command kops version will display this information.

1.23.2 -> 1.25.3

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

1.23.16 -> 1.24.11 -> 1.25.7

3. What cloud provider are you using?

openstack

4. What commands did you run? What is the simplest way to reproduce this issue?

- Install Kubernetes 1.23.16 using kOps 1.23
- Upgrade kOps to 1.25
- Upgrade Kubernetes to 1.24.11 (kops edit cluster, kops update cluster --yes, kops rolling-update cluster --yes)
- Upgrade Kubernetes to 1.25.7 (same sequence; full commands sketched below)
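
In full, the sequence per minor version looked roughly like this (the state store is a placeholder here, our real value is redacted; the version bump itself happens interactively in the editor):

export KOPS_STATE_STORE=swift://kops-state-store-xxx   # placeholder, real store redacted
export NAME=dev.kops-1-25.k8s.local

# bump spec.kubernetesVersion in the editor (e.g. 1.23.16 -> 1.24.11)
kops edit cluster --name ${NAME}

# push the updated configuration to the cloud resources
kops update cluster --name ${NAME} --yes

# replace nodes one by one; kOps validates the cluster between nodes
kops rolling-update cluster --name ${NAME} --yes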

5. What happened after the commands executed?

The rolling update broke on the first master node with:

I0308 12:56:01.617102    1882 instancegroups.go:533] Cluster did not pass validation, will retry in "30s": system-node-critical pod "kube-flannel-ds-mggm2" is not ready (kube-flannel).
I0308 12:56:33.321708    1882 instancegroups.go:530] Cluster did not pass validation within deadline: system-node-critical pod "kube-flannel-ds-mggm2" is not ready (kube-flannel).
E0308 12:56:33.322263    1882 instancegroups.go:482] Cluster did not validate within 15m0s
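
For reference, the same validation can also be run directly, outside of the rolling update:

kops validate cluster --name dev.kops-1-25.k8s.local --wait 10m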

The issue was that two kube-flannel DaemonSets were now running, one in kube-system and one in kube-flannel:

# kubectl get daemonsets.apps -A
NAMESPACE      NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
kube-flannel   kube-flannel-ds            6         6         6       6            6           <none>          23m
kube-system    csi-cinder-nodeplugin      6         6         6       6            6           <none>          20h
kube-system    kops-controller            3         3         3       3            3           <none>          20h
kube-system    kube-flannel-ds            6         6         5       6            5           <none>          20h
kube-system    openstack-cloud-provider   3         3         3       3            3           <none>          20h
xxx   promtail                   3         3         3       3            3           <none>          20h
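
To see which of the two DaemonSets owns the failing pod, checking the pod's namespace and its controller is enough (kube-flannel-ds-mggm2 is the pod from the validation error above):

kubectl get pods -A | grep flannel
kubectl -n kube-system describe pod kube-flannel-ds-mggm2 | grep "Controlled By"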

6. What did you expect to happen?

I expected the upgrade to go smoothly, since the update to Kubernetes 1.24 had also been done using kOps 1.25. I would have expected kube-flannel to stay in kube-system. If kOps wants to move it to the kube-flannel namespace, I would have expected kOps to clean up the old DaemonSet in kube-system, or at least to add a hint in the Breaking changes section.

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2023-03-07T16:50:25Z"
  generation: 2
  name: dev.kops-1-25.k8s.local
spec:
  addons:
  - manifest: xxxx/dev/dev.kops-1-25.k8s.local/logging.xxx/addon.yaml
  - manifest: xxxx/dev/dev.kops-1-25.k8s.local/namespace.xxx/addon.yaml
  - manifest: xxxx/dev/dev.kops-1-25.k8s.local/kube-state-metric.xxx/addon.yaml
  - manifest: xxxx/dev/dev.kops-1-25.k8s.local/metrics.xxx/addon.yaml
  - manifest: xxxx/dev/dev.kops-1-25.k8s.local/service-ip-label-webhook.admission-webhooks.xxx/addon.yaml
  - manifest: xxxx/dev/dev.kops-1-25.k8s.local/storage.xxx/addon.yaml
  - manifest: xxxx/dev/dev.kops-1-25.k8s.local/dashboard.xxx/addon.yaml
  - manifest: xxxx/dev/dev.kops-1-25.k8s.local/ingress.xxx/addon.yaml
  - manifest: xxxx/dev/dev.kops-1-25.k8s.local/quality-assurance.xxx/addon.yaml
  - manifest: xxxx/dev/dev.kops-1-25.k8s.local/auth.xxx/addon.yaml
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  certManager:
    enabled: false
  channel: stable
  cloudConfig:
    openstack:
      blockStorage:
        bs-version: v2
        createStorageClass: false
        ignore-volume-az: true
        override-volume-az: nova
      loadbalancer:
        floatingNetwork: public
        floatingNetworkID: 91371e55-9cc1-4ed0-bbdc-a7476669b4bd
        manageSecurityGroups: true
        method: ROUND_ROBIN
        provider: haproxy
        useOctavia: false
      monitor:
        delay: 1m
        maxRetries: 3
        timeout: 30s
      router:
        externalNetwork: public
  cloudControllerManager:
    clusterName: dev.kops-1-25.k8s.local
    image: k8scloudprovider/openstack-cloud-controller-manager:v1.19.2
  cloudProvider: openstack
  configBase: xxx
  containerRuntime: containerd
  docker:
    registryMirrors:
    - xxx
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: master-zone-01
      name: etcd-zone-01
      volumeSize: 2
      volumeType: fast-1000
    - instanceGroup: master-zone-02
      name: etcd-zone-02
      volumeSize: 2
      volumeType: fast-1000
    - instanceGroup: master-zone-03
      name: etcd-zone-03
      volumeSize: 2
      volumeType: fast-1000
    manager:
      env:
      - name: ETCD_MANAGER_HOURLY_BACKUPS_RETENTION
        value: 7d
      - name: ETCD_MANAGER_DAILY_BACKUPS_RETENTION
        value: 14d
    memoryRequest: 100Mi
    name: main
    provider: Manager
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: master-zone-01
      name: etcd-zone-01
      volumeSize: 2
      volumeType: fast-1000
    - instanceGroup: master-zone-02
      name: etcd-zone-02
      volumeSize: 2
      volumeType: fast-1000
    - instanceGroup: master-zone-03
      name: etcd-zone-03
      volumeSize: 2
      volumeType: fast-1000
    manager:
      env:
      - name: ETCD_MANAGER_HOURLY_BACKUPS_RETENTION
        value: 7d
      - name: ETCD_MANAGER_DAILY_BACKUPS_RETENTION
        value: 14d
    memoryRequest: 100Mi
    name: events
    provider: Manager
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    allowPrivileged: true
    oidcClientID: xxx
    oidcGroupsClaim: xxx
    oidcIssuerURL: xxx
    oidcUsernameClaim: xxx
  kubeProxy:
    metricsBindAddress: 0.0.0.0
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.25.7
  masterInternalName: api.internal.dev.kops-1-25.k8s.local
  masterPublicName: api.dev.kops-1-25.k8s.local
  metricsServer:
    enabled: true
    insecure: true
  networkCIDR: 10.0.0.0/20
  networking:
    flannel:
      backend: vxlan
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  sshKeyName: xxx
  subnets:
  - cidr: 10.0.1.0/24
    name: zone01
    type: Private
    zone: local_zone_01
  - cidr: 10.0.2.0/24
    name: zone02
    type: Private
    zone: local_zone_02
  - cidr: 10.0.3.0/24
    name: zone03
    type: Private
    zone: local_zone_03
  - cidr: 10.0.15.0/29
    name: utility-zone01
    type: Utility
    zone: local_zone_01
  topology:
    bastion:
      bastionPublicName: bastion.dev.kops-1-25.k8s.local
    dns:
      type: Private
    masters: private
    nodes: private

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?

Deleting the old kube-flannel DaemonSet in kube-system resolved the issue. I found an older report of kube-flannel breaking in https://github.com/kubernetes/kops/issues/12388 and used the command given there:

kubectl --namespace=kube-system delete daemonsets.apps kube-flannel-ds
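
Before restarting the rolling update, a quick check that only the new DaemonSet in the kube-flannel namespace is left:

kubectl get daemonsets.apps -A | grep flannel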

I restarted the rolling update and this time it succeeded:

I0308 13:25:59.233487    1931 rollingupdate.go:214] Rolling update completed for cluster "dev.kops-1-25.k8s.local"!
k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

mkoepke-xion commented 1 year ago

/remove-lifecycle rotten

gamer22026 commented 1 year ago

This just happened to us upgrading from v1.24.11 to v1.25.11 using the latest kOps 1.26.4. Either kOps should clean up the remnants of kube-flannel in the kube-system namespace, or at the very least document that it must be manually removed after upgrading.
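
Even a short note in the 1.25 release notes pointing at the workaround above would help, e.g. that the old DaemonSet has to be removed by hand after the upgrade:

kubectl --namespace=kube-system delete daemonsets.apps kube-flannel-ds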

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 6 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/kops/issues/15204#issuecomment-2016816528):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.