siderolabs / omni

SaaS-simple deployment of Kubernetes - on your own hardware.
Other
401 stars 23 forks source link

[bug] after patch controle plane stuck in reconfiguring #239

Closed aladante closed 1 month ago

aladante commented 1 month ago

Is there an existing issue for this?

Current Behavior

I have provisioned a cluster with omni. Accidetly put kube-system under

cluster:
  apiServer:
    admissionControl:
      - name: PodSecurity
        configuration:
          apiVersion: pod-security.admission.config.k8s.io/v1alpha1
          defaults:
            audit: baseline
            audit-version: latest
            enforce: baseline
            enforce-version: latest
            warn: baseline
            warn-version: latest
          exemptions:
            namespaces:
              - kube-system

The controle plane hang indefinitely in status configuring and api isn't accessible anymore.

Expected Behavior

The changes are applied and the double entry of kube-system is removed.

Steps To Reproduce

talos 1.6.7 kubernetes 1.28.6 hcloud image

controller patch

 machine:
  kubelet:
    extraArgs:
      cloud-provider: external
      rotate-server-certificates: true
  features:
    kubernetesTalosAPIAccess:
      enabled: true
      allowedRoles:
        - os:reader
      allowedKubernetesNamespaces:
        - kube-system
cluster:
  apiServer:
    admissionControl:
      - name: PodSecurity
        configuration:
          apiVersion: pod-security.admission.config.k8s.io/v1alpha1
          defaults:
            audit: baseline
            audit-version: latest
            enforce: baseline
            enforce-version: latest
            warn: baseline
            warn-version: latest
          exemptions:
            namespaces:
              - cert-manager
              - cilium
              - longhorn
              - openebs
              - monitoring
            runtimeClasses: []
            usernames: []
          kind: PodSecurityConfiguration

worker patch

# only apply to workers
machine:
  kubelet:
    extraArgs:
      cloud-provider: external
      rotate-server-certificates: true
    extraMounts:
      - destination: /var/lib/longhorn
        type: bind
        source: /var/lib/longhorn
        options:
          - bind
          - rshared
          - rw
      - destination: /var/lib/openebs
        type: bind
        source: /var/lib/openebs
        options:
          - bind
          - rshared
          - rw

cluster patch

cluster:
  network:
    cni:
      name: none
  proxy:
    disabled: true
  externalCloudProvider:
    enabled: true
    manifests:
      - https://raw.githubusercontent.com/siderolabs/talos-cloud-controller-manager/main/docs/deploy/cloud-controller-manager.yml
  1. Deploy patches
  2. add namespace kube-system to admissionControl
  3. try to revert step two because it breaks the omni provisioning

after this controle plans will be unavailable without any way for the user to recover.

What browsers are you seeing the problem on?

Firefox

Anything else?

No response

smira commented 1 month ago

Talos machine configuration patches merge list by default, so having kube-system merged with default machine config will result in duplicate kube-system. If you fix the patch, the controlplane will be up.

Documentation.

Unix4ever commented 1 month ago

The actual bug here is that Omni doesn't allow you to revert the malformed patch if all kube-apiservers are down. That we should fix for sure.