nberlee / talos

Friendly fork for Turing RK1 on Talos
https://www.talos.dev
Mozilla Public License 2.0
Cilium crash with Talos 1.8.1 after operator rollout #9

Closed: rducom closed this issue 2 weeks ago

rducom commented 2 weeks ago

Bug Report

Cilium crashes after rollout, and the kernel repeatedly logs "[CRIT] Dead loop on virtual device cilium_vxlan, fix it urgently!". I didn't find a way to restore the system beyond this point, other than flashing the nodes again.

Minimal repro :

Then all nodes (control plane and workers) go into the dead loop.

Description

Logs

A loop of:

kern: crit: [2024-10-30T23:49:59.75932735Z]: Dead loop on virtual device cilium_vxlan, fix it urgently!
kern: crit: [2024-10-30T23:49:59.79143835Z]: Dead loop on virtual device cilium_vxlan, fix it urgently!
kern: crit: [2024-10-30T23:50:00.01505335Z]: Dead loop on virtual device cilium_vxlan, fix it urgently!
kern: crit: [2024-10-30T23:50:01.03912835Z]: Dead loop on virtual device cilium_vxlan, fix it urgently!
kern: warning: [2024-10-30T23:50:03.08683535Z]: net_ratelimit: 7 callbacks suppressed
kern: crit: [2024-10-30T23:50:03.08685035Z]: Dead loop on virtual device cilium_vxlan, fix it urgently!
kern: crit: [2024-10-30T23:50:03.75894435Z]: Dead loop on virtual device cilium_vxlan, fix it urgently!
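To check whether a node is still stuck in this loop, the messages can be tallied per device. This is only a minimal sketch: it assumes kernel log text in the format shown above arrives on stdin, and the function name count_dead_loops is hypothetical, not something from Talos or Cilium.

```shell
# Hypothetical helper: count "Dead loop" kernel messages per virtual device.
# Reads kernel log text on stdin and prints "<device> <count>" for each device.
count_dead_loops() {
  # grep -o emits one line per match, even if several messages share a line.
  grep -o 'Dead loop on virtual device [^,]*' \
    | awk '{ count[$NF]++ } END { for (dev in count) print dev, count[dev] }'
}
```

It could be fed from a node's kernel log, for example with talosctl --nodes <node-ip> dmesg | count_dead_loops, where <node-ip> is a placeholder. No output means no dead-loop message was found.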

Environment

rducom commented 2 weeks ago

Other users are having the same issue upstream: https://github.com/siderolabs/talos/issues/9102

Nico, since it's not specific to your fork, I'm closing the issue here.

rducom commented 2 weeks ago

For others, here's a working machine config (mc) patch:

machine:
  install:
    disk: /dev/mmcblk0
  kernel:
    modules:
      - name: rockchip-cpufreq
cluster:
  etcd:
    advertisedSubnets:
      - 192.168.1.0/24
  allowSchedulingOnControlPlanes: true
  network:
    cni:
      name: none
    podSubnets:
      - 10.42.0.0/20
    serviceSubnets:
      - 10.42.16.0/20
  proxy:
    disabled: true
  apiServer:
    admissionControl:
      - name: PodSecurity
        configuration:
          exemptions:
            namespaces:
              - cilium-test-1
              - rook-ceph
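One way to apply a patch like this to nodes that are already running is talosctl patch machineconfig. The sketch below assumes the YAML above was saved to a file; the function name apply_patch, the file name and the node addresses are placeholders, not values from this issue.

```shell
# Sketch: apply a machine config patch file to each listed node.
# Usage: apply_patch <patch-file> <node-ip> [<node-ip> ...]
apply_patch() {
  local patch_file=$1
  shift
  local node
  for node in "$@"; do
    # "@file" tells talosctl to read the patch from a file instead of inline.
    talosctl patch machineconfig --nodes "$node" --patch "@${patch_file}"
  done
}
```

For example: apply_patch patch.yaml 192.168.1.10 192.168.1.11. For a fresh cluster, the same file could instead be passed at generation time with talosctl gen config ... --config-patch @patch.yaml.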

Then install Cilium with:

helm repo add cilium https://helm.cilium.io/
helm repo update cilium
CILIUM_LATEST=$(helm search repo cilium --versions --output yaml | yq '.[0].version')
helm install cilium cilium/cilium \
    --version "${CILIUM_LATEST}" \
    --namespace kube-system \
    --set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
    --set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
    --set cgroup.autoMount.enabled=false \
    --set cgroup.hostRoot=/sys/fs/cgroup \
    --set l2announcements.enabled=true \
    --set kubeProxyReplacement=true \
    --set loadBalancer.acceleration=native \
    --set k8sServiceHost=127.0.0.1 \
    --set k8sServicePort=7445 \
    --set bpf.masquerade=true \
    --set ingressController.enabled=true \
    --set ingressController.default=true \
    --set ingressController.loadbalancerMode=dedicated \
    --set ipam.mode=cluster-pool \
    --set ipam.operator.clusterPoolIPv4PodCIDRList="10.42.32.0/20" \
    --set hubble.relay.enabled=true \
    --set hubble.ui.enabled=true \
    --set gatewayAPI.enabled=true \
    --set bgpControlPlane.enabled=true
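After the install, it may help to wait for the agent and operator to finish rolling out before scheduling workloads. This is a rough sketch: it assumes the chart's default resource names (DaemonSet cilium and Deployment cilium-operator in kube-system), and the function name wait_for_cilium and the default timeout are hypothetical.

```shell
# Sketch: block until the Cilium agent and operator rollouts complete.
# Usage: wait_for_cilium [timeout]   (timeout defaults to 5m)
wait_for_cilium() {
  local timeout=${1:-5m}
  # Stop at the first rollout that fails or times out.
  kubectl -n kube-system rollout status daemonset/cilium --timeout="$timeout" &&
    kubectl -n kube-system rollout status deployment/cilium-operator --timeout="$timeout"
}
```

A non-zero exit status means at least one of the two rollouts did not become ready in time, in which case the kernel log on the nodes is worth checking for the dead-loop message again.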