projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

CrashLoopBackoff with mount-bpffs init container, calico v3.23.2, operator v1.27.7 #6279

Closed: techmunk closed this issue 2 years ago

techmunk commented 2 years ago

I'm currently spinning up fresh Talos Linux clusters. I'm installing the CNI layer from the manifests at https://projectcalico.docs.tigera.io/archive/v3.23/manifests/tigera-operator.yaml, which, as of a day or two ago, uses operator version v1.27.7 and installs Calico v3.23.2.

Along with the CNI manifest above, I'm also adding the following inlineManifests (replace the kubaapi_ip and pod_subnet vars):

apiVersion: crd.projectcalico.org/v1
kind: FelixConfiguration
metadata:
  name: default
spec:
  wireguardEnabled: true
---
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    linuxDataplane: BPF
    ipPools:
      - blockSize: 26
        cidr: {{ pod_subnet }}
        encapsulation: VXLANCrossSubnet
        natOutgoing: Enabled
        nodeSelector: all()
---
apiVersion: operator.tigera.io/v1
kind: APIServer
metadata:
  name: default
spec: {}

Deploying this results in the following errors in the mount-bpffs container:

kubectl logs --container=mount-bpffs -n calico-system calico-node-hmv2l
2022-06-27 03:20:48.241 [INFO][1] init/startup.go 425: Early log level set to info
2022-06-27 03:20:48.242 [INFO][1] init/calico-init_linux.go 58: Checking if BPF filesystem is mounted.
2022-06-27 03:20:48.242 [INFO][1] init/calico-init_linux.go 70: BPF filesystem is mounted.
2022-06-27 03:20:48.243 [INFO][1] init/calico-init_linux.go 95: Checking if Cgroup2 filesystem is mounted.
2022-06-27 03:20:48.243 [ERROR][1] init/calico-init_linux.go 49: Failed to mount cgroup2 filesystem. error=failed to open /initproc/mountinfo: open /initproc/mountinfo: no such file or directory

Looking deeper into this, it seems that the mount changes recently introduced by https://github.com/projectcalico/calico/pull/6240/files have not made it into the release-1.27 branch of the operator, but are on master via https://github.com/tigera/operator/pull/1957.

I then switched to installing the CNI via https://raw.githubusercontent.com/projectcalico/calico/master/manifests/tigera-operator.yaml, which uses the master version of the operator. This fails with a slightly different message:

2022-06-27 02:39:41.972 [INFO][1] init/startup.go 425: Early log level set to info
2022-06-27 02:39:41.973 [INFO][1] init/calico-init_linux.go 58: Checking if BPF filesystem is mounted.
2022-06-27 02:39:41.973 [INFO][1] init/calico-init_linux.go 70: BPF filesystem is mounted.
2022-06-27 02:39:41.973 [INFO][1] init/calico-init_linux.go 95: Checking if Cgroup2 filesystem is mounted.
2022-06-27 02:39:41.974 [INFO][1] init/calico-init_linux.go 123: Cgroup2 filesystem is not mounted. Trying to mount it...
2022-06-27 02:39:41.984 [ERROR][1] init/calico-init_linux.go 128: Mouting cgroup2 fs failed. output: [84 114 121 105 110 103 32 116 111 32 109 111 117 110 116 32 114 111 111 116 32 99 103 114 111 117 112 32 102 115 46 10 70 97 105 108 101 100 32 116 111 32 109 111 117 110 116 32 67 103 114 111 117 112 32 102 105 108 101 115 121 115 116 101 109 46 32 101 114 114 58 32 110 111 32 115 117 99 104 32 102 105 108 101 32 111 114 32 100 105 114 101 99 116 111 114 121 10]
2022-06-27 02:39:41.985 [ERROR][1] init/calico-init_linux.go 49: Failed to mount cgroup2 filesystem. error=failed to mount cgroup2 filesystem: exit status 1

The string of numbers appears to be decimal-encoded ASCII, which decodes to:

Trying to mount root cgroup fs.
Failed to mount Cgroup filesystem. err: no such file or directory

Not sure why it's output as it is.
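
Likely this is just Go's default formatting of a byte slice: if the init program logs the captured command output ([]byte) without first converting it to a string, the values print as decimal bytes. A minimal Go sketch of that behavior (a guess at the cause, not the actual Calico logging code):

package main

import "fmt"

func main() {
	// Hypothetical stand-in for the command output captured by the init container.
	out := []byte("Trying to mount root cgroup fs.\nFailed to mount Cgroup filesystem. err: no such file or directory\n")

	// Formatting the slice with %v prints decimal byte values,
	// matching the "[84 114 121 ...]" seen in the log above.
	fmt.Printf("%v\n", out)

	// Converting to string first yields the readable message.
	fmt.Printf("%s\n", string(out))
}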

In summary, there might be two errors here: 1. the current operator version installs a version of Calico for which the manifests do not appear to be correctly generated, and 2. I can't actually get the calico-node pods to start, as the mount-bpffs init container fails, even with versions where I believe this should be supported. I'm not sure if I should've filed this issue here or in the operator repo.

Expected Behavior

The calico node pods would start.

Current Behavior

The calico node pods do not start.

Possible Solution

Steps to Reproduce (for bugs)

  1. Create patch.yaml
    
    - op: add
      path: /cluster/network/cni
      value:
        name: custom
        urls:
          - https://raw.githubusercontent.com/projectcalico/calico/master/manifests/tigera-operator.yaml

Switching out the CNI URL in patch.yaml will switch between master and v1.27.7 of the operator.

This will create a KIND cluster, so Docker will be needed on the test machine. The following will clean up any setup that was made (CTRL+C to exit the cluster creation process):

./talosctl-linux-amd64 cluster destroy

You may also wish to delete ~/.talos, as this directory is also created.

I don't believe this is specific to talos, but that is my current setup.

Reverting to operator version v1.27.5 makes everything "work" again.

Context

I'm currently spinning this up in a test cluster, but am concerned what would happen if I did this in the production cluster.

Your Environment

xpflying commented 2 years ago

I encountered the same problem.

caseydavenport commented 2 years ago

@mazdakn is this because we're missing this PR? https://github.com/tigera/operator/pull/2028

PanKaker commented 2 years ago

+1, confirming the same issue.

I just installed a cluster with kubeadm and performed the quick start installation in operator mode. When I tried to change the mode to eBPF, I got the same error.

mazdakn commented 2 years ago

@caseydavenport yes, exactly. We need to merge in that PR.

caseydavenport commented 2 years ago

@mazdakn merged it. Will cut a new operator release ASAP.

mazdakn commented 2 years ago

Thanks @caseydavenport

caseydavenport commented 2 years ago

Just pushed v1.27.8, which includes the fix.

techmunk commented 2 years ago

@caseydavenport This is still not working with v1.27.8. I'm now getting the same error I got when using the master branch of the operator, as mentioned in my initial post.

2022-06-29 01:15:43.605 [INFO][1] init/startup.go 425: Early log level set to info
2022-06-29 01:15:43.605 [INFO][1] init/calico-init_linux.go 58: Checking if BPF filesystem is mounted.
2022-06-29 01:15:43.605 [INFO][1] init/calico-init_linux.go 70: BPF filesystem is mounted.
2022-06-29 01:15:43.605 [INFO][1] init/calico-init_linux.go 95: Checking if Cgroup2 filesystem is mounted.
2022-06-29 01:15:43.605 [INFO][1] init/calico-init_linux.go 123: Cgroup2 filesystem is not mounted. Trying to mount it...
2022-06-29 01:15:43.607 [ERROR][1] init/calico-init_linux.go 128: Mouting cgroup2 fs failed. output: [84 114 121 105 110 103 32 116 111 32 109 111 117 110 116 32 114 111 111 116 32 99 103 114 111 117 112 32 102 115 46 10 70 97 105 108 101 100 32 116 111 32 109 111 117 110 116 32 67 103 114 111 117 112 32 102 105 108 101 115 121 115 116 101 109 46 32 101 114 114 58 32 110 111 32 115 117 99 104 32 102 105 108 101 32 111 114 32 100 105 114 101 99 116 111 114 121 10]
2022-06-29 01:15:43.607 [ERROR][1] init/calico-init_linux.go 49: Failed to mount cgroup2 filesystem. error=failed to mount cgroup2 filesystem: exit status 1
Error from server (BadRequest): container "install-cni" in pod "calico-node-km22b" is waiting to start: PodInitializing

I've used the manifest from https://projectcalico.docs.tigera.io/archive/v3.23/manifests/tigera-operator.yaml and changed the version to v1.27.8 in the deployment (both the image tag and the env var).

Is there something else I should be trying? Do I have something else wrong?

techmunk commented 2 years ago

Some extra details that might help: kubectl --kubeconfig kc get pod -n calico-system calico-node-km22b -o yaml

apiVersion: v1
kind: Pod
metadata:
  annotations:
    hash.operator.tigera.io/cni-config: c2f0b770793383a73909f6dd179f645be6f6db35
    hash.operator.tigera.io/tigera-ca-private: 7fa6847d245305dbe47dc37fd6288edd46fc845f
  creationTimestamp: "2022-06-29T01:14:26Z"
  generateName: calico-node-
  labels:
    controller-revision-hash: 959f5cc4b
    k8s-app: calico-node
    pod-template-generation: "1"
  name: calico-node-km22b
  namespace: calico-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: calico-node
    uid: fb1a5bcd-45c9-4d73-8b7e-3fa06a14ae24
  resourceVersion: "3879"
  uid: e1361944-de86-482d-9838-e72880a7481a
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - talos-default-worker-1
  containers:
  - env:
    - name: DATASTORE_TYPE
      value: kubernetes
    - name: WAIT_FOR_DATASTORE
      value: "true"
    - name: CLUSTER_TYPE
      value: k8s,operator,bgp
    - name: CALICO_DISABLE_FILE_LOGGING
      value: "false"
    - name: FELIX_DEFAULTENDPOINTTOHOSTACTION
      value: ACCEPT
    - name: FELIX_HEALTHENABLED
      value: "true"
    - name: FELIX_HEALTHPORT
      value: "9099"
    - name: NODENAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: FELIX_TYPHAK8SNAMESPACE
      value: calico-system
    - name: FELIX_TYPHAK8SSERVICENAME
      value: calico-typha
    - name: FELIX_TYPHACAFILE
      value: /etc/pki/tls/certs/tigera-ca-bundle.crt
    - name: FELIX_TYPHACERTFILE
      value: /node-certs/tls.crt
    - name: FELIX_TYPHAKEYFILE
      value: /node-certs/tls.key
    - name: FELIX_TYPHACN
      value: typha-server
    - name: CALICO_MANAGE_CNI
      value: "true"
    - name: CALICO_IPV4POOL_CIDR
      value: 10.244.0.0/16
    - name: CALICO_IPV4POOL_VXLAN
      value: CrossSubnet
    - name: CALICO_IPV4POOL_BLOCK_SIZE
      value: "26"
    - name: CALICO_IPV4POOL_NODE_SELECTOR
      value: all()
    - name: FELIX_BPFENABLED
      value: "true"
    - name: CALICO_NETWORKING_BACKEND
      value: bird
    - name: IP
      value: autodetect
    - name: IP_AUTODETECTION_METHOD
      value: first-found
    - name: IP6
      value: none
    - name: FELIX_IPV6SUPPORT
      value: "false"
    - name: KUBERNETES_SERVICE_HOST
      value: 10.96.0.1
    - name: KUBERNETES_SERVICE_PORT
      value: "443"
    image: docker.io/calico/node:v3.23.2
    imagePullPolicy: IfNotPresent
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/calico-node
          - -shutdown
    livenessProbe:
      failureThreshold: 3
      httpGet:
        host: localhost
        path: /liveness
        port: 9099
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 10
    name: calico-node
    readinessProbe:
      exec:
        command:
        - /bin/calico-node
        - -bird-ready
        - -felix-ready
      failureThreshold: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    resources: {}
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /lib/modules
      name: lib-modules
      readOnly: true
    - mountPath: /run/xtables.lock
      name: xtables-lock
    - mountPath: /var/run/nodeagent
      name: policysync
    - mountPath: /etc/pki/tls/certs/
      name: tigera-ca-bundle
      readOnly: true
    - mountPath: /node-certs
      name: node-certs
      readOnly: true
    - mountPath: /var/run/calico
      name: var-run-calico
    - mountPath: /var/lib/calico
      name: var-lib-calico
    - mountPath: /sys/fs/bpf
      name: bpffs
    - mountPath: /var/log/calico/cni
      name: cni-log-dir
    - mountPath: /host/etc/cni/net.d
      name: cni-net-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-c6wbz
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  initContainers:
  - image: docker.io/calico/pod2daemon-flexvol:v3.23.2
    imagePullPolicy: IfNotPresent
    name: flexvol-driver
    resources: {}
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /host/driver
      name: flexvol-driver-host
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-c6wbz
      readOnly: true
  - command:
    - calico-node
    - -init
    image: docker.io/calico/node:v3.23.2
    imagePullPolicy: IfNotPresent
    name: mount-bpffs
    resources: {}
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /sys/fs
      mountPropagation: Bidirectional
      name: sys-fs
    - mountPath: /initproc
      name: init-proc
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-c6wbz
      readOnly: true
  - command:
    - /opt/cni/bin/install
    env:
    - name: CNI_CONF_NAME
      value: 10-calico.conflist
    - name: SLEEP
      value: "false"
    - name: CNI_NET_DIR
      value: /etc/cni/net.d
    - name: CNI_NETWORK_CONFIG
      valueFrom:
        configMapKeyRef:
          key: config
          name: cni-config
    - name: KUBERNETES_SERVICE_HOST
      value: 10.96.0.1
    - name: KUBERNETES_SERVICE_PORT
      value: "443"
    image: docker.io/calico/cni:v3.23.2
    imagePullPolicy: IfNotPresent
    name: install-cni
    resources: {}
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /host/opt/cni/bin
      name: cni-bin-dir
    - mountPath: /host/etc/cni/net.d
      name: cni-net-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-c6wbz
      readOnly: true
  nodeName: talos-default-worker-1
  nodeSelector:
    kubernetes.io/os: linux
  preemptionPolicy: PreemptLowerPriority
  priority: 2000001000
  priorityClassName: system-node-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: calico-node
  serviceAccountName: calico-node
  terminationGracePeriodSeconds: 5
  tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - effect: NoSchedule
    operator: Exists
  - effect: NoExecute
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    operator: Exists
  volumes:
  - hostPath:
      path: /lib/modules
      type: ""
    name: lib-modules
  - hostPath:
      path: /run/xtables.lock
      type: FileOrCreate
    name: xtables-lock
  - hostPath:
      path: /var/run/nodeagent
      type: DirectoryOrCreate
    name: policysync
  - configMap:
      defaultMode: 420
      name: tigera-ca-bundle
    name: tigera-ca-bundle
  - name: node-certs
    secret:
      defaultMode: 420
      secretName: node-certs
  - hostPath:
      path: /var/run/calico
      type: ""
    name: var-run-calico
  - hostPath:
      path: /var/lib/calico
      type: ""
    name: var-lib-calico
  - hostPath:
      path: /sys/fs
      type: DirectoryOrCreate
    name: sys-fs
  - hostPath:
      path: /sys/fs/bpf
      type: Directory
    name: bpffs
  - hostPath:
      path: /proc/1
      type: ""
    name: init-proc
  - hostPath:
      path: /opt/cni/bin
      type: ""
    name: cni-bin-dir
  - hostPath:
      path: /etc/cni/net.d
      type: ""
    name: cni-net-dir
  - hostPath:
      path: /var/log/calico/cni
      type: ""
    name: cni-log-dir
  - hostPath:
      path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
      type: DirectoryOrCreate
    name: flexvol-driver-host
  - name: kube-api-access-c6wbz
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-06-29T01:14:26Z"
    message: 'containers with incomplete status: [mount-bpffs install-cni]'
    reason: ContainersNotInitialized
    status: "False"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-06-29T01:14:26Z"
    message: 'containers with unready status: [calico-node]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-06-29T01:14:26Z"
    message: 'containers with unready status: [calico-node]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-06-29T01:14:26Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: docker.io/calico/node:v3.23.2
    imageID: ""
    lastState: {}
    name: calico-node
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        reason: PodInitializing
  hostIP: 10.5.0.3
  initContainerStatuses:
  - containerID: containerd://61bb26c0e31cdde87a7740e220b2f6db4d4fae8ee5832895b3d877ad5f8db2ae
    image: docker.io/calico/pod2daemon-flexvol:v3.23.2
    imageID: docker.io/calico/pod2daemon-flexvol@sha256:2df980eccdfd61dae0090f354f82a643747d2f58fbd5b47e1bdade363bcb0e65
    lastState: {}
    name: flexvol-driver
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://61bb26c0e31cdde87a7740e220b2f6db4d4fae8ee5832895b3d877ad5f8db2ae
        exitCode: 0
        finishedAt: "2022-06-29T01:14:36Z"
        reason: Completed
        startedAt: "2022-06-29T01:14:36Z"
  - containerID: containerd://9f7a54353b9b2ecfb6c9c2e383b28eb30deff506fec1eb8c6061db00ca4c0cef
    image: docker.io/calico/node:v3.23.2
    imageID: docker.io/calico/node@sha256:b4ac0660c297b3a582ef2f4a0d7ef86f954ad5497b704b41d82fa99418e7a51e
    lastState:
      terminated:
        containerID: containerd://9f7a54353b9b2ecfb6c9c2e383b28eb30deff506fec1eb8c6061db00ca4c0cef
        exitCode: 3
        finishedAt: "2022-06-29T01:36:17Z"
        reason: Error
        startedAt: "2022-06-29T01:36:17Z"
    name: mount-bpffs
    ready: false
    restartCount: 9
    state:
      waiting:
        message: back-off 5m0s restarting failed container=mount-bpffs pod=calico-node-km22b_calico-system(e1361944-de86-482d-9838-e72880a7481a)
        reason: CrashLoopBackOff
  - image: docker.io/calico/cni:v3.23.2
    imageID: ""
    lastState: {}
    name: install-cni
    ready: false
    restartCount: 0
    state:
      waiting:
        reason: PodInitializing
  phase: Pending
  podIP: 10.5.0.3
  podIPs:
  - ip: 10.5.0.3
  qosClass: BestEffort
  startTime: "2022-06-29T01:14:26Z"

kubectl --kubeconfig kc describe pod -n calico-system calico-node-km22b

Name:                 calico-node-km22b
Namespace:            calico-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 talos-default-worker-1/10.5.0.3
Start Time:           Wed, 29 Jun 2022 11:14:26 +1000
Labels:               controller-revision-hash=959f5cc4b
                      k8s-app=calico-node
                      pod-template-generation=1
Annotations:          hash.operator.tigera.io/cni-config: c2f0b770793383a73909f6dd179f645be6f6db35
                      hash.operator.tigera.io/tigera-ca-private: 7fa6847d245305dbe47dc37fd6288edd46fc845f
Status:               Pending
IP:                   10.5.0.3
IPs:
  IP:           10.5.0.3
Controlled By:  DaemonSet/calico-node
Init Containers:
  flexvol-driver:
    Container ID:   containerd://61bb26c0e31cdde87a7740e220b2f6db4d4fae8ee5832895b3d877ad5f8db2ae
    Image:          docker.io/calico/pod2daemon-flexvol:v3.23.2
    Image ID:       docker.io/calico/pod2daemon-flexvol@sha256:2df980eccdfd61dae0090f354f82a643747d2f58fbd5b47e1bdade363bcb0e65
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 29 Jun 2022 11:14:36 +1000
      Finished:     Wed, 29 Jun 2022 11:14:36 +1000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /host/driver from flexvol-driver-host (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-c6wbz (ro)
  mount-bpffs:
    Container ID:  containerd://b275a1caae90e469dcf1567079db1ad748f45a45d1873cea4a7bf71267c5baf2
    Image:         docker.io/calico/node:v3.23.2
    Image ID:      docker.io/calico/node@sha256:b4ac0660c297b3a582ef2f4a0d7ef86f954ad5497b704b41d82fa99418e7a51e
    Port:          <none>
    Host Port:     <none>
    Command:
      calico-node
      -init
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    3
      Started:      Wed, 29 Jun 2022 11:18:06 +1000
      Finished:     Wed, 29 Jun 2022 11:18:06 +1000
    Ready:          False
    Restart Count:  5
    Environment:    <none>
    Mounts:
      /initproc from init-proc (ro)
      /sys/fs from sys-fs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-c6wbz (ro)
  install-cni:
    Container ID:
    Image:         docker.io/calico/cni:v3.23.2
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/install
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      CNI_CONF_NAME:            10-calico.conflist
      SLEEP:                    false
      CNI_NET_DIR:              /etc/cni/net.d
      CNI_NETWORK_CONFIG:       <set to the key 'config' of config map 'cni-config'>  Optional: false
      KUBERNETES_SERVICE_HOST:  10.96.0.1
      KUBERNETES_SERVICE_PORT:  443
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-c6wbz (ro)
Containers:
  calico-node:
    Container ID:
    Image:          docker.io/calico/node:v3.23.2
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Liveness:       http-get http://localhost:9099/liveness delay=0s timeout=10s period=10s #success=1 #failure=3
    Readiness:      exec [/bin/calico-node -bird-ready -felix-ready] delay=0s timeout=5s period=10s #success=1 #failure=3
    Environment:
      DATASTORE_TYPE:                     kubernetes
      WAIT_FOR_DATASTORE:                 true
      CLUSTER_TYPE:                       k8s,operator,bgp
      CALICO_DISABLE_FILE_LOGGING:        false
      FELIX_DEFAULTENDPOINTTOHOSTACTION:  ACCEPT
      FELIX_HEALTHENABLED:                true
      FELIX_HEALTHPORT:                   9099
      NODENAME:                            (v1:spec.nodeName)
      NAMESPACE:                          calico-system (v1:metadata.namespace)
      FELIX_TYPHAK8SNAMESPACE:            calico-system
      FELIX_TYPHAK8SSERVICENAME:          calico-typha
      FELIX_TYPHACAFILE:                  /etc/pki/tls/certs/tigera-ca-bundle.crt
      FELIX_TYPHACERTFILE:                /node-certs/tls.crt
      FELIX_TYPHAKEYFILE:                 /node-certs/tls.key
      FELIX_TYPHACN:                      typha-server
      CALICO_MANAGE_CNI:                  true
      CALICO_IPV4POOL_CIDR:               10.244.0.0/16
      CALICO_IPV4POOL_VXLAN:              CrossSubnet
      CALICO_IPV4POOL_BLOCK_SIZE:         26
      CALICO_IPV4POOL_NODE_SELECTOR:      all()
      FELIX_BPFENABLED:                   true
      CALICO_NETWORKING_BACKEND:          bird
      IP:                                 autodetect
      IP_AUTODETECTION_METHOD:            first-found
      IP6:                                none
      FELIX_IPV6SUPPORT:                  false
      KUBERNETES_SERVICE_HOST:            10.96.0.1
      KUBERNETES_SERVICE_PORT:            443
    Mounts:
      /etc/pki/tls/certs/ from tigera-ca-bundle (ro)
      /host/etc/cni/net.d from cni-net-dir (rw)
      /lib/modules from lib-modules (ro)
      /node-certs from node-certs (ro)
      /run/xtables.lock from xtables-lock (rw)
      /sys/fs/bpf from bpffs (rw)
      /var/lib/calico from var-lib-calico (rw)
      /var/log/calico/cni from cni-log-dir (rw)
      /var/run/calico from var-run-calico (rw)
      /var/run/nodeagent from policysync (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-c6wbz (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:
  xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  FileOrCreate
  policysync:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/nodeagent
    HostPathType:  DirectoryOrCreate
  tigera-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      tigera-ca-bundle
    Optional:  false
  node-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  node-certs
    Optional:    false
  var-run-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/calico
    HostPathType:
  var-lib-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/calico
    HostPathType:
  sys-fs:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/fs
    HostPathType:  DirectoryOrCreate
  bpffs:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/fs/bpf
    HostPathType:  Directory
  init-proc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc/1
    HostPathType:
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:
  cni-log-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/calico/cni
    HostPathType:
  flexvol-driver-host:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
    HostPathType:  DirectoryOrCreate
  kube-api-access-c6wbz:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 :NoSchedule op=Exists
                             :NoExecute op=Exists
                             CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  6m4s                   default-scheduler  Successfully assigned calico-system/calico-node-km22b to talos-default-worker-1
  Normal   Pulling    6m4s                   kubelet            Pulling image "docker.io/calico/pod2daemon-flexvol:v3.23.2"
  Normal   Pulled     5m55s                  kubelet            Successfully pulled image "docker.io/calico/pod2daemon-flexvol:v3.23.2" in 9.346252348s
  Normal   Created    5m55s                  kubelet            Created container flexvol-driver
  Normal   Started    5m55s                  kubelet            Started container flexvol-driver
  Normal   Pulling    5m54s                  kubelet            Pulling image "docker.io/calico/node:v3.23.2"
  Normal   Pulled     5m28s                  kubelet            Successfully pulled image "docker.io/calico/node:v3.23.2" in 26.027490637s
  Normal   Created    4m48s (x4 over 5m28s)  kubelet            Created container mount-bpffs
  Normal   Started    4m48s (x4 over 5m28s)  kubelet            Started container mount-bpffs
  Normal   Pulled     4m48s (x3 over 5m27s)  kubelet            Container image "docker.io/calico/node:v3.23.2" already present on machine
  Warning  BackOff    60s (x22 over 5m26s)   kubelet            Back-off restarting failed container

mazdakn commented 2 years ago

@techmunk can you check the list of files under /var/run/calico/? Is there a directory named cgroup?

Also, can you check if cgroupv2 is mounted?

techmunk commented 2 years ago

@mazdakn /var/run/calico does not exist on the host yet. This would not get created until the init containers complete, correct? Unless one of the init containers also mounts it? The mount-bpffs init container is the one that is failing.

/proc/filesystems contains nodev cgroup2, and /sys/fs/cgroup/cgroup.controllers exists, which I believe means cgroupv2 is mounted?

The system works fine with calico v3.23.1.

Is there anything further I could provide to help diagnose this issue? I don't have direct shell access on the host, but I can read just about anything from the filesystem (proc, sys, run, etc.).
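
For reference, the two cgroupv2 checks above can be reproduced with a short Go program. This is just an illustration of the verification steps, not anything Calico runs:

package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// Kernel support for cgroup2 is listed in /proc/filesystems.
	fs, err := os.ReadFile("/proc/filesystems")
	if err == nil {
		fmt.Println("kernel supports cgroup2:", strings.Contains(string(fs), "cgroup2"))
	}

	// The unified (v2) hierarchy exposes cgroup.controllers at its mount root.
	_, err = os.Stat("/sys/fs/cgroup/cgroup.controllers")
	fmt.Println("cgroup2 mounted at /sys/fs/cgroup:", err == nil)
}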

techmunk commented 2 years ago

Some extra info that might help: /proc/1/mountinfo contains the following lines:

33 171 0:28 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime shared:14 - cgroup2 cgroup rw
171 22 0:28 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime shared:14 - cgroup2 cgroup rw

techmunk commented 2 years ago

Looking at https://github.com/projectcalico/calico/blob/0213c7227db0ff10390029a5e5ba8db25cfbe2a1/node/pkg/nodeinit/calico-init_linux.go#L112, I believe that means it should be working with the above mountinfo? I might not be reading the code correctly.

mazdakn commented 2 years ago

That line checks whether the fs type is cgroup2 and the mount point is /var/run/calico/cgroup. We mount the cgroup fs at /var/run/calico/cgroup.
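
For anyone following along, here is a simplified Go sketch of that kind of mountinfo check, based on the description above rather than Calico's exact code; it relies on the mountinfo format where field 4 (0-indexed) is the mount point and the filesystem type is the field immediately after the "-" separator:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// isCgroup2Mounted reports whether the given mountinfo file lists a
// cgroup2 filesystem mounted at mountPoint.
func isCgroup2Mounted(mountinfoPath, mountPoint string) (bool, error) {
	f, err := os.Open(mountinfoPath)
	if err != nil {
		return false, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		// Locate the "-" separator; the fs type follows it.
		sep := -1
		for i, fld := range fields {
			if fld == "-" {
				sep = i
				break
			}
		}
		if sep < 0 || sep+1 >= len(fields) || len(fields) < 5 {
			continue
		}
		if fields[4] == mountPoint && fields[sep+1] == "cgroup2" {
			return true, nil
		}
	}
	return false, scanner.Err()
}

func main() {
	ok, err := isCgroup2Mounted("/proc/self/mountinfo", "/var/run/calico/cgroup")
	fmt.Println(ok, err)
}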

techmunk commented 2 years ago

@mazdakn I did some more digging, and I was able to get to the bottom of the issue I'm facing.

The error I'm getting occurs because the directory /run/calico/cgroup does not exist on the hosts at the time the mount-bpffs init container runs.

Adding the following mounts to a static pod then allows the mountns /run/calico/cgroup command to succeed:

spec:
  initContainers:
    - name: mount-bpffs
      # .....
      volumeMounts:
        - mountPath: /run/calico/cgroup
          mountPropagation: Bidirectional
          name: run-calico-cgroup
      # .....
  volumes:
    - hostPath:
        path: /run/calico/cgroup
        type: ""
      name: run-calico-cgroup

Is this something Calico or the Tigera operator will resolve? Or do I need to ensure the /run/calico/cgroup directory exists on the host prior to installing the operator manifests?

mazdakn commented 2 years ago

@techmunk yes, that is correct. That directory should exist on the host; however, it should be created by the code in Calico. I am looking into this.

mazdakn commented 2 years ago

@techmunk I could reproduce this the way you described and also separately in a kind cluster. This looks to be related to kind clusters. I am working on a fix for it.

caseydavenport commented 2 years ago

@mazdakn do we just need to write the code to ensure that dir exists before trying to access it?

mazdakn commented 2 years ago

@caseydavenport yes, and I am checking why it only happens in kind clusters. Anyway, the fix is either to add a new volume as @techmunk mentioned, or to make sure the directory exists in the new mountns binary. I personally prefer to use mountns for this. The first approach needs a change in the operator, the second one in felix. Do you have any preference?
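
For context, the mountns approach boils down to creating the mount point before mounting, which is exactly the step that was missing. A minimal sketch of that idea (an illustration, not the actual patch; it assumes golang.org/x/sys/unix for the mount syscall):

package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	const cgroupDir = "/var/run/calico/cgroup"

	// Ensure the mount point exists first; its absence is what produced the
	// "no such file or directory" errors earlier in this issue.
	if err := os.MkdirAll(cgroupDir, 0o700); err != nil {
		log.Fatalf("creating %s: %v", cgroupDir, err)
	}

	// Mount the cgroup2 filesystem at the prepared directory.
	if err := unix.Mount("none", cgroupDir, "cgroup2", 0, ""); err != nil {
		log.Fatalf("mounting cgroup2 at %s: %v", cgroupDir, err)
	}
}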

PanKaker commented 2 years ago

+1, the fix didn't help. I've installed a Kubernetes cluster with kubeadm and then installed Calico in operator mode. After switching to eBPF mode, I get the same error.

fasaxc commented 2 years ago

+1, the fix didn't help.

Please can you be more specific: what fix did you try? @mazdakn has a PR up to fix this; was that what you tried (should we hold off on the PR)?

PanKaker commented 2 years ago

Hi! I've decided to downgrade to version 3.23.1. I can't provide any details right now.

BTW: is it somehow possible to get the tigera-operator.yaml manifest for 3.23.1?

caseydavenport commented 2 years ago

BTW: is it somehow possible to get the tigera-operator.yaml manifest for 3.23.1?

There's a bundle attached to the GitHub release: https://github.com/projectcalico/calico/releases/tag/v3.23.1

PanKaker commented 2 years ago

BTW: is it somehow possible to get the tigera-operator.yaml manifest for 3.23.1?

There's a bundle attached to the GitHub release: https://github.com/projectcalico/calico/releases/tag/v3.23.1

Sorry, I meant the tigera-operator.yaml file (same as here: https://projectcalico.docs.tigera.io/getting-started/kubernetes/quickstart), not the Helm chart.

Is it possible to include this file under each release too? Thank you.

caseydavenport commented 2 years ago

@PanKaker tigera-operator.yaml is included within the release artifacts. e.g., this bundle: https://github.com/projectcalico/calico/releases/download/v3.23.1/release-v3.23.1.tgz

release-v3.23.1.tgz contains container images, binaries, and Kubernetes manifests.

PanKaker commented 2 years ago

Hi, thank you for your response. Yes, I found this artifact in the release .tgz, but its size is always around 1 GB, which is quite a big file.

My question: is it possible to put tigera-operator.yaml under each release as a separate file? Thank you.

caseydavenport commented 2 years ago

is it possible to put tigera-operator.yaml under each release as a separate file?

This is a separate enhancement, but will be available starting in v3.24.0. If you want to use the manifest from an earlier release, unfortunately you need to download the tgz I linked above.

PanKaker commented 2 years ago

is it possible to put tigera-operator.yaml under each release as a separate file?

This is a separate enhancement, but will be available starting in v3.24.0. If you want to use the manifest from an earlier release, unfortunately you need to download the tgz I linked above.

That's great. Thank you!

mazdakn commented 2 years ago

This is fixed in v3.23.3, with operator v1.27.12.