microsoft / retina

eBPF distributed networking observability tool for Kubernetes
https://retina.sh
MIT License

retina-agent pod initialization failed #153

Closed wenhuwang closed 5 months ago

wenhuwang commented 5 months ago

Describe the bug

Installation command: make helm-install-with-operator

The retina-agent pod status is as follows:

# k -n kube-system get pods retina-agent-5lwhj
NAME                 READY   STATUS                  RESTARTS        AGE
retina-agent-5lwhj   0/1     Init:CrashLoopBackOff   6 (3m37s ago)   11m

The init-retina container logs are:

# k -n kube-system logs retina-agent-5lwhj init-retina
ts=2024-03-27T01:19:37.004Z level=info caller=bpf/setup_linux.go:62 msg="BPF filesystem mounted successfully" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=SHTL165006033 podname= version=v0.0.1 path=/sys/fs/bpf
ts=2024-03-27T01:19:37.004Z level=info caller=bpf/setup_linux.go:69 msg="Deleted existing filter map file" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=SHTL165006033 podname= version=v0.0.1 path=/sys/fs/bpf Map name=retina_filter_map
ts=2024-03-27T01:19:37.004Z level=error caller=filter/filter_map_linux.go:54 msg="loadFiltermanagerObjects failed" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=SHTL165006033 podname= version=v0.0.1 error="field RetinaFilterMap: map retina_filter_map: map create: operation not permitted (MEMLOCK may be too low, consider rlimit.RemoveMemlock)"
ts=2024-03-27T01:19:37.005Z level=panic caller=bpf/setup_linux.go:75 msg="Failed to initialize filter map" goversion=go1.21.8 os=linux arch=amd64 numcores=48 hostname=SHTL165006033 podname= version=v0.0.1 error="field RetinaFilterMap: map retina_filter_map: map create: operation not permitted (MEMLOCK may be too low, consider rlimit.RemoveMemlock)"
panic: Failed to initialize filter map [recovered]
    panic: Failed to initialize filter map

goroutine 1 [running]:
github.com/microsoft/retina/pkg/telemetry.TrackPanic()
    /go/src/github.com/microsoft/retina/pkg/telemetry/telemetry.go:112 +0x209
panic({0xb338a0?, 0xc000219130?})
    /usr/local/go/src/runtime/panic.go:914 +0x21f
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x1?, 0x1?, {0x0?, 0x0?, 0xc00013bb60?})
    /go/pkg/mod/go.uber.org/zap@v1.26.0/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00024e0d0, {0xc00023a9c0, 0x1, 0x1})
    /go/pkg/mod/go.uber.org/zap@v1.26.0/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xd6b9e0?, {0xc69630?, 0xd6b900?}, {0xc00023a9c0, 0x1, 0x1})
    /go/pkg/mod/go.uber.org/zap@v1.26.0/logger.go:284 +0x51
github.com/microsoft/retina/pkg/bpf.Setup(0xc000593ec8)
    /go/src/github.com/microsoft/retina/pkg/bpf/setup_linux.go:75 +0x6e5
main.main()
    /go/src/github.com/microsoft/retina/init/retina/main_linux.go:33 +0x214

Expected behavior: the retina-agent pod status is normal.

Platform (please complete the following information):

wenhuwang commented 5 months ago

I think we need to allow the current process to lock memory for eBPF resources (i.e. remove the memlock rlimit, as the error message suggests). If this solution looks acceptable, I can work on this issue.
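
For reference, a minimal self-contained sketch (not Retina's actual code; the program layout is only illustrative) of what the error message points at: call rlimit.RemoveMemlock from github.com/cilium/ebpf/rlimit before any BPF maps are created.

    package main

    import (
        "log"

        "github.com/cilium/ebpf/rlimit"
    )

    func main() {
        // On kernels < 5.11, BPF map and program memory is charged against
        // RLIMIT_MEMLOCK, so the limit has to be lifted before loading objects.
        if err := rlimit.RemoveMemlock(); err != nil {
            log.Fatalf("failed to remove memlock rlimit: %v", err)
        }
        // ... create filter maps / load eBPF objects here ...
    }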

vakalapa commented 5 months ago

@wenhuwang is it possible to post the DaemonSet you applied as YAML here? I want to see the permissions applied on the init container.

parkjeongryul commented 5 months ago

Same here. I installed in basic mode with make helm-install.

$ k logs retina-agent-7n7xc -n kube-system -c init-retina

ts=2024-03-27T16:04:56.840Z level=info caller=bpf/setup_linux.go:62 msg="BPF filesystem mounted successfully" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbc podname= version=v0.0.1 path=/sys/fs/bpf
ts=2024-03-27T16:04:56.840Z level=info caller=bpf/setup_linux.go:69 msg="Deleted existing filter map file" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbc podname= version=v0.0.1 path=/sys/fs/bpf Map name=retina_filter_map
ts=2024-03-27T16:04:56.841Z level=error caller=filter/filter_map_linux.go:54 msg="loadFiltermanagerObjects failed" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbc podname= version=v0.0.1 error="field RetinaFilterMap: map retina_filter_map: map create: operation not permitted (MEMLOCK may be too low, consider rlimit.RemoveMemlock)"
ts=2024-03-27T16:04:56.841Z level=panic caller=bpf/setup_linux.go:75 msg="Failed to initialize filter map" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbc podname= version=v0.0.1 error="field RetinaFilterMap: map retina_filter_map: map create: operation not permitted (MEMLOCK may be too low, consider rlimit.RemoveMemlock)"
panic: Failed to initialize filter map [recovered]
    panic: Failed to initialize filter map

goroutine 1 [running]:
github.com/microsoft/retina/pkg/telemetry.TrackPanic()
    /go/src/github.com/microsoft/retina/pkg/telemetry/telemetry.go:112 +0x209
panic({0xb338a0?, 0xc000231180?})
    /usr/local/go/src/runtime/panic.go:914 +0x21f
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x1?, 0x1?, {0x0?, 0x0?, 0xc0000bfb80?})
    /go/pkg/mod/go.uber.org/zap@v1.26.0/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000252340, {0xc00024c9c0, 0x1, 0x1})
    /go/pkg/mod/go.uber.org/zap@v1.26.0/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xd6b9e0?, {0xc69630?, 0xd6b900?}, {0xc00024c9c0, 0x1, 0x1})
    /go/pkg/mod/go.uber.org/zap@v1.26.0/logger.go:284 +0x51
github.com/microsoft/retina/pkg/bpf.Setup(0xc000291ec8)
    /go/src/github.com/microsoft/retina/pkg/bpf/setup_linux.go:75 +0x6e5
main.main()

$ k get ds -n kube-system retina-agent -o yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
    meta.helm.sh/release-name: retina
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2024-03-27T16:00:51Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
    k8s-app: retina
  name: retina-agent
  namespace: kube-system
  resourceVersion: "8080926"
  uid: ee956376-f701-4d2f-bfc2-0055c5c48a0b
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: retina
  template:
    metadata:
      annotations:
        checksum/config: 1aa5dfa2b1c3bc86cd80d7e983d27ffc4668458df1a51541f906e4827abc2e62
        prometheus.io/port: "10093"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app: retina
        k8s-app: retina
    spec:
      containers:
      - args:
        - --health-probe-bind-address=:18081
        - --metrics-bind-address=:18080
        - --config
        - /retina/config/config.yaml
        command:
        - /retina/controller
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: NODE_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        image: ghcr.io/microsoft/retina/retina-agent:v0.0.1
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /metrics
            port: 10093
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: retina
        ports:
        - containerPort: 10093
          hostPort: 10093
          protocol: TCP
        resources:
          limits:
            cpu: 500m
            memory: 300Mi
        securityContext:
          capabilities:
            add:
            - SYS_ADMIN
            - SYS_RESOURCE
            - NET_ADMIN
            - IPC_LOCK
          privileged: false
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /sys/fs/bpf
          name: bpf
        - mountPath: /sys/fs/cgroup
          name: cgroup
        - mountPath: /retina/config
          name: config
        - mountPath: /sys/kernel/debug
          name: debug
        - mountPath: /tmp
          name: tmp
        - mountPath: /sys/kernel/tracing
          name: trace
      dnsPolicy: ClusterFirst
      hostNetwork: true
      initContainers:
      - image: ghcr.io/microsoft/retina/retina-init:v0.0.1
        imagePullPolicy: Always
        name: init-retina
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /sys/fs/bpf
          mountPropagation: Bidirectional
          name: bpf
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: retina-agent
      serviceAccountName: retina-agent
      terminationGracePeriodSeconds: 90
      volumes:
      - hostPath:
          path: /sys/fs/bpf
          type: ""
        name: bpf
      - hostPath:
          path: /sys/fs/cgroup
          type: ""
        name: cgroup
      - configMap:
          defaultMode: 420
          name: retina-config
        name: config
      - hostPath:
          path: /sys/kernel/debug
          type: ""
        name: debug
      - emptyDir: {}
        name: tmp
      - hostPath:
          path: /sys/kernel/tracing
          type: ""
        name: trace
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 10
  desiredNumberScheduled: 10
  numberMisscheduled: 0
  numberReady: 0
  numberUnavailable: 10
  observedGeneration: 1
  updatedNumberScheduled: 10

Platform (please complete the following information):

jimassa commented 5 months ago

This bug is caused by the same issue as https://github.com/microsoft/retina/issues/115. It is happening on arm64 nodes and is caused by a "fork/exec /bin/clang: no such file or directory" error when trying to reconcile the dropreason plugin.

rbtr commented 5 months ago

This bug is caused by the same issue as https://github.com/microsoft/retina/issues/115. It is happening on arm64 nodes and is caused by a "fork/exec /bin/clang: no such file or directory" error when trying to reconcile the dropreason plugin.

The clang issue was fixed in https://github.com/microsoft/retina/pull/133 before v0.0.2. Is the solution here just to upgrade to v0.0.2?

parkjeongryul commented 5 months ago

The architecture of our nodes is amd64.

$ k get nodes jrpark-w-4hb7 -o yaml | grep architecture
    architecture: amd64

The clang issue was fixed in https://github.com/microsoft/retina/pull/133 before v0.0.2. Is the solution here just to upgrade to v0.0.2?

I just tried upgrading to v0.0.2 and it didn't fix the issue.

$ k get ds retina-agent -n kube-system -o yaml | grep image:
        image: ghcr.io/microsoft/retina/retina-agent:v0.0.2
      - image: ghcr.io/microsoft/retina/retina-init:v0.0.2
$ k logs retina-agent-2jnf6 -n kube-system -c init-retina

ts=2024-03-27T23:47:36.410Z level=info caller=bpf/setup_linux.go:62 msg="BPF filesystem mounted successfully" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbd podname= version=v0.0.2 path=/sys/fs/bpf
ts=2024-03-27T23:47:36.410Z level=info caller=bpf/setup_linux.go:69 msg="Deleted existing filter map file" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbd podname= version=v0.0.2 path=/sys/fs/bpf Map name=retina_filter_map
ts=2024-03-27T23:47:36.411Z level=error caller=filter/filter_map_linux.go:54 msg="loadFiltermanagerObjects failed" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbd podname= version=v0.0.2 error="field RetinaFilterMap: map retina_filter_map: map create: operation not permitted (MEMLOCK may be too low, consider rlimit.RemoveMemlock)"
ts=2024-03-27T23:47:36.411Z level=panic caller=bpf/setup_linux.go:75 msg="Failed to initialize filter map" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbd podname= version=v0.0.2 error="field RetinaFilterMap: map retina_filter_map: map create: operation not permitted (MEMLOCK may be too low, consider rlimit.RemoveMemlock)"
panic: Failed to initialize filter map

goroutine 1 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x1?, 0x1?, {0x0?, 0x0?, 0xc0000bfb60?})
    /go/pkg/mod/go.uber.org/zap@v1.26.0/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000264000, {0xc00024c9c0, 0x1, 0x1})
    /go/pkg/mod/go.uber.org/zap@v1.26.0/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xd6ba00?, {0xc69630?, 0xd6b920?}, {0xc00024c9c0, 0x1, 0x1})
    /go/pkg/mod/go.uber.org/zap@v1.26.0/logger.go:284 +0x51
github.com/microsoft/retina/pkg/bpf.Setup(0xc0001e9ec0)
    /go/src/github.com/microsoft/retina/pkg/bpf/setup_linux.go:75 +0x6e5
main.main()
    /go/src/github.com/microsoft/retina/init/retina/main_linux.go:40 +0x24a

wenhuwang commented 5 months ago

@wenhuwang is it possible to post the DaemonSet you applied as YAML here? I want to see the permissions applied on the init container.

@vakalapa here is the retina-agent DaemonSet YAML:

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: retina-agent
  namespace: kube-system
  labels:
    app.kubernetes.io/managed-by: Helm
    k8s-app: retina
  annotations:
    deprecated.daemonset.template.generation: '20'
    field.cattle.io/publicEndpoints: 'null'
    meta.helm.sh/release-name: retina
    meta.helm.sh/release-namespace: kube-system
spec:
  selector:
    matchLabels:
      app: retina
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: retina
        k8s-app: retina
      annotations:
        checksum/config: 48f843d88ced90f531a61ed0ee4f1e0f9bf256a47ac281655788542bf0f520fb
        kubesphere.io/restartedAt: '2024-03-27T09:42:56.531Z'
        prometheus.io/port: '10093'
        prometheus.io/scrape: 'true'
    spec:
      volumes:
        - name: bpf
          hostPath:
            path: /sys/fs/bpf
            type: ''
        - name: cgroup
          hostPath:
            path: /sys/fs/cgroup
            type: ''
        - name: config
          configMap:
            name: retina-config
            defaultMode: 420
        - name: debug
          hostPath:
            path: /sys/kernel/debug
            type: ''
        - name: tmp
          emptyDir: {}
        - name: trace
          hostPath:
            path: /sys/kernel/tracing
            type: ''
      initContainers:
        - name: init-retina
          image: '*****/retina-init:v0.0.1'
          resources: {}
          volumeMounts:
            - name: bpf
              mountPath: /sys/fs/bpf
              mountPropagation: Bidirectional
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: FallbackToLogsOnError
          imagePullPolicy: Always
          securityContext:
            privileged: true
      containers:
        - name: retina
          image: '***/retina-agent:v0.0.1'
          command:
            - /retina/controller
          args:
            - '--health-probe-bind-address=:18081'
            - '--metrics-bind-address=:18080'
            - '--config'
            - /retina/config/config.yaml
          ports:
            - hostPort: 10093
              containerPort: 10093
              protocol: TCP
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: NODE_IP
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.hostIP
          resources:
            limits:
              cpu: 500m
              memory: 300Mi
          volumeMounts:
            - name: bpf
              mountPath: /sys/fs/bpf
            - name: cgroup
              mountPath: /sys/fs/cgroup
            - name: config
              mountPath: /retina/config
            - name: debug
              mountPath: /sys/kernel/debug
            - name: tmp
              mountPath: /tmp
            - name: trace
              mountPath: /sys/kernel/tracing
          livenessProbe:
            httpGet:
              path: /metrics
              port: 10093
              scheme: HTTP
            initialDelaySeconds: 30
            timeoutSeconds: 1
            periodSeconds: 30
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: Always
          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
                - SYS_RESOURCE
                - NET_ADMIN
                - IPC_LOCK
            privileged: false
      restartPolicy: Always
      terminationGracePeriodSeconds: 90
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: retina-agent
      serviceAccount: retina-agent
      hostNetwork: true
      securityContext: {}

I have solved this issue by removing the locked-memory limit for eBPF resources: add the following code at https://github.com/microsoft/retina/blob/main/pkg/plugin/filter/filter_map_linux.go#L46

    // note: needs `import "github.com/cilium/ebpf/rlimit"`; on kernels >= 5.11
    // with memcg-based BPF accounting this should effectively be a no-op.
    if err := rlimit.RemoveMemlock(); err != nil {
        f.l.Error("remove memlock failed", zap.Error(err))
        return f, err
    }

Could you please assign this issue to me?

snguyen64 commented 5 months ago

We would most likely see this error on kernel versions < 5.11, where BPF map memory is still charged against RLIMIT_MEMLOCK, and should be able to reproduce it there. See https://pkg.go.dev/github.com/cilium/ebpf/rlimit
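
A rough way to reproduce this under that assumption (arbitrary map parameters, not Retina's actual filter map spec; needs root or CAP_BPF/CAP_SYS_ADMIN): shrink RLIMIT_MEMLOCK and try to create a small map with cilium/ebpf. On kernels < 5.11 the create should fail with "operation not permitted".

    package main

    import (
        "fmt"

        "github.com/cilium/ebpf"
        "golang.org/x/sys/unix"
    )

    func main() {
        // Artificially shrink RLIMIT_MEMLOCK to mimic a constrained node.
        _ = unix.Setrlimit(unix.RLIMIT_MEMLOCK, &unix.Rlimit{Cur: 1, Max: 1})

        // Any small map will do; these sizes are arbitrary.
        _, err := ebpf.NewMap(&ebpf.MapSpec{
            Type:       ebpf.Hash,
            KeySize:    4,
            ValueSize:  4,
            MaxEntries: 1024,
        })
        fmt.Println("map create:", err) // expect EPERM on kernels < 5.11
    }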