Closed: wenhuwang closed this issue 5 months ago.
I think we need to allow the current process to lock memory for eBPF resources. If there is no problem with this solution, I can handle this issue.
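Concretely, what I have in mind is raising (or removing) the process's RLIMIT_MEMLOCK before the eBPF maps are created. A rough sketch of the idea using golang.org/x/sys/unix directly (removeMemlockLimit is just an illustrative name, not existing Retina code; the cilium/ebpf rlimit helper mentioned in the error message achieves the same thing):

import "golang.org/x/sys/unix"

// removeMemlockLimit lifts the locked-memory cap for the current process so
// that BPF map allocations are no longer rejected with EPERM on older kernels.
func removeMemlockLimit() error {
    limit := unix.Rlimit{Cur: unix.RLIM_INFINITY, Max: unix.RLIM_INFINITY}
    return unix.Setrlimit(unix.RLIMIT_MEMLOCK, &limit)
}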
@wenhuwang is it possible to post the DaemonSet you applied as YAML here? I want to see the permissions applied on the init container.
Same here.
I installed in basic mode with make helm-install.
$ k logs retina-agent-7n7xc -n kube-system -c init-retina
ts=2024-03-27T16:04:56.840Z level=info caller=bpf/setup_linux.go:62 msg="BPF filesystem mounted successfully" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbc podname= version=v0.0.1 path=/sys/fs/bpf
ts=2024-03-27T16:04:56.840Z level=info caller=bpf/setup_linux.go:69 msg="Deleted existing filter map file" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbc podname= version=v0.0.1 path=/sys/fs/bpf Map name=retina_filter_map
ts=2024-03-27T16:04:56.841Z level=error caller=filter/filter_map_linux.go:54 msg="loadFiltermanagerObjects failed" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbc podname= version=v0.0.1 error="field RetinaFilterMap: map retina_filter_map: map create: operation not permitted (MEMLOCK may be too low, consider rlimit.RemoveMemlock)"
ts=2024-03-27T16:04:56.841Z level=panic caller=bpf/setup_linux.go:75 msg="Failed to initialize filter map" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbc podname= version=v0.0.1 error="field RetinaFilterMap: map retina_filter_map: map create: operation not permitted (MEMLOCK may be too low, consider rlimit.RemoveMemlock)"
panic: Failed to initialize filter map [recovered]
    panic: Failed to initialize filter map

goroutine 1 [running]:
github.com/microsoft/retina/pkg/telemetry.TrackPanic()
    /go/src/github.com/microsoft/retina/pkg/telemetry/telemetry.go:112 +0x209
panic({0xb338a0?, 0xc000231180?})
    /usr/local/go/src/runtime/panic.go:914 +0x21f
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x1?, 0x1?, {0x0?, 0x0?, 0xc0000bfb80?})
    /go/pkg/mod/go.uber.org/zap@v1.26.0/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000252340, {0xc00024c9c0, 0x1, 0x1})
    /go/pkg/mod/go.uber.org/zap@v1.26.0/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xd6b9e0?, {0xc69630?, 0xd6b900?}, {0xc00024c9c0, 0x1, 0x1})
    /go/pkg/mod/go.uber.org/zap@v1.26.0/logger.go:284 +0x51
github.com/microsoft/retina/pkg/bpf.Setup(0xc000291ec8)
    /go/src/github.com/microsoft/retina/pkg/bpf/setup_linux.go:75 +0x6e5
main.main()
$ k get ds -n kube-system retina-agent -o yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "1"
    meta.helm.sh/release-name: retina
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2024-03-27T16:00:51Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
    k8s-app: retina
  name: retina-agent
  namespace: kube-system
  resourceVersion: "8080926"
  uid: ee956376-f701-4d2f-bfc2-0055c5c48a0b
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: retina
  template:
    metadata:
      annotations:
        checksum/config: 1aa5dfa2b1c3bc86cd80d7e983d27ffc4668458df1a51541f906e4827abc2e62
        prometheus.io/port: "10093"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app: retina
        k8s-app: retina
    spec:
      containers:
      - args:
        - --health-probe-bind-address=:18081
        - --metrics-bind-address=:18080
        - --config
        - /retina/config/config.yaml
        command:
        - /retina/controller
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: NODE_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        image: ghcr.io/microsoft/retina/retina-agent:v0.0.1
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /metrics
            port: 10093
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 1
        name: retina
        ports:
        - containerPort: 10093
          hostPort: 10093
          protocol: TCP
        resources:
          limits:
            cpu: 500m
            memory: 300Mi
        securityContext:
          capabilities:
            add:
            - SYS_ADMIN
            - SYS_RESOURCE
            - NET_ADMIN
            - IPC_LOCK
          privileged: false
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /sys/fs/bpf
          name: bpf
        - mountPath: /sys/fs/cgroup
          name: cgroup
        - mountPath: /retina/config
          name: config
        - mountPath: /sys/kernel/debug
          name: debug
        - mountPath: /tmp
          name: tmp
        - mountPath: /sys/kernel/tracing
          name: trace
      dnsPolicy: ClusterFirst
      hostNetwork: true
      initContainers:
      - image: ghcr.io/microsoft/retina/retina-init:v0.0.1
        imagePullPolicy: Always
        name: init-retina
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /sys/fs/bpf
          mountPropagation: Bidirectional
          name: bpf
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: retina-agent
      serviceAccountName: retina-agent
      terminationGracePeriodSeconds: 90
      volumes:
      - hostPath:
          path: /sys/fs/bpf
          type: ""
        name: bpf
      - hostPath:
          path: /sys/fs/cgroup
          type: ""
        name: cgroup
      - configMap:
          defaultMode: 420
          name: retina-config
        name: config
      - hostPath:
          path: /sys/kernel/debug
          type: ""
        name: debug
      - emptyDir: {}
        name: tmp
      - hostPath:
          path: /sys/kernel/tracing
          type: ""
        name: trace
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 10
  desiredNumberScheduled: 10
  numberMisscheduled: 0
  numberReady: 0
  numberUnavailable: 10
  observedGeneration: 1
  updatedNumberScheduled: 10
This bug is caused by the same issue as https://github.com/microsoft/retina/issues/115.
It is happening on arm64 nodes and is caused by a "fork/exec /bin/clang: no such file or directory" error when trying to reconcile the dropreason plugin.
The clang issue was fixed here before v0.0.2. Is the solution here just to upgrade to v0.0.2?
The architecture of our nodes is amd64.
$ k get nodes jrpark-w-4hb7 -o yaml | grep architecture
architecture: amd64
The clang issue was fixed in https://github.com/microsoft/retina/pull/133 before v0.0.2. Is the solution here just to upgrade to v0.0.2?
I just tried upgrading to v0.0.2 and it didn't fix the issue.
$ k get ds retina-agent -n kube-system -o yaml | grep image:
image: ghcr.io/microsoft/retina/retina-agent:v0.0.2
- image: ghcr.io/microsoft/retina/retina-init:v0.0.2
$ k logs retina-agent-2jnf6 -n kube-system -c init-retina
ts=2024-03-27T23:47:36.410Z level=info caller=bpf/setup_linux.go:62 msg="BPF filesystem mounted successfully" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbd podname= version=v0.0.2 path=/sys/fs/bpf
ts=2024-03-27T23:47:36.410Z level=info caller=bpf/setup_linux.go:69 msg="Deleted existing filter map file" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbd podname= version=v0.0.2 path=/sys/fs/bpf Map name=retina_filter_map
ts=2024-03-27T23:47:36.411Z level=error caller=filter/filter_map_linux.go:54 msg="loadFiltermanagerObjects failed" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbd podname= version=v0.0.2 error="field RetinaFilterMap: map retina_filter_map: map create: operation not permitted (MEMLOCK may be too low, consider rlimit.RemoveMemlock)"
ts=2024-03-27T23:47:36.411Z level=panic caller=bpf/setup_linux.go:75 msg="Failed to initialize filter map" goversion=go1.21.8 os=linux arch=amd64 numcores=32 hostname=jrpark-w-4hbd podname= version=v0.0.2 error="field RetinaFilterMap: map retina_filter_map: map create: operation not permitted (MEMLOCK may be too low, consider rlimit.RemoveMemlock)"
panic: Failed to initialize filter map

goroutine 1 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x1?, 0x1?, {0x0?, 0x0?, 0xc0000bfb60?})
    /go/pkg/mod/go.uber.org/zap@v1.26.0/zapcore/entry.go:196 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000264000, {0xc00024c9c0, 0x1, 0x1})
    /go/pkg/mod/go.uber.org/zap@v1.26.0/zapcore/entry.go:262 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xd6ba00?, {0xc69630?, 0xd6b920?}, {0xc00024c9c0, 0x1, 0x1})
    /go/pkg/mod/go.uber.org/zap@v1.26.0/logger.go:284 +0x51
github.com/microsoft/retina/pkg/bpf.Setup(0xc0001e9ec0)
    /go/src/github.com/microsoft/retina/pkg/bpf/setup_linux.go:75 +0x6e5
main.main()
    /go/src/github.com/microsoft/retina/init/retina/main_linux.go:40 +0x24a
@vakalapa here is the retina-agent DaemonSet YAML:
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: retina-agent
  namespace: kube-system
  labels:
    app.kubernetes.io/managed-by: Helm
    k8s-app: retina
  annotations:
    deprecated.daemonset.template.generation: '20'
    field.cattle.io/publicEndpoints: 'null'
    meta.helm.sh/release-name: retina
    meta.helm.sh/release-namespace: kube-system
spec:
  selector:
    matchLabels:
      app: retina
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: retina
        k8s-app: retina
      annotations:
        checksum/config: 48f843d88ced90f531a61ed0ee4f1e0f9bf256a47ac281655788542bf0f520fb
        kubesphere.io/restartedAt: '2024-03-27T09:42:56.531Z'
        prometheus.io/port: '10093'
        prometheus.io/scrape: 'true'
    spec:
      volumes:
        - name: bpf
          hostPath:
            path: /sys/fs/bpf
            type: ''
        - name: cgroup
          hostPath:
            path: /sys/fs/cgroup
            type: ''
        - name: config
          configMap:
            name: retina-config
            defaultMode: 420
        - name: debug
          hostPath:
            path: /sys/kernel/debug
            type: ''
        - name: tmp
          emptyDir: {}
        - name: trace
          hostPath:
            path: /sys/kernel/tracing
            type: ''
      initContainers:
        - name: init-retina
          image: '*****/retina-init:v0.0.1'
          resources: {}
          volumeMounts:
            - name: bpf
              mountPath: /sys/fs/bpf
              mountPropagation: Bidirectional
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: FallbackToLogsOnError
          imagePullPolicy: Always
          securityContext:
            privileged: true
      containers:
        - name: retina
          image: '***/retina-agent:v0.0.1'
          command:
            - /retina/controller
          args:
            - '--health-probe-bind-address=:18081'
            - '--metrics-bind-address=:18080'
            - '--config'
            - /retina/config/config.yaml
          ports:
            - hostPort: 10093
              containerPort: 10093
              protocol: TCP
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: NODE_IP
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.hostIP
          resources:
            limits:
              cpu: 500m
              memory: 300Mi
          volumeMounts:
            - name: bpf
              mountPath: /sys/fs/bpf
            - name: cgroup
              mountPath: /sys/fs/cgroup
            - name: config
              mountPath: /retina/config
            - name: debug
              mountPath: /sys/kernel/debug
            - name: tmp
              mountPath: /tmp
            - name: trace
              mountPath: /sys/kernel/tracing
          livenessProbe:
            httpGet:
              path: /metrics
              port: 10093
              scheme: HTTP
            initialDelaySeconds: 30
            timeoutSeconds: 1
            periodSeconds: 30
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: Always
          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
                - SYS_RESOURCE
                - NET_ADMIN
                - IPC_LOCK
            privileged: false
      restartPolicy: Always
      terminationGracePeriodSeconds: 90
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: retina-agent
      serviceAccount: retina-agent
      hostNetwork: true
      securityContext: {}
I have solved this issue by removing the locked-memory (MEMLOCK) limit for eBPF resources: add the following code at https://github.com/microsoft/retina/blob/main/pkg/plugin/filter/filter_map_linux.go#L46 (it also needs the github.com/cilium/ebpf/rlimit import). A self-contained sketch of the same pattern follows the snippet.

// Remove the RLIMIT_MEMLOCK cap so the eBPF map can be created on older kernels.
if err := rlimit.RemoveMemlock(); err != nil {
    f.l.Error("remove memlock failed", zap.Error(err))
    return f, err
}
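For anyone who wants to see the pattern in isolation, here is a minimal, self-contained sketch (not Retina's actual code): it lifts the MEMLOCK limit with the cilium/ebpf rlimit helper and then creates a small hash map, which is the kind of operation that fails in the logs above. The map spec is a made-up placeholder; Retina's real retina_filter_map is defined in its BPF C sources.

package main

import (
    "log"

    "github.com/cilium/ebpf"
    "github.com/cilium/ebpf/rlimit"
)

func main() {
    // On kernels without memcg-based BPF accounting, map memory counts against
    // RLIMIT_MEMLOCK, so lift the limit before creating any maps.
    if err := rlimit.RemoveMemlock(); err != nil {
        log.Fatalf("remove memlock: %v", err)
    }

    // Placeholder map spec purely for demonstration.
    m, err := ebpf.NewMap(&ebpf.MapSpec{
        Name:       "demo_filter_map",
        Type:       ebpf.Hash,
        KeySize:    4,
        ValueSize:  4,
        MaxEntries: 1024,
    })
    if err != nil {
        log.Fatalf("map create: %v", err)
    }
    defer m.Close()
    log.Println("map created successfully")
}

Commenting out the RemoveMemlock call and running this as root on an affected node should reproduce the same "operation not permitted" failure.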
Could you please assign this issue to me?
We would most likely see this error on kernel versions < 5.11 and should be able to reproduce it there; see https://pkg.go.dev/github.com/cilium/ebpf/rlimit
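To confirm the diagnosis on an affected node, one can inspect the current limit directly; a small illustrative helper (printMemlockLimit is a made-up name, not part of Retina) could look like the sketch below. On kernels >= 5.11, BPF memory is accounted through memcg instead, so this limit stops mattering for map creation.

import (
    "fmt"

    "golang.org/x/sys/unix"
)

// printMemlockLimit reports the calling process's RLIMIT_MEMLOCK soft/hard values.
func printMemlockLimit() error {
    var limit unix.Rlimit
    if err := unix.Getrlimit(unix.RLIMIT_MEMLOCK, &limit); err != nil {
        return err
    }
    fmt.Printf("RLIMIT_MEMLOCK: soft=%d hard=%d\n", limit.Cur, limit.Max)
    return nil
}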
Describe the bug
installation commands:
make helm-install-with-operator
retina-agent pod status is as follows:
init-retina container logs are:
Expected behavior
retina-agent pod status is normal.
Platform (please complete the following information):