microsoft / retina

eBPF distributed networking observability tool for Kubernetes
https://retina.sh
MIT License

agent crashes with loadFiltermanagerObjects failed #463

Open iarlyy opened 5 months ago

iarlyy commented 5 months ago

Describe the bug: retina-agent fails to start and crashes with the following error:

$ kubectl logs retina-agent-mdbr7 -n retina
Defaulted container "retina" out of: retina, init-retina (init)
starting Retina v0.0.12loading config /retina/config/config.yaml
init client-go
init logger
ts=2024-06-11T13:03:17.165Z level=info caller=metrics/metrics.go:169 msg="Metrics initialized"
ts=2024-06-11T13:03:17.165Z level=info caller=controller/main.go:138 msg="telemetry disabled"
ts=2024-06-11T13:03:17.176Z level=info caller=controller/main.go:213 msg="Kubernetes server version: v1.27.13-eks-3af4770"
ts=2024-06-11T13:03:17.177Z level=debug caller=pubsub/pubsub.go:76 msg="subscribed to topic" topic=apiserver uuid=8f6b81ba-eef8-4a70-9d7d-002ac3862fc7
ts=2024-06-11T13:03:17.177Z level=error caller=filter/filter_map_linux.go:61 msg="loadFiltermanagerObjects failed" error="field RetinaFilterMap: map retina_filter_map: load pinned map: permission denied"
ts=2024-06-11T13:03:17.177Z level=error caller=controller/main.go:226 msg="unable to create filter manager{error 26 0  failed to initialize filter map: field RetinaFilterMap: map retina_filter_map: load pinned map: permission denied}"

To Reproduce: steps to reproduce the behavior.

Installation command:

helm template retina oci://ghcr.io/microsoft/retina/charts/retina \
    --version v0.0.12 \
    --namespace retina \
    --set image.tag=v0.0.12 \
    --set operator.tag=v0.0.12 \
    --set logLevel=debug \
    --set os.windows=false \
    --set operator.enabled=false \
    --skip-crds \
    --set enablePodLevel=true \
    --set remoteContext=true \
    --set enabledPlugin_linux="\[packetforward\,linuxutil\,dns\]" | kubectl apply -f -

It seems to be an issue with the pod-level toggle (enablePodLevel). If I set it to false, the pods start normally.

Expected behavior: Clean initialization of retina-agent pods.

Platform (please complete the following information):

Thanks for any help in figuring out what is happening.

anubhabMajumdar commented 5 months ago

@iarlyy Thanks for raising the issue. We have encountered this when the init container fails to create the pinned map that the log mentions. Can you update us with the following information:

iarlyy commented 5 months ago

logs from one of the init-retina containers:

ts=2024-06-17T09:23:40.651Z level=info caller=bpf/setup_linux.go:61 msg="BPF filesystem mounted successfully" version=v0.0.12 path=/sys/fs/bpf
ts=2024-06-17T09:23:40.651Z level=info caller=bpf/setup_linux.go:68 msg="Deleted existing filter map file" version=v0.0.12 path=/sys/fs/bpf Map name=retina_filter_map
ts=2024-06-17T09:23:40.652Z level=info caller=bpf/setup_linux.go:76 msg="Filter map initialized successfully" version=v0.0.12 path=/sys/fs/bpf Map name=retina_filter_map

I will install bpftool in one of the nodes to collect the requested information.

iarlyy commented 5 months ago

@anubhabMajumdar is there an alternative way to get this information? I am just unable to compile bpftool.
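
A partial check that needs no bpftool, sketched under assumptions: the init-retina log above reports the map pinned under /sys/fs/bpf with the name retina_filter_map, so from a shell on the node (or in a container that mounts the BPF filesystem) one can at least see whether that pin exists and is readable by the current user. The helper name `pin_status` and the `PIN` variable are hypothetical, and an "ok" result is not conclusive, because the fix reported later in this thread involved runAsUser and capabilities rather than the file alone.

```shell
#!/bin/sh
# pin_status PATH: classify a pinned-map path as missing, unreadable, or ok.
# "unreadable" refers to the current user; root normally passes this check.
pin_status() {
  if [ ! -e "$1" ]; then
    echo "missing"
  elif [ ! -r "$1" ]; then
    echo "unreadable"
  else
    echo "ok"
  fi
}

# Path assembled from the init-retina log lines (an assumption, override with PIN=...).
PIN="${PIN:-/sys/fs/bpf/retina_filter_map}"
echo "$PIN: $(pin_status "$PIN")"
```

Running it as the same user ID the agent container uses gives the most relevant answer, since a pin created by root may not be readable by other users.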

iarlyy commented 4 months ago

My bad: the runAsUser attribute and the BPF and PERFMON capabilities (not sure if those two are needed) were missing.

          securityContext:
            runAsUser: 0
            capabilities:
              add:
              - SYS_ADMIN
              - SYS_RESOURCE
              - NET_ADMIN
              - IPC_LOCK
              - BPF
              - PERFMON
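
To confirm that a securityContext like the one above actually took effect, one option is to decode the effective capability mask (the CapEff line in /proc/<pid>/status) of the running process. The sketch below is illustrative only: the helper names `has_cap` and `report_caps` are hypothetical, and the bit numbers are the ones defined in linux/capability.h (NET_ADMIN 12, IPC_LOCK 14, SYS_ADMIN 21, SYS_RESOURCE 24, PERFMON 38, BPF 39). BPF and PERFMON were split out of SYS_ADMIN in Linux 5.8, so on older kernels those two bits will not be set.

```shell
#!/bin/sh
# has_cap HEXMASK BIT: succeeds if BIT is set in the hex capability mask.
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# report_caps HEXMASK: print "NAME yes|no" for each capability from the thread.
report_caps() {
  for entry in NET_ADMIN:12 IPC_LOCK:14 SYS_ADMIN:21 SYS_RESOURCE:24 PERFMON:38 BPF:39; do
    name=${entry%%:*}
    bit=${entry##*:}
    if has_cap "$1" "$bit"; then
      echo "$name yes"
    else
      echo "$name no"
    fi
  done
}

# Inspect the current process; run this inside the container to see what the
# agent's security context grants.
if [ -r /proc/self/status ]; then
  report_caps "$(awk '/^CapEff:/ {print $2}' /proc/self/status)"
fi
```

A "no" next to any capability that the manifest adds suggests the securityContext was not applied to that container, or that the runtime dropped the capability.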

timraymond commented 4 months ago

I'm going to reopen this, since we should probably add those capabilities to the manifests if they're not already being added. Otherwise, we should document that additional steps are required. Thanks for your investigation, @iarlyy!