stackrox / collector

Runtime data collection for the StackRox Kubernetes Security Platform using eBPF
Apache License 2.0

CrashLoopBackOff in Collector's DaemonSet on OpenShift 4.9 #1013

Closed by Balaji-MP 1 year ago

Balaji-MP commented 1 year ago

Hello team, I received the following error while deploying Collector on OpenShift 4.9. I initially thought this was a permission issue and added the required SCC to Collector's service account, but the issue persists.

terminate called after throwing an instance of 'scap_open_exception'
  what():  can't create map: Permission denied
collector[0x448f7d]
/lib64/libc.so.6(+0x4eb80)[0x7f726981fb80]
/lib64/libc.so.6(gsignal+0x10f)[0x7f726981faff]
/lib64/libc.so.6(abort+0x127)[0x7f72697f2ea5]
/lib64/libstdc++.so.6(+0x9009b)[0x7f726a1c109b]
/lib64/libstdc++.so.6(+0x9653c)[0x7f726a1c753c]
/lib64/libstdc++.so.6(+0x96597)[0x7f726a1c7597]
/lib64/libstdc++.so.6(+0x967f8)[0x7f726a1c77f8]
/usr/local/lib/libsinsp-wrapper.so(+0x240ef5)[0x7f726c82cef5]
/usr/local/lib/libsinsp-wrapper.so(_ZN5sinsp4openERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x36)[0x7f726c866c16]
collector[0x4d2b34]
collector[0x46631c]
collector[0x442bec]
/lib64/libc.so.6(__libc_start_main+0xe5)[0x7f726980bd85]
collector[0x448e2e]
Caught signal 6 (SIGABRT): Aborted
/bootstrap.sh: line 94:    10 Aborted                 eval exec "$@"

erthalion commented 1 year ago

@Balaji-MP It definitely looks like a lack of permissions to load the eBPF probe. Just in case, could you share the DaemonSet definition and the SecurityContext you ended up with?

Balaji-MP commented 1 year ago

@erthalion here is the definition and the security context within it

apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "4"
    email: support@stackrox.com
    meta.helm.sh/release-name: stackrox-secured-cluster-services
    meta.helm.sh/release-namespace: rhacs-operator
    owner: stackrox
  creationTimestamp: "2023-02-16T08:25:19Z"
  generation: 4
  labels:
    app: collector
    app.kubernetes.io/component: collector
    app.kubernetes.io/instance: stackrox-secured-cluster-services
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: stackrox
    app.kubernetes.io/part-of: stackrox-secured-cluster-services
    app.kubernetes.io/version: 3.73.2
    auto-upgrade.stackrox.io/component: sensor
    helm.sh/chart: stackrox-secured-cluster-services-73.2.0
    service: collector
  name: collector
  namespace: rhacs-operator
  ownerReferences:

erthalion commented 1 year ago

@Balaji-MP any chance you could run `kubectl describe ds collector` to get the events as well?

Balaji-MP commented 1 year ago

@erthalion here are the events; the current state of the pod is CrashLoopBackOff

Events:
  Type    Reason            Age  From                  Message
  Normal  SuccessfulCreate  10s  daemonset-controller  Created pod: collector-jst65
  Normal  SuccessfulCreate  3s   daemonset-controller  Created pod: collector-x86fj

Balaji-MP commented 1 year ago

@erthalion I guess the permission issue is caused by the eval in line 94. I might be wrong; any thoughts on this?

erthalion commented 1 year ago

bootstrap.sh (including the eval part) is only responsible for starting Collector. The issue you observe is happening when Collector tries to load eBPF probes.

Balaji-MP commented 1 year ago

@erthalion any thoughts on this one?

erthalion commented 1 year ago

What happens if you remove this part from the security context?

seLinuxOptions:
  type: container_runtime_t

Balaji-MP commented 1 year ago

Same error; nothing changed.

erthalion commented 1 year ago

@Balaji-MP what about the SCC? You haven't posted it yet. Can you show scc/stackrox-collector?

Balaji-MP commented 1 year ago

@erthalion here is the security context in stackrox-collector

securityContext:
  runAsUser: 1000
  runAsGroup: 3000
  fsGroup: 2000
containers:

erthalion commented 1 year ago

> @erthalion here is the security context in stackrox-collector
>
> securityContext:
>   runAsUser: 1000
>   runAsGroup: 3000
>   fsGroup: 2000
> containers:

There is also a SecurityContextConstraints (SCC) object, which should have more information, e.g. whether privileged containers are allowed. Having said that, can you describe your OpenShift setup in more detail? Is there anything special about it?

Balaji-MP commented 1 year ago

@erthalion here is the SCC applied for this collector

runAsUser:
  type: RunAsAny
seLinuxContext:
  type: RunAsAny
seccompProfiles:

My cluster is standard and no additional restrictions are in place.

Balaji-MP commented 1 year ago

@erthalion can you please share the directory location where Collector will create the map?

erthalion commented 1 year ago

> @erthalion can you please share the directory location where Collector will create the map?

It's a BPF map, so it's not located on the filesystem. The problem here is that your OpenShift setup somehow prevents Collector from executing the bpf syscall; we need to find out why.
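
To illustrate that a BPF map is created through the bpf syscall rather than as a file, here is a minimal Python sketch. It is not Collector's code path (the stack trace earlier in the thread shows the probe being opened through libsinsp). The names `BPF_SYSCALL_NR`, `BpfMapCreateAttr`, `format_bpf_error` and `try_create_bpf_map` are hypothetical, and the syscall numbers cover only x86_64 and aarch64, which is an assumption about where it would be run.

```python
import ctypes
import errno
import os
import platform

# Assumption: bpf(2) syscall numbers for two common Linux architectures.
BPF_SYSCALL_NR = {"x86_64": 321, "aarch64": 280}
BPF_MAP_CREATE = 0       # command number from enum bpf_cmd
BPF_MAP_TYPE_ARRAY = 2   # array map type from enum bpf_map_type


class BpfMapCreateAttr(ctypes.Structure):
    """Leading fields of union bpf_attr used by BPF_MAP_CREATE."""
    _fields_ = [
        ("map_type", ctypes.c_uint32),
        ("key_size", ctypes.c_uint32),
        ("value_size", ctypes.c_uint32),
        ("max_entries", ctypes.c_uint32),
        ("map_flags", ctypes.c_uint32),
    ]


def format_bpf_error(err):
    """Render an errno value as 'NAME (message)'."""
    return f"{errno.errorcode.get(err, str(err))} ({os.strerror(err)})"


def try_create_bpf_map():
    """Attempt to create a one-entry array map and report what happened."""
    nr = BPF_SYSCALL_NR.get(platform.machine())
    if platform.system() != "Linux" or nr is None:
        return "unsupported platform for this sketch"
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    libc.syscall.argtypes = [ctypes.c_long, ctypes.c_int,
                             ctypes.c_void_p, ctypes.c_uint]
    # Array maps require a 4-byte key; the value size here is arbitrary.
    attr = BpfMapCreateAttr(BPF_MAP_TYPE_ARRAY, 4, 8, 1, 0)
    fd = libc.syscall(nr, BPF_MAP_CREATE, ctypes.byref(attr),
                      ctypes.sizeof(attr))
    if fd < 0:
        return "bpf(BPF_MAP_CREATE) failed: " + format_bpf_error(ctypes.get_errno())
    os.close(fd)  # release the map again; nothing was written to disk
    return "map created"
```

Calling `print(try_create_bpf_map())` from a pod that runs with the same security context (and has Python available) would show whether the syscall itself is refused. Printing the errno name also helps tell EACCES ("Permission denied", the text seen in the log above) apart from EPERM ("Operation not permitted").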

> here is the SCC applied for this collector
>
> runAsUser:
>   type: RunAsAny
> seLinuxContext:
>   type: RunAsAny
> seccompProfiles:
> - '*'
> supplementalGroups:
>   type: RunAsAny

This doesn't look complete. Isn't there anything like the following?

allowPrivilegeEscalation: true
allowPrivilegedContainer: true

Balaji-MP commented 1 year ago

@erthalion no, I don't see anything related to allowPrivilegeEscalation or allowPrivilegedContainer.

erthalion commented 1 year ago

> no, I don't see anything related to allowPrivilegeEscalation or allowPrivilegedContainer.

That sounds strange to me. So the output of `oc get scc/stackrox-collector -o yaml` doesn't show anything besides what you've posted?
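
The check being asked for here can be sketched in a few lines of Python, assuming the SCC has been exported as JSON (for example with `-o json` instead of `-o yaml`). The helper name `missing_privilege_fields` and the `sample` document are hypothetical; the sample only mirrors the partial SCC quoted in this thread, not a real export.

```python
import json


def missing_privilege_fields(scc):
    """Return the privilege-related SCC fields that are absent or not true."""
    wanted = ("allowPrivilegeEscalation", "allowPrivilegedContainer")
    return [name for name in wanted if scc.get(name) is not True]


# Hypothetical sample mirroring the partial SCC posted in the thread.
sample = json.loads("""
{
  "runAsUser": {"type": "RunAsAny"},
  "seLinuxContext": {"type": "RunAsAny"},
  "seccompProfiles": ["*"],
  "supplementalGroups": {"type": "RunAsAny"}
}
""")

print(missing_privilege_fields(sample))
# prints ['allowPrivilegeEscalation', 'allowPrivilegedContainer']
```

An empty list would mean both fields are present and set to true; anything else points at the fields worth a closer look.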

Balaji-MP commented 1 year ago

Yes, that's correct

porridge commented 1 year ago

@stackrox/collector-team any updates on this issue?

erthalion commented 1 year ago

Unfortunately not; nobody has had the capacity to look into it further.

porridge commented 1 year ago

@Balaji-MP TBH OpenShift 4.9 is quite dated... it might even be out of support? Would it be feasible for you to upgrade to a more recent version?

Balaji-MP commented 1 year ago

@porridge let me upgrade to the latest version and check. In the meantime, is there a minimum version you would recommend?

porridge commented 1 year ago

4.12 would be my first choice

Balaji-MP commented 1 year ago

@porridge You were right: upgrading to version 4.12 fixed the issue.

porridge commented 1 year ago

Awesome! Let us know if you need anything else.