Container and k8s Resource scan get OOMkilled inside k8s

czunker commented 1 year ago

Describe the bug

When running the container scan and the k8s resource scan inside a k8s cluster, both scans get OOMKilled.

mondoo-client-containers-scan-28222987-v9xmj                      0/1     OOMKilled   0          10m
mondoo-client-k8s-scan-now-mcvfw                                  0/1     Error       0          39s
mondoo-client-scan-api-6bbc458dd9-wkdhl                           0/1     OOMKilled   0          13m

mondoo-client-k8s-scan-now-mcvfw is a manually triggered k8s resource scan:

kubectl -n mondoo-operator create job --from=cronjob.batch/mondoo-client-k8s-scan mondoo-client-k8s-scan-now
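The STATUS column above only summarizes each pod. To confirm which containers were killed for exceeding their memory limit, one option is to check the termination reason in the pod status. The sketch below assumes the output of `kubectl -n mondoo-operator get pods -o json` has already been parsed into a dictionary. The function name `oom_killed_containers`, the container names, and the sample data are hypothetical and not part of cnspec or the operator.

```python
def oom_killed_containers(pod_list: dict) -> list[tuple[str, str]]:
    """Return (pod name, container name) pairs whose current or previous
    termination reason is OOMKilled."""
    hits = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            # A restarted container reports the kill under lastState;
            # one that was not restarted (e.g. a Job pod) reports it under state.
            for key in ("state", "lastState"):
                terminated = cs.get(key, {}).get("terminated")
                if terminated and terminated.get("reason") == "OOMKilled":
                    hits.append((pod_name, cs["name"]))
                    break
    return hits

# Illustrative data shaped like `kubectl get pods -o json` output.
# Pod names are taken from the listing above; container names are made up.
sample = {
    "items": [
        {
            "metadata": {"name": "mondoo-client-containers-scan-28222987-v9xmj"},
            "status": {"containerStatuses": [
                {"name": "example-scan",
                 "state": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
            ]},
        },
        {
            "metadata": {"name": "mondoo-client-k8s-scan-now-mcvfw"},
            "status": {"containerStatuses": [
                {"name": "example-scan",
                 "state": {"terminated": {"reason": "Error", "exitCode": 1}}},
            ]},
        },
    ]
}

print(oom_killed_containers(sample))
```

With the sample data, only the container scan pod is reported, because the manually triggered k8s scan terminated with `Error` rather than `OOMKilled`.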

I had to increase the memory limit for the Scan API to 1GB to get it working. A limit of 700MB wasn't enough.

The container scan had a memory limit of 300M when it got OOMkilled.
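When comparing these limits, note that Kubernetes distinguishes decimal suffixes (`M`, `G`) from binary ones (`Mi`, `Gi`). If the limit was written literally as `300M`, it is roughly 5% lower than `300Mi`. The following is a minimal sketch of that arithmetic; `quantity_to_bytes` is a hypothetical helper that handles only whole numbers and a small subset of suffixes, not the full Kubernetes quantity grammar.

```python
# Subset of Kubernetes resource quantity suffixes (binary and decimal).
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
    "k": 10**3, "M": 10**6, "G": 10**9,
}

def quantity_to_bytes(quantity: str) -> int:
    """Convert a quantity such as '300M' or '1Gi' to bytes (integers only)."""
    # Check two-letter binary suffixes before one-letter decimal ones.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * _SUFFIXES[suffix]
    return int(quantity)  # no suffix: plain byte count

for q in ("300M", "300Mi", "1G", "1Gi"):
    print(q, quantity_to_bytes(q))
```
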

The issue started with this error in the logs of the Scan API:

2023-08-30T05:01:59Z FTL could not enable scan queue error="unable to create queue segment in /tmp/cnspec-queue/disk-queue: unable to load queue segment in /tmp/cnspec-queue/disk-queue: segment file /tmp/cnspec-queue/disk-queue/0000000000046.dque is corrupted: error reading gob data from file: EOF"

Perhaps the queue file grew too big?

To Reproduce

Steps to reproduce the behavior:

1. Deploy the operator on a GKE cluster
2. Start scanning
3. Note the error

Expected behavior

The scans should run without being killed, and with a lower memory limit.

Screenshots or CLI Output

The GCP metrics aren't that helpful:

Perhaps the sampling interval of 60s is too coarse to show the memory peak:

container/memory/used_bytes ... Sampled every 60 seconds.

https://cloud.google.com/monitoring/api/metrics_kubernetes
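To illustrate why a 60-second sampling interval can hide the peak that triggers an OOM kill, here is a small sketch. It assumes each data point is an instantaneous reading, and the usage numbers are invented for the example, not measured values from this cluster.

```python
def sample_every(series: list[int], interval: int) -> list[int]:
    """Keep one reading per `interval` seconds, starting at t=0."""
    return series[::interval]

# Hypothetical per-second memory usage in MB over five minutes:
# a steady 200 MB baseline with a 20-second spike to 320 MB starting at t=130.
usage = [200] * 300
for t in range(130, 150):
    usage[t] = 320

sampled = sample_every(usage, 60)  # readings at t = 0, 60, 120, 180, 240

print(max(usage))    # true peak: 320
print(max(sampled))  # peak visible in the 60-second samples: 200
```

In this made-up series, the spike exceeds a 300 MB limit but falls entirely between two samples, so the chart would show a flat 200 MB.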

Desktop (please complete the following information):

latest operator v1.15.2, latest cnspec v8 image

czunker commented 1 year ago

I've created a separate issue for the corrupted queue file: https://github.com/mondoohq/mondoo-operator/issues/852

imilchev commented 8 months ago

We merged multiple memory improvements over the last few weeks. I will close this issue.

mondoohq / cnspec

Container and k8s Resource scan get OOMkilled inside k8s #707