stackrox / collector

Runtime data collection for the StackRox Kubernetes Security Platform using eBPF
Apache License 2.0

Scale tests with CO-RE BPF #1321

Open erthalion opened 1 year ago

erthalion commented 1 year ago

The ideal result is:

JoukoVirtanen commented 1 year ago

Currently, two long-running clusters are created to test each release. One of them has load generated by kube-burner, which runs berserker containers that generate process and listening-endpoint load. Collector runs in the same cluster with the CORE_BPF collection method. The config files used by kube-burner can be found at https://github.com/stackrox/stackrox/tree/master/scripts/release-tools/kube-burner-configs

The long-running cluster for 4.3.0-rc1 is currently running. It is monitored in a loop with kubectl -n stackrox top pod and by fetching the metrics from the collector and sensor pods.
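A minimal sketch of what such a monitoring loop could look like. The function names, the CSV layout, and the INTERVAL and OUTFILE variables are assumptions made for illustration, not the script actually used for the release clusters.

```shell
#!/bin/sh
# Convert `kubectl top pod` output (read from stdin) into CSV rows.
# The header row is dropped and each data row is prefixed with the
# timestamp passed as the first argument.
# Output columns: timestamp,pod,cpu,memory
top_to_csv() {
  awk -v ts="$1" 'NR > 1 && NF >= 3 { print ts "," $1 "," $2 "," $3 }'
}

# Sample pod usage in the stackrox namespace until interrupted.
# INTERVAL (seconds) and OUTFILE are hypothetical knobs with defaults.
monitor_stackrox() {
  while true; do
    kubectl -n stackrox top pod \
      | top_to_csv "$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "${OUTFILE:-stackrox-top.csv}"
    sleep "${INTERVAL:-60}"
  done
}
```

Calling monitor_stackrox with KUBECONFIG pointing at the long-running cluster appends one row per pod per sample, which can later be plotted per pod.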

Here are some of the relevant PRs that have contributed to this work:

ROX-19857: long running collector should have listening endpoints load https://github.com/stackrox/stackrox/pull/7929

Jv rox 17741 long running cluster should include collector https://github.com/stackrox/actions/pull/20

Jv rox 19896 long running collector should use core bpf https://github.com/stackrox/actions/pull/34

Jv rox 17741 long running cluster should include collector kube burner configs https://github.com/stackrox/test-gh-actions/pull/116

I will add the results from the long-running cluster with real load here.

Let me know if anything else is needed.

JoukoVirtanen commented 1 year ago

[Image: output_plot]

[Image: output_plot_cpu]

The plots above show memory and CPU usage for the 4.3 long-running cluster.

JoukoVirtanen commented 9 months ago

I did the following to create a long-running cluster for master:

cdrox
git checkout master
smart-branch jv-test-long-running-with-tag-2
git commit -m "Empty commit to trigger ci" --allow-empty
git tag -a 0.0.8 -m "Test tag for long running cluster"
git push origin 0.0.8
git push origin HEAD

The master commit was ca0b6ba29d4ab50f34b5f022b64078a18e3482de

I then created a PR and waited for the images to be built and pushed.

I then went to https://github.com/stackrox/test-gh-actions/actions/workflows/create-clusters.yml, clicked on "Run workflow", changed the version to 0.0.8, and selected "Create a long-running cluster on RC1". I waited for the GitHub Action to finish.

To get the Grafana plots, I did the following:

infractl artifacts long-real-load-0-0-8 --download-dir /tmp/artifacts-long-real-load-0-0-8
export KUBECONFIG=/tmp/artifacts-long-real-load-0-0-8/kubeconfig
kubectl -n stackrox port-forward service/monitoring 48443:8443 > /dev/null 2>&1 &
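The three commands above can be wrapped so that they work for any version. This is only a sketch: both function names are made up, and the rule that the cluster name is "long-real-load-" followed by the version with dots replaced by dashes is an assumption inferred from the single 0.0.8 example.

```shell
#!/bin/sh
# Assumption: cluster names follow the pattern seen for 0.0.8,
# i.e. "long-real-load-" plus the version with "." replaced by "-".
cluster_name_for_version() {
  printf 'long-real-load-%s\n' "$(printf '%s' "$1" | tr '.' '-')"
}

# Download the cluster artifacts, point kubectl at the cluster, and
# forward the monitoring service to local port 48443 in the background.
open_monitoring() {
  name="$(cluster_name_for_version "$1")"
  dir="/tmp/artifacts-$name"
  infractl artifacts "$name" --download-dir "$dir" || return 1
  export KUBECONFIG="$dir/kubeconfig"
  kubectl -n stackrox port-forward service/monitoring 48443:8443 > /dev/null 2>&1 &
}
```

For example, open_monitoring 0.0.8 would run the same commands as shown above for the long-real-load-0-0-8 cluster.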

Go to https://localhost:48443/?orgId=1 in your browser. Enter admin for the username and stackrox for the password. In the toolbar on the left, select Dashboard->Manage, then click on Core Dashboard. After about 7 days, the Core Dashboard showed the following:

Screenshot from 2024-02-08 15-41-22

Note that profiling is disabled in release versions, so it is not possible to profile them. With this version I was able to collect profiles, though the output doesn't seem right. I checked out the collector commit in COLLECTOR_VERSION and built it locally. I then did the following to get the profiles and visualize one of them:

cdrox
./scripts/secured-cluster-diagnostics.sh
cd /tmp/k8s-service-logs/stackrox/metrics/
pprof /home/jvirtane/projects/collector/cmake-build/collector/collector collector-zhl6m-heap.prof -web

Screenshot from 2024-02-08 15-52-52

Screenshot from 2024-02-08 15-53-09

Screenshot from 2024-02-08 15-53-28
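To look at the heap profiles of every collector pod rather than just one, the pprof step above could be looped. This is a sketch under assumptions: the function name and the PPROF override variable are hypothetical, and the collector-*-heap.prof file pattern is inferred from the single file name collector-zhl6m-heap.prof.

```shell
#!/bin/sh
# Run pprof on every collector heap profile in a directory, keeping the
# argument order of the single command above.
#   $1: path to the locally built collector binary
#   $2: directory with the profiles (defaults to the diagnostics output dir)
# PPROF is a hypothetical override for the pprof executable.
view_heap_profiles() {
  binary="$1"
  dir="${2:-/tmp/k8s-service-logs/stackrox/metrics}"
  for prof in "$dir"/collector-*-heap.prof; do
    [ -e "$prof" ] || continue   # the glob matched no files
    "${PPROF:-pprof}" "$binary" "$prof" -web
  done
}
```

Setting PPROF=echo prints the commands instead of running them, which is a cheap way to check which profiles would be opened.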