Open erthalion opened 1 year ago
Currently, two long-running clusters are created for testing releases. One of them has load generated by kube-burner, which runs berserker containers that generate process and listening-endpoint load. Collector runs in the same cluster with the CORE_BPF collection method. The config files used by kube-burner can be found at https://github.com/stackrox/stackrox/tree/master/scripts/release-tools/kube-burner-configs
The long-running cluster for 4.3.0-rc1 is currently running. It is being monitored in a loop with kubectl -n stackrox top pod and by getting the metrics from the collector and sensor pods.
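For illustration, the monitoring loop could look roughly like the sketch below. Only the kubectl -n stackrox top pod command comes from this thread; the function name, its parameters, and the log format are hypothetical, and this is not necessarily the loop that was actually used.

```shell
# Sketch: sample pod resource usage a fixed number of times.
# Only the kubectl command is taken from the thread; everything else
# (function name, arguments, log format) is an assumption.
monitor_pods() {
    samples="$1"      # how many samples to take
    interval="$2"     # seconds to sleep between samples
    logfile="$3"      # file the samples are appended to
    i=0
    while [ "$i" -lt "$samples" ]; do
        # Timestamp each sample so it can be correlated with other metrics.
        echo "=== $(date -u +%Y-%m-%dT%H:%M:%SZ) ===" >> "$logfile"
        kubectl -n stackrox top pod >> "$logfile" 2>&1
        i=$((i + 1))
        # Skip the sleep after the final sample.
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, monitor_pods 1440 60 /tmp/top-pod.log would take one sample per minute for roughly a day.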
Here are some of the relevant PRs that have contributed to this work:
ROX-19857: long running collector should have listening endpoints load https://github.com/stackrox/stackrox/pull/7929
Jv rox 17741 long running cluster should include collector https://github.com/stackrox/actions/pull/20
Jv rox 19896 long running collector should use core bpf https://github.com/stackrox/actions/pull/34
Jv rox 17741 long running cluster should include collector kube burner configs https://github.com/stackrox/test-gh-actions/pull/116
I will add here the results from the long running cluster with real load.
Let me know if anything else is needed.
The above are the plots of memory and CPU usage for the 4.3 long running cluster.
I did the following to create a long-running cluster for master:
cdrox
git checkout master
smart-branch jv-test-long-running-with-tag-2
git commit -m "Empty commit to trigger ci" --allow-empty
git tag -a 0.0.8 -m "Test tag for long running cluster"
git push origin 0.0.8
git push origin HEAD
The master commit was ca0b6ba29d4ab50f34b5f022b64078a18e3482de
I then created a PR and waited for the images to be built and pushed.
I then went to https://github.com/stackrox/test-gh-actions/actions/workflows/create-clusters.yml, clicked on "Run workflow", changed the version to 0.0.8, and selected "Create a long-running cluster on RC1". I waited for the GitHub Action to finish.
To get the Grafana plots I did the following:
infractl artifacts long-real-load-0-0-8 --download-dir /tmp/artifacts-long-real-load-0-0-8
export KUBECONFIG=/tmp/artifacts-long-real-load-0-0-8/kubeconfig
kubectl -n stackrox port-forward service/monitoring 48443:8443 > /dev/null 2>&1 &
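To reuse the commands above for another version, the cluster name and artifact directory could be derived from the version tag. The sketch below assumes the naming pattern seen in this one example (0.0.8 becomes long-real-load-0-0-8). The pattern has not been confirmed for other versions, and both helper names are hypothetical.

```shell
# Hypothetical helper: derive the cluster name from a version tag,
# assuming the "long-real-load-<version with dots as dashes>" pattern
# inferred from the single example in this thread.
cluster_name_for_version() {
    # Replace every dot in the version with a dash.
    printf 'long-real-load-%s\n' "$(printf '%s' "$1" | tr '.' '-')"
}

# Hypothetical helper: directory used for the downloaded artifacts,
# following the /tmp/artifacts-<cluster name> layout used above.
artifacts_dir_for_version() {
    printf '/tmp/artifacts-%s\n' "$(cluster_name_for_version "$1")"
}
```

With these, the download step could be written as infractl artifacts "$(cluster_name_for_version 0.0.8)" --download-dir "$(artifacts_dir_for_version 0.0.8)".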
Go to https://localhost:48443/?orgId=1 in your browser. Enter admin for the username and stackrox for the password. In the toolbar on the left select Dashboard->Manage. Click on Core Dashboard. After about 7 days the core dashboard showed the following
Note that profiling is disabled in release versions, so it is not possible to profile them. With this version I was able to do profiling, though it doesn't seem right. I checked out the collector commit in COLLECTOR_VERSION and built it locally. I then did the following to get the profiles and visualize one of them:
cdrox
./scripts/secured-cluster-diagnostics.sh
cd /tmp/k8s-service-logs/stackrox/metrics/
pprof /home/jvirtane/projects/collector/cmake-build/collector/collector collector-zhl6m-heap.prof -web
The ideal result is:
Incorporate a relevant workload generator into kube-burner
Perform ACS scale tests using the configuration above against a cluster with the core_bpf collection method
Collect resource usage metrics from the test for further analysis
Verify memory consumption
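As one possible way to approach the memory check, the sketch below extracts the peak memory from captured kubectl top pod output and compares it with a limit. It assumes the three-column layout NAME, CPU(cores), MEMORY(bytes) with memory values carrying an Mi suffix. The function names and any limit passed in are hypothetical; no threshold has been agreed on in this issue.

```shell
# Sketch: report the peak memory (in Mi) seen for pods whose name starts
# with a given prefix, reading `kubectl top pod` output from stdin.
# Assumes columns NAME CPU(cores) MEMORY(bytes) and an "Mi" suffix.
peak_memory_mi() {
    awk -v prefix="$1" '
        index($1, prefix) == 1 && $3 ~ /^[0-9]+Mi$/ {
            mem = $3
            sub(/Mi$/, "", mem)          # strip the unit
            if (mem + 0 > max) max = mem + 0
        }
        END { print max + 0 }            # prints 0 if nothing matched
    '
}

# Hypothetical check: succeed only if the peak is at or below the limit
# (in Mi). Reads the same input from stdin.
memory_within_limit() {
    peak="$(peak_memory_mi "$1")"
    [ "$peak" -le "$2" ]
}
```

For example, kubectl -n stackrox top pod | memory_within_limit collector 500 would fail if any collector pod reported more than 500Mi, where 500 is only a placeholder value.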