dimakr opened this issue 5 months ago
@dimakr what is the actual outcome: is the test slow? Is the stress command failing?
I.e., can a test still finish and pass like that?
@fruch Currently Nemesis tests executed on the docker backend (locally / on AWS) are flaky (they can pass/fail/timeout), and it is likely that the described issue is causing this. Monitoring shows that the db-node containers are overloaded (their load constantly sits at ~100%) during stress command execution.
We need to reduce the default load parameters (as set in longevity-5gb-1h-nemesis.yaml), adjusting them to the resources available to the db-node containers, and also fix this issue.
https://github.com/scylladb/scylla-cluster-tests/blob/master/sdcm/cluster_docker.py#L60
Please use specific commits in links like this one. Currently the given ref will only stay correct until that file changes.
Example of how it should look: https://github.com/scylladb/scylla-cluster-tests/blob/2c2369d/sdcm/cluster_docker.py#L60
Right click -> get permanent link would work best.
Let's start with a toned-down version of the base load: ~100 MB of data, rate limited to 1000 rps, 5 threads,
and see if it works better.
Then we'll circle back to fixing this issue, since it's a bit tricky to refactor this specific part in a way that can be controlled from, or based on, the configuration.
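A minimal sketch of what such a toned-down profile could look like, assuming a stress_cmd entry in the style of the existing longevity configs; the operation count used to land near 100 MB depends on the row size and is only a rough estimate here, not a value from the actual config:

```yaml
# Hypothetical toned-down variant of longevity-5gb-1h-nemesis.yaml (a sketch, not the real config).
# ~350k default-size (~300 B) rows is a rough approximation of a 100 MB dataset.
stress_cmd: "cassandra-stress write cl=QUORUM n=350000 -mode cql3 native -rate threads=5 throttle=1000/s"
```

The throttle=1000/s cap keeps the request rate bounded regardless of how many client threads are running, so the db-node containers shouldn't be driven to 100% load just by the client.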
Issue description
Even though a specific value can be set for the --smp option when starting Scylla from the command line, it has no effect if the cluster is created on the docker backend. The created Scylla containers use only one host CPU core, as this value is currently hardcoded: https://github.com/scylladb/scylla-cluster-tests/blob/master/sdcm/cluster_docker.py#L60
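For illustration, a minimal sketch of how that hardcoded value could instead be taken from the test configuration; the class and parameter names below are assumptions made for the sketch and do not match the actual code in sdcm/cluster_docker.py:

```python
# Hypothetical sketch; names are illustrative, not SCT's real classes.
class DockerScyllaNode:
    def __init__(self, params: dict):
        # 'params' stands in for the SCT test configuration object
        self.params = params

    def scylla_start_command(self) -> str:
        # Fall back to a single core when 'smp' is not set in the test
        # config, instead of always hardcoding --smp 1.
        smp = self.params.get("smp") or 1
        return f"--smp {smp} --developer-mode 1"
```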
Steps to Reproduce
1. Start a test on the docker backend with the --smp option set to 2.
2. Run the docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}" command to see CPU utilization by the containers.
Expected behavior: CPU utilization by db-node containers is (up to) 200%.
Actual behavior: CPU utilization by db-node containers is ~100%, indicating that the containers utilize only one core.
Impact
The issue prevents utilizing the available CPU resources of the local machine when running tests on the docker backend locally. It may also be necessary to throttle the stress command load; otherwise the test can fail due to docker container performance degradation.
How frequently does it reproduce?
Whenever tests are started on the docker backend.
Installation details
SCT version: master
Scylla version: 5.4.4
Environment: local machine with 20 cores
Test config: longevity-5gb-1h-nemesis.yaml
Additional test config parameters: