scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

'smp' scylla command line argument has no effect when cluster is created on docker backend #7307

Open dimakr opened 5 months ago

dimakr commented 5 months ago

Issue description

Even though a specific value can be set for the --smp option when starting Scylla from the command line, it has no effect if the cluster is created on the docker backend. The created Scylla containers use only one host CPU core, because the value is currently hardcoded: https://github.com/scylladb/scylla-cluster-tests/blob/master/sdcm/cluster_docker.py#L60
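A possible direction for a fix: instead of hardcoding the value, parse the requested core count out of the appended Scylla args. The sketch below is illustrative only; the function names are hypothetical and not part of the actual SCT code, it merely shows how SCT_APPEND_SCYLLA_ARGS could drive the container startup arguments.

```python
# Hypothetical sketch, not the real SCT implementation: derive the --smp
# value for docker db-node containers from the appended Scylla arguments
# (e.g. the SCT_APPEND_SCYLLA_ARGS environment variable) instead of
# hardcoding "--smp 1" as sdcm/cluster_docker.py does today.
import re


def smp_from_append_args(append_scylla_args: str, default: int = 1) -> int:
    """Extract the --smp value from an append-args string, or fall back to default."""
    match = re.search(r"--smp[= ](\d+)", append_scylla_args)
    return int(match.group(1)) if match else default


def build_node_startup_args(append_scylla_args: str) -> str:
    """Build the (hypothetical) scylla startup argument string for a docker node."""
    smp = smp_from_append_args(append_scylla_args)
    return f"--smp {smp}"
```

With the reproducer's settings, `smp_from_append_args(" --smp 2 --memory 4G")` would return 2, so the containers would be allowed to use two cores.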

Steps to Reproduce

  1. Start a test on the docker backend with a scenario that puts substantial load on the cluster
    export SCT_APPEND_SCYLLA_ARGS=' --smp 2 --memory 4G'
    export SCT_CONFIG_FILES='["configurations/nemesis/longevity-5gb-1h-nemesis.yaml","configurations/nemesis/AbortRepairMonkey.yaml"]'
    ./sct.py run-test longevity_test.LongevityTest.test_custom_time --backend docker
  2. Wait for the stress commands to start
  3. Run the docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}" command to see CPU utilization per container
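Step 3 can also be automated. The sketch below (an assumption, not part of SCT) parses `docker stats` output and picks out the db-node containers, so the symptom of being pinned at roughly one core (~100%) is easy to assert on:

```python
# Sketch of an automated check for step 3. The name filter and the idea of
# flagging values near 100% are assumptions based on this reproduction,
# not SCT code.
import subprocess


def parse_cpu_percents(stats_output: str, name_filter: str = "db-node") -> dict:
    """Parse `docker stats --format '{{.Name}}\t{{.CPUPerc}}'` output
    into {container_name: cpu_percent} for matching containers."""
    result = {}
    for line in stats_output.strip().splitlines():
        name, cpu = line.split("\t")
        if name_filter in name:
            result[name] = float(cpu.rstrip("%"))
    return result


def db_node_cpu_percents() -> dict:
    """Run docker stats once and return CPU usage of db-node containers."""
    out = subprocess.check_output(
        ["docker", "stats", "--no-stream", "--format", "{{.Name}}\t{{.CPUPerc}}"],
        text=True,
    )
    return parse_cpu_percents(out)
```

With --smp 2 one would expect values well above 100% under load; values hovering around 100%, as in the output below, indicate a single core is being used.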

Expected behavior: CPU utilization by db-node containers is (up to) 200%.

Actual behavior: CPU utilization by db-node containers is ~100%, indicating that the containers use only one core.

longevity-5gb-1h-AbortRepairMonkey--db-node-e9201320-2                  100.76%
longevity-5gb-1h-AbortRepairMonkey--db-node-e9201320-1                  103.39%
longevity-5gb-1h-AbortRepairMonkey--db-node-e9201320-0                  105.97%

Impact

The issue prevents utilizing the available CPU resources of the local machine when running tests on the docker backend locally. It may also become necessary to throttle the stress command load, otherwise the test can fail due to docker container performance degradation.

How frequently does it reproduce?

Consistently, whenever tests are started on the docker backend.

Installation details

SCT Version: master
Scylla version: 5.4.4
Environment: local machine with 20 cores
Test config: longevity-5gb-1h-nemesis.yaml
Additional test config parameters:

monitor_swap_size: 0

send_email: false
enable_argus: false

server_encrypt: false
client_encrypt: false
fruch commented 5 months ago

@dimakr what is the actual outcome? Is the test slow? Is the stress command failing?

I.e., can a test still finish and pass like that?

dimakr commented 5 months ago

@fruch Currently, nemesis tests executed on the docker backend (locally or on AWS) are flaky (they can pass/fail/timeout), and it is likely that the described issue is the cause. Monitoring shows that db-node containers are overloaded (their load is constantly near 100%) during stress command execution.

We need to reduce the default load parameters (as set in longevity-5gb-1h-nemesis.yaml), adjusting them to the resources actually available to db-node containers, and also fix this issue.

vponomaryov commented 5 months ago

https://github.com/scylladb/scylla-cluster-tests/blob/master/sdcm/cluster_docker.py#L60

Please use specific commits in links like this one. The given ref will only be correct for as long as that file doesn't change.

Example of how it should look: https://github.com/scylladb/scylla-cluster-tests/blob/2c2369d/sdcm/cluster_docker.py#L60

fruch commented 5 months ago

Right click -> "get permanent link" would work best.

https://github.com/scylladb/scylla-cluster-tests/blob/4571d22e2517c7706e5296d376f43a348b456dfc/sdcm/cluster_docker.py#L60

fruch commented 5 months ago

Let's start with a tuned-down version of the base scenario: e.g. 100 MB of data, rate limited to 1000 rps, with 5 threads, and see if it works better.
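That suggestion could be expressed as an SCT config override roughly along these lines (a sketch only: stress_cmd is the usual SCT knob for the stress command, but the exact cassandra-stress flags, population, and values here are assumptions based on fruch's ballpark numbers, not a tested configuration):

```yaml
# Illustrative tuned-down override (values are assumptions, not a tested config):
# smaller dataset than the 5 GB default, throttled rate, fewer threads.
stress_cmd: >-
  cassandra-stress write duration=60m
  -mode cql3 native
  -rate threads=5 throttle=1000/s
```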

Then we'll circle back and see how to fix this issue, since it's a bit tricky to refactor this specific part so that it can be controlled from, or driven by, the configuration.