scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

`kill_stress_thread` failing to kill cassandra-stress process on the loader #5012

Open KnifeyMoloko opened 2 years ago

KnifeyMoloko commented 2 years ago

Installation details

Kernel Version: 5.13.0-1031-aws
Scylla version (or git commit hash): 5.0.0-20220711.1ad59d6a7 with build-id 3b91e26d78f1705b7066f9deb1cb73a3c03956ff
Cluster size: 6 nodes (i3.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-085f2a90a0f67349a (aws: us-east-1)

Test: longevity-10gb-3h-test
Test id: 2378acf8-7be9-430c-a8df-9e3f8f8131a6
Test name: scylla-5.0/longevity/longevity-10gb-3h-test
Test config file(s):

Issue description

During the test run, one of our db nodes went down due to a SpotTerminationError, which started the teardown process for the test. Part of that process is the `kill_stress_thread` method, which should kill the cassandra-stress processes on the loaders.

sct.log

< t:2022-07-11 07:48:22,927 f:tester.py       l:186  c:silence              p:DEBUG > Silently running 'stop_resources'
< t:2022-07-11 07:48:22,927 f:tester.py       l:2321 c:LongevityTest        p:DEBUG > Stopping all resources
...
...
...
< t:2022-07-11 07:48:37,629 f:tester.py       l:198  c:silence              p:DEBUG > Finished 'Kill Stress Threads'. No errors were silenced.

Six minutes after that, the loaders were still running the stress load:

sct.log

< t:2022-07-11 07:56:24,089 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > total,     225343848,   80620,   80620,   80620,    12.4,     9.3,    33.7,    45.3,    58.2,    76.2, 2910.0,  0.00576,      0,      0,       0,       0,       0,       0
< t:2022-07-11 07:56:24,266 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > total,     219607432,   77072,   77072,   77072,    12.9,     9.0,    36.7,    53.0,   109.0,   135.1, 2875.0,  0.00464,      0,      0,       0,       0,       0,       0
< t:2022-07-11 07:56:25,590 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > com.datastax.driver.core.exceptions.TransportException: [ip-10-0-0-35.ec2.internal/10.0.0.35:9042] Connection has been closed
< t:2022-07-11 07:56:25,766 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > com.datastax.driver.core.exceptions.TransportException: [ip-10-0-0-35.ec2.internal/10.0.0.35:9042] Connection has been closed
< t:2022-07-11 07:56:25,995 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > java.io.IOException: Operation x10 on key(s) [344e4f32303033343131]: Error executing: (UnavailableException): Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive)


Logs:

Jenkins job URL

fgelcer commented 2 years ago

I don't remember exactly which one, but there was a recent issue about the stress tool returning upon termination. IIUC, the best thing to do in this case is to ignore (filter out, or decrease the severity of) any event, especially from the loaders, from the moment we hit a critical event.
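The downgrade-after-critical idea can be sketched generically. This is illustrative only: SCT has its own event/severity machinery, and the class and field names below are made up for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Event:
    source: str    # e.g. "loader", "db-node" (hypothetical labels)
    severity: str  # "DEBUG" < "WARNING" < "ERROR" < "CRITICAL"


class LoaderEventDowngrader:
    """Once a CRITICAL event is observed, demote subsequent
    loader-originated ERROR events to WARNING, so teardown noise
    from still-running stress tools does not fail the test."""

    def __init__(self):
        self.critical_seen = False

    def process(self, event: Event) -> Event:
        if event.severity == "CRITICAL":
            self.critical_seen = True
        elif (self.critical_seen and event.source == "loader"
              and event.severity == "ERROR"):
            return Event(event.source, "WARNING")
        return event


d = LoaderEventDowngrader()
d.process(Event("db-node", "CRITICAL"))      # flips the switch
demoted = d.process(Event("loader", "ERROR"))
print(demoted.severity)  # → WARNING
```

The key design point is that the filter is stateful: events are judged relative to the first critical event, not in isolation.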

fruch commented 2 years ago

I would start by adding enough debugging information about the java processes before and after the kill. I'm guessing we are missing some of the processes, or maybe using the wrong signal.
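The before/after comparison could be done by capturing `pgrep -fa java`-style output on the loader at both points and diffing the pid sets. A minimal sketch of the parsing/diffing side (the sample pids and command lines below are invented for illustration):

```python
def parse_pgrep_lines(output: str) -> dict[int, str]:
    """Parse `pgrep -fa <pattern>` output ("<pid> <cmdline>" per line)
    into a {pid: cmdline} map, skipping blank or malformed lines."""
    procs = {}
    for line in output.splitlines():
        pid_str, _, cmdline = line.strip().partition(" ")
        if pid_str.isdigit():
            procs[int(pid_str)] = cmdline
    return procs


# Hypothetical snapshots taken before and after sending the kill:
before = parse_pgrep_lines(
    "12345 java -jar cassandra-stress.jar write ...\n"
    "12346 /bin/sh run_stress.sh\n"
)
after = parse_pgrep_lines("12346 /bin/sh run_stress.sh\n")

survivors = before.keys() & after.keys()
print(sorted(survivors))  # → [12346]: the wrapper shell outlived the kill
```

Logging `before`, `after`, and `survivors` (with command lines, not just pids) would show directly whether the pgrep patterns miss a process or whether the signal is ignored.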

amoskong commented 1 year ago
  1. It is worth checking whether we get a warning like this:

    • failed to kill stress-command on ...
  2. We use three pgrep commands to search for cassandra-stress process IDs; we need to check whether they cover all cases:

    def kill_cassandra_stress_thread(self):
        search_cmds = [
            'pgrep -f .*cassandra.*',
            'pgrep -f cassandra.stress',
            'pgrep -f cassandra-stress'
        ]
  3. loader.remoter.run(cmd=f'{filter_cmd} | xargs -I{{}} kill -TERM {{}}', verbose=True, ignore_status=True)
    • the status is ignored
    • we assume the output of filter_cmd is standardized; this can be confirmed with the verbose log
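Since the status is ignored and nothing re-checks the pids, a kill can silently fail. A hedged sketch of a kill-and-verify pattern (hypothetical helper; SCT's real `kill_cassandra_stress_thread` runs the kill remotely via `loader.remoter` and never re-checks, which is how survivors go unnoticed):

```python
import os
import signal
import subprocess
import threading
import time


def kill_and_verify(pids, timeout=5.0):
    """Send SIGTERM, poll until the pids disappear, then escalate to
    SIGKILL for any survivors. Returns the pids that needed SIGKILL."""
    for pid in pids:
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # already gone
    survivors = set(pids)
    deadline = time.time() + timeout
    while survivors and time.time() < deadline:
        for pid in list(survivors):
            try:
                os.kill(pid, 0)  # signal 0 only probes for existence
            except ProcessLookupError:
                survivors.discard(pid)
        time.sleep(0.2)
    for pid in survivors:
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass
    return survivors


# Demo: a throwaway `sleep` process; a reaper thread collects it so the
# pid actually disappears once SIGTERM lands (otherwise it would linger
# as a zombie and look like a survivor).
proc = subprocess.Popen(["sleep", "60"])
threading.Thread(target=proc.wait, daemon=True).start()
needed_kill = kill_and_verify([proc.pid])
print(needed_kill)  # → set(): SIGTERM was enough
```

The same escalate-and-verify loop could be expressed over `loader.remoter.run` calls; the point is that the kill is only done when the pid set from a follow-up pgrep comes back empty.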
fgelcer commented 1 year ago

thank you @amoskong for the feedback