scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

The tearDown kills stress docker containers before stopping nemesis which may try to use these containers #5945

Open vponomaryov opened 1 year ago

vponomaryov commented 1 year ago

Description

The no_corrupt_repair nemesis uses loader docker containers to create temporary keyspaces and tables. Due to https://github.com/scylladb/scylladb/issues/13322 we can fall into a long retry loop on the "get peers info" requests, which time out because of that bug. Then, when the scheduled load ends, the Scylla cluster starts responding without timeouts, the "get peers info" requests start passing, and the remaining loop steps continue.

The problem is that the tearDown starts in the middle of this table-creation loop, so we may get the following error:

10:10:13  < t:2023-03-19 08:10:13,335 f:base.py         l:146  c:RemoteLibSSH2CmdRunner p:ERROR > Error executing command: "sudo  docker exec 68817025f98cc624336324c7aa28d89ae8313e02e975db8c4a4e9d817c0123ec /bin/sh -c 'echo TAG: loader_idx:0-cpu_idx:0-keyspace_idx:1; STRESS_TEST_MARKER=Z2BB6I3LRM7YWQUDBDAK; cassandra-stress write no-warmup n=400000 cl=QUORUM -mode native cql3 -schema keyspace=drop_table_during_repair_ks_9 '"'"'replication(factor=6)'"'"' -log interval=5 -node 10.4.3.85,10.4.1.227,10.4.3.164,10.4.1.37,10.4.3.200,10.4.1.244 -errors skip-unsupported-columns'"; Exit status: 1
10:10:13  < t:2023-03-19 08:10:13,337 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2023-03-19 08:10:13.336: (CassandraStressEvent Severity.ERROR) period_type=end event_id=37a94855-c811-4f94-a0a3-c249e1002a3f during_nemesis=NoCorruptRepair duration=0s: node=Node longevity-cdc-100gb-4h-master-loader-node-99364f24-2 [54.195.164.25 | 10.4.0.181] (seed: False)
10:10:13  < t:2023-03-19 08:10:13,337 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > stress_cmd=cassandra-stress write n=400000 cl=QUORUM -mode native cql3 -schema 'replication(factor=6)' -log interval=5
10:10:13  < t:2023-03-19 08:10:13,337 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > errors:
10:10:13  < t:2023-03-19 08:10:13,337 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
10:10:13  < t:2023-03-19 08:10:13,337 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Stress command completed with bad status 1: Error: No such container: 68817025f98cc624336324c7aa28d89ae8313e02e975db8c4a4e9d817c0123ec

It is caused by the racy deletion of the docker containers in between two actions performed by the docker loader class.

The SCT code:

# sdcm/tester.py
class ClusterTester(db_stats.TestStatsMixin, unittest.TestCase):
    def tearDown(self):                                                                             
        self.teardown_started = True                                                                
        with silence(parent=self, name='Sending test end event'):                                   
            InfoEvent(message="TEST_END").publish()                                                 
        self.log.info('TearDown is starting...')
        self.stop_timeout_thread()
        self.stop_event_analyzer()
  ->    self.stop_resources()
        ...

    @silence()                                                                                      
    def stop_resources(self):  # pylint: disable=no-self-use                                        
        self.log.debug('Stopping all resources')                                                    
        with silence(parent=self, name="Kill Stress Threads"):                                      
  ->        self.kill_stress_thread()                                                               

        # Stopping nemesis, using timeout of 30 minutes, since replace/decommission node can take time
        if self.db_cluster:                                                                         
            self.get_nemesis_report(self.db_cluster)                                                
            self.stop_nemesis(self.db_cluster)                                                      
            self.stop_resources_stop_tasks_threads(self.db_cluster)                                 
            self.get_backtraces(self.db_cluster) 
...
    def kill_stress_thread(self):                                                                   
        if self.loaders:  # the test can fail on provision step and loaders are still not provisioned
            self.loaders.kill_stress_thread()
...

# sdcm/cluster.py
class BaseLoaderSet():
    def kill_stress_thread(self):                                                                   
        if self.nodes and self.nodes[0].is_kubernetes():                                            
            for node in self.nodes:                                                                 
                node.remoter.stop()                                                                 
        else:                                                                                       
            if self.params.get("use_prepared_loaders"):                                             
                self.kill_cassandra_stress_thread()                                                 
            else:                                                                                   
     ->         self.kill_docker_loaders()

    def kill_docker_loaders(self):                                                                  
        for loader in self.nodes:                                                                   
            try:                                                                                    
     ->         loader.remoter.run(cmd='docker ps -a -q | xargs docker rm -f', verbose=True, ignore_status=True)
     ->         self.log.info("Killed docker loader on node: %s", loader.name)                      
            except Exception as ex:  # pylint: disable=broad-except                                 
                self.log.warning("failed to kill docker stress command on [%s]: [%s]",              
                        str(loader), str(ex)) 
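
The docker ps -a -q | xargs docker rm -f call above removes every container on the loader, including containers that a still-running nemesis has just started. For illustration only, here is a hedged sketch of a more targeted variant that removes just the containers carrying the shell_marker label the loader sets when starting them (visible in the docker run --label shell_marker=... line in the logs below); the marker parameter is hypothetical and not an existing SCT argument:

# Sketch only, not the actual SCT implementation
def kill_docker_loaders(self, marker=None):
    # When a marker is given, remove only the containers labeled with it,
    # leaving containers started by other threads (e.g. a nemesis) alone.
    label_filter = f"--filter label=shell_marker={marker}" if marker else ""
    for loader in self.nodes:
        try:
            loader.remoter.run(
                cmd=f'docker ps -a -q {label_filter} | xargs -r docker rm -f',
                verbose=True, ignore_status=True)
            self.log.info("Killed docker loader containers on node: %s", loader.name)
        except Exception as ex:  # pylint: disable=broad-except
            self.log.warning("failed to kill docker stress command on [%s]: [%s]",
                             str(loader), str(ex))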

And logs:

10:10:12  < t:2023-03-19 08:10:12,285 f:cluster.py      l:4819 c:sdcm.cluster         p:INFO  > Cluster longevity-cdc-100gb-4h-master-loader-set-99364f24 (AMI: ['ami-042cf1bc21e30ce60'] Type: c5.xlarge): Killed docker loader on node: longevity-cdc-100gb-4h-master-loader-node-99364f24-1
...
709920 < t:2023-03-19 08:10:12,286 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "docker ps -a -q | xargs docker rm -f"...     
709921 < t:2023-03-19 08:10:12,292 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 68817025f98cc624336324c7aa28d89ae8313e02e975db8c4a4e9d817c0123ec
709922 < t:2023-03-19 08:10:12,495 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 68817025f98c 
709923 < t:2023-03-19 08:10:12,495 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 7da55ebb557a
709924 < t:2023-03-19 08:10:12,495 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > a631b5dab7b5
709925 < t:2023-03-19 08:10:12,792 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo  docker run  --network=host --label shell_marker=Z2BB6I3LRM7YWQUDBDAK --entrypoint /bin/bash -v $HOME/cs-hdr-write-l0-c0-k1-cc7f6453-7412-4b73-af4a-e95f4b8c4606.hdr:/cs-hdr-write-l0-c0-k1-cc7f6453-7412-4b73-af4a-e95f4b8c4606.hdr -d  scylladb/scylla-nightly:5.3.0-dev-0.20230316.5705df77a155 -c 'tail -f /dev/null'" finished with status 0
...
10:10:13  < t:2023-03-19 08:10:12,996 f:cluster.py      l:4819 c:sdcm.cluster         p:INFO  > Cluster longevity-cdc-100gb-4h-master-loader-set-99364f24 (AMI: ['ami-042cf1bc21e30ce60'] Type: c5.xlarge): Killed docker loader on node: longevity-cdc-100gb-4h-master-loader-node-99364f24-2
...
10:10:13  < t:2023-03-19 08:10:13,337 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2023-03-19 08:10:13.336: (CassandraStressEvent Severity.ERROR) period_type=end event_id=37a94855-c811-4f94-a0a3-c249e1002a3f during_nemesis=NoCorruptRepair duration=0s: node=Node longevity-cdc-100gb-4h-master-loader-node-99364f24-2 [54.195.164.25 | 10.4.0.181] (seed: False)
10:10:13  < t:2023-03-19 08:10:13,337 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > stress_cmd=cassandra-stress write n=400000 cl=QUORUM -mode native cql3 -schema 'replication(factor=6)' -log interval=5
10:10:13  < t:2023-03-19 08:10:13,337 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > errors:
10:10:13  < t:2023-03-19 08:10:13,337 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
10:10:13  < t:2023-03-19 08:10:13,337 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Stress command completed with bad status 1: Error: No such container: 68817025f98cc624336324c7aa28d89ae8313e02e975db8c4a4e9d817c0123ec

Steps to Reproduce

  1. Run the above-mentioned CI job, where we load a Scylla cluster to 90%.
  2. See the error.

Expected behavior: SCT should know that it has killed the loaders and must not try to use them afterwards.

Actual behavior: SCT allows a not-yet-stopped nemesis to use loaders that get killed in the tearDown before the nemesis is stopped.

fruch commented 1 year ago

There are two issues:

1) We can't stop the nemesis in flight; once it has started, we can only wait for it.

2) We don't housekeep the stress threads, so we can't iterate over them and kill each one specifically; instead we iterate over the loaders and kill all docker containers.

I think we should try to address only the 2nd issue: each stress thread should be killed only by its own code, which can take the different stages of execution into account and be aware that it was killed.
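
A rough sketch of what such housekeeping could look like: a registry that every stress thread joins when it starts and that tearDown walks, instead of wiping all docker containers on the loaders. All names below are illustrative, not existing SCT APIs:

# Illustrative sketch, not existing SCT code
class StressThreadRegistry:
    def __init__(self):
        self._threads = []

    def register(self, stress_thread):
        # Called by each stress thread right after it starts.
        self._threads.append(stress_thread)

    def kill_all(self):
        # Each thread knows its own containers and execution stage,
        # so it can stop itself cleanly and mark itself as killed.
        for stress_thread in self._threads:
            stress_thread.kill()
        self._threads.clear()

tearDown would then call something like self.stress_registry.kill_all() instead of self.loaders.kill_docker_loaders().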

vponomaryov commented 1 year ago

Actually, the solution is simple:

And now we have this bug just because the 4th step is included in the 2nd one.

fruch commented 1 year ago
  • stop stress processes keeping docker containers running

How do you do that without housekeeping the stress thread objects?

vponomaryov commented 1 year ago

def __del__(self):
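
A minimal sketch of that idea, assuming each docker-based stress thread keeps the IDs of the containers it started; the class and attribute names are illustrative, not the actual SCT classes:

# Illustrative sketch, not the actual SCT stress-thread class
class DockerBasedStressThread:
    def __init__(self, loader, docker_container_ids):
        self.loader = loader
        self.docker_container_ids = docker_container_ids
        self.killed = False

    def kill(self):
        # Remove only the containers this particular thread started.
        for container_id in self.docker_container_ids:
            self.loader.remoter.run(cmd=f'docker rm -f {container_id}',
                                    verbose=True, ignore_status=True)
        self.killed = True

    def __del__(self):
        # Runs when the object is garbage-collected (e.g. around test teardown),
        # so the containers get cleaned up without explicit housekeeping.
        if not self.killed:
            self.kill()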

fruch commented 1 year ago

But at this stage the object exists, and there is a running ssh command (docker exec ...). How are you going to stop the stress processes while keeping the docker containers?

vponomaryov commented 1 year ago

The docker containers are started with an idle, never-ending process (tail -f /dev/null, as seen in the docker run command in the logs above). Stopping the separate stress processes won't cause the containers to stop.
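
For illustration, a hedged sketch of stopping only the stress process inside a container while leaving the container itself running; it assumes pkill is available inside the loader image, and the function name is made up:

# Sketch: kill the cassandra-stress process, keep the container alive
def stop_stress_process(loader, container_id):
    loader.remoter.run(
        cmd=f"docker exec {container_id} pkill -f cassandra-stress",
        verbose=True, ignore_status=True)
    # The container keeps running because its main process is the idle
    # 'tail -f /dev/null', not the stress process that was just killed.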

fruch commented 1 year ago

You are missing my point: how is the teardown going to know how to stop those processes? Where are all the currently running processes (i.e. the loader-related ones) listed?

vponomaryov commented 1 year ago

We do not need to close it explicitly anywhere, including in the tearDown. As I mentioned above, if we use the __del__ method, it will run automatically during garbage collection. That is enough for our case.
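
A tiny plain-Python illustration of that behaviour (not SCT code): in CPython the __del__ method runs as soon as the last reference to the object is dropped, though objects that are still referenced elsewhere, or caught in reference cycles, are only finalized later.

# Plain-Python demonstration that __del__ runs on garbage collection
class StressThreadStub:
    def __del__(self):
        print("cleaning up containers of this stress thread")

thread = StressThreadStub()
del thread  # refcount drops to zero, __del__ runs here and prints the message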

fruch commented 1 year ago

So, in theory, it's enough to just remove the kill_stress part? What would you do in the __del__?