dockerd can be restarted on the loader in gemini jobs when several stress tools run on the same loader

Prerequisites

[x] Are you rebased to master ?
[x] Is it reproducible ?
[x] Did you perform a cursory search if this issue isn't opened ?

Versions

SCT: master/branch-5.2
scylla: branch-5.2

Logs

test_id: e0a42551-4eae-4085-aa00-a5cc34e9e202
job log: https://cloudius-jenkins-test.s3.amazonaws.com/e0a42551-4eae-4085-aa00-a5cc34e9e202/20230322_151857/sct-runner-e0a42551.tar.gz, https://cloudius-jenkins-test.s3.amazonaws.com/e0a42551-4eae-4085-aa00-a5cc34e9e202/20230322_151857/loader-set-e0a42551.tar.gz

Description

All stress tools run in docker. For gemini jobs we use 1 loader. Sometimes a nemesis generates additional keyspaces with c-s or scylla-bench (e.g. NoCorruptRepair). Each such process starts its own docker container, and sometimes this can lead to an OOM kill of all docker containers on the loader, or to the dockerd process being restarted, as happened on these two jobs: https://jenkins.scylladb.com/job/scylla-5.2/job/gemini-/job/gemini-3h-with-nemesis-test/9 https://jenkins.scylladb.com/job/scylla-5.2/job/gemini-/job/gemini-3h-with-nemesis-test/8

Steps to Reproduce

Run a gemini job with a nemesis that also starts other stress tools.

Actual behavior: dockerd could be restarted or killed.

As a workaround, the loader instance size could be increased.

We've seen lots of issues like those, but I'm not sure it's connected to other stress commands running.

I think we are using centos7 with a specific old docker version, which might be the cause.

The thing I can't explain is why this happens mostly in Gemini jobs

I haven't debugged the current issue, but from the description I assume that the situation described in another SCT bug also plays a role here: https://github.com/scylladb/scylla-cluster-tests/issues/5945

The situation: when SCT kills any stress thread, it also kills all the docker containers on all the loader nodes:

# sdcm/cluster.py
class BaseLoaderSet():
    def kill_stress_thread(self):
        if self.nodes and self.nodes[0].is_kubernetes():
            for node in self.nodes:
                node.remoter.stop()
        else:
            if self.params.get("use_prepared_loaders"):
                self.kill_cassandra_stress_thread()
            else:
                self.kill_docker_loaders()  # <--

    def kill_docker_loaders(self):
        for loader in self.nodes:
            try:
                loader.remoter.run(cmd='docker ps -a -q | xargs docker rm -f', verbose=True, ignore_status=True)  # <--
                self.log.info("Killed docker loader on node: %s", loader.name)  # <--
            except Exception as ex:  # pylint: disable=broad-except
                self.log.warning("failed to kill docker stress command on [%s]: [%s]",
                                 str(loader), str(ex))

If so, setting the use_prepared_loaders: true option should work around the problem. So, @aleksbykov, you can try setting this option and see whether the issue still reproduces.
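To illustrate the concern about `docker ps -a -q | xargs docker rm -f` removing every container on the loader, here is a minimal sketch of a narrower cleanup. It is not SCT code: the function name and the `stress_thread_id` label are hypothetical, and it assumes each stress container was started with `docker run --label stress_thread_id=<id>`. The function only builds the shell command string; it does not run anything.

```python
import shlex


def build_scoped_kill_cmd(label_key: str, label_value: str) -> str:
    """Build a shell command that removes only containers carrying a given label.

    Hypothetical helper: assumes containers were started with
    `docker run --label <label_key>=<label_value> ...`.
    """
    label_filter = shlex.quote(f"label={label_key}={label_value}")
    # `docker ps --filter label=key=value` lists only matching containers;
    # `xargs -r` (GNU xargs) skips `docker rm` when nothing matched.
    return f"docker ps -a -q --filter {label_filter} | xargs -r docker rm -f"


if __name__ == "__main__":
    print(build_scoped_kill_cmd("stress_thread_id", "cs-1234"))
```

With a command like this, stopping one stress thread would leave containers of other stress tools on the same loader untouched.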

@vponomaryov I don't think this is the case: kill_stress_thread is called only on teardown, and it doesn't cause dockerd to crash or OOM.

Also, I'm not sure Gemini would work as expected with use_prepared_loaders; it's a feature only for c-s, not for the other tools.

It is not always reproduced. It looks like it depends on the schema generated by gemini. The last run of the job passed. What do you think, could a memory limit for the stress docker containers also help with this? Right now we run docker with all the memory available on the host.
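As a rough sketch of the memory-limit idea, the snippet below splits the loader's memory between the stress containers expected to run at the same time and turns the result into `docker run` arguments. The function names, the 1024 MB reserve for the OS and dockerd, and the numbers in the example are assumptions for illustration, not values taken from SCT. `--memory` and `--memory-swap` are standard `docker run` options.

```python
from typing import List


def per_container_memory_mb(host_memory_mb: int, max_containers: int,
                            reserved_mb: int = 1024) -> int:
    """Split host memory evenly between containers, keeping a reserve for the host."""
    if max_containers < 1:
        raise ValueError("max_containers must be at least 1")
    usable_mb = host_memory_mb - reserved_mb
    if usable_mb <= 0:
        raise ValueError("reserved_mb leaves no memory for containers")
    return usable_mb // max_containers


def docker_run_memory_args(limit_mb: int) -> List[str]:
    """Return `docker run` arguments that cap a container's memory."""
    # --memory caps RAM; setting --memory-swap to the same value
    # prevents the container from using swap on top of that.
    return [f"--memory={limit_mb}m", f"--memory-swap={limit_mb}m"]


if __name__ == "__main__":
    # Assumed example: 16 GB loader, gemini plus two nemesis stress tools.
    limit = per_container_memory_mb(host_memory_mb=16384, max_containers=3)
    print(" ".join(["docker", "run"] + docker_run_memory_args(limit) + ["<image>"]))
```

With a limit in place, a container that exceeds its share would be OOM-killed on its own, which may be easier to diagnose than a host-level OOM that takes down dockerd or every container.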
