Open aleksbykov opened 1 year ago
We've seen lots of issues like this, but I'm not sure they're connected to other stress commands running.
I think our use of CentOS 7 with a specific old Docker version might be the cause.
The thing I can't explain is why this happens mostly in Gemini jobs.
I haven't debugged the current issue, but from the description I assume that the situation described in another SCT bug also plays a role here: https://github.com/scylladb/scylla-cluster-tests/issues/5945
The situation: when SCT kills any stress thread, it also kills all the Docker containers on all the loader nodes:
# sdcm/cluster.py
class BaseLoaderSet():
    def kill_stress_thread(self):
        if self.nodes and self.nodes[0].is_kubernetes():
            for node in self.nodes:
                node.remoter.stop()
        else:
            if self.params.get("use_prepared_loaders"):
                self.kill_cassandra_stress_thread()
            else:
->              self.kill_docker_loaders()

    def kill_docker_loaders(self):
        for loader in self.nodes:
            try:
->              loader.remoter.run(cmd='docker ps -a -q | xargs docker rm -f', verbose=True, ignore_status=True)
->              self.log.info("Killed docker loader on node: %s", loader.name)
            except Exception as ex:  # pylint: disable=broad-except
                self.log.warning("failed to kill docker stress command on [%s]: [%s]",
                                 str(loader), str(ex))
If so, setting the use_prepared_loaders: true option should work around the problem.
So, @aleksbykov, you can try setting this option and reproducing the issue with it.
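To illustrate the difference between the blanket cleanup quoted above and a narrower one, here is a minimal sketch that builds a cleanup command restricted to containers carrying a given label. The helper name `build_kill_cmd` and the label `sct.stress_id` are hypothetical and not part of SCT; the sketch relies only on the standard Docker CLI features `docker ps --filter label=...` and `docker rm -f`.

```python
import shlex
from typing import Optional


def build_kill_cmd(label: Optional[str] = None) -> str:
    """Return a shell command that force-removes Docker containers.

    With label=None this yields the blanket command quoted above.
    With a label (hypothetical, e.g. "sct.stress_id=abc"), only the
    containers carrying that label are listed and removed.
    """
    list_cmd = "docker ps -a -q"
    if label is not None:
        # Quote the filter so unusual label values cannot break the shell command.
        list_cmd += " --filter " + shlex.quote(f"label={label}")
    return f"{list_cmd} | xargs docker rm -f"


if __name__ == "__main__":
    print(build_kill_cmd())
    print(build_kill_cmd("sct.stress_id=abc"))
```

This assumes each stress container would be started with a matching `--label`, which the code quoted above does not show.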
@vponomaryov I don't think this is the case: kill_stress_thread is called only on teardown, and it doesn't cause dockerd to crash or OOM.
Also, I'm not sure Gemini would work as expected with use_prepared_loaders;
it's a feature only for c-s, not for the other tools.
It is not always reproduced. It looks like it depends on the schema generated by Gemini. The last run of the job passed. What do you think, could a memory limit for the stress Docker containers also help with this? Right now each container can use all the available memory on the host.
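As a rough sketch of the memory-limit idea, the helper below computes a per-container `--memory` value for `docker run` by splitting host memory, minus a reserve for dockerd and the OS, across the expected number of concurrent stress containers. The function name, the 20% reserve, and the container count are illustrative assumptions, not SCT settings; only the `--memory` flag itself is a standard `docker run` option. Whether such a limit actually prevents the dockerd restarts has not been established in this thread.

```python
def docker_memory_flag(host_mem_bytes: int, max_containers: int,
                       reserve_percent: int = 20) -> str:
    """Build a `--memory=<N>m` flag for `docker run`.

    host_mem_bytes:  total memory on the loader host, in bytes.
    max_containers:  assumed number of stress containers running at once.
    reserve_percent: share of host memory kept for dockerd and the OS
                     (illustrative default).
    """
    if max_containers < 1:
        raise ValueError("max_containers must be >= 1")
    if not 0 <= reserve_percent < 100:
        raise ValueError("reserve_percent must be in [0, 100)")
    # Integer arithmetic keeps the result deterministic.
    usable = host_mem_bytes * (100 - reserve_percent) // 100
    per_container_mb = usable // max_containers // (1024 * 1024)
    return f"--memory={per_container_mb}m"


if __name__ == "__main__":
    # Example: a 16 GiB loader expected to run up to 4 stress containers.
    print(docker_memory_flag(16 * 1024**3, 4))
```

With a limit in place, a container that exceeds its share would be the one killed by the kernel OOM killer, instead of the whole loader running out of memory.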
Prerequisites
Versions
Logs
Description
All stress tools run in Docker. For Gemini jobs, we use one loader. Sometimes a nemesis (e.g. NoCorruptRepair) generates additional keyspaces with c-s or scylla-bench. Each such process starts a Docker container, and sometimes this can lead to an OOM kill of all Docker containers on the loader, or the dockerd process can be restarted, as happened in these two jobs: https://jenkins.scylladb.com/job/scylla-5.2/job/gemini-/job/gemini-3h-with-nemesis-test/9 https://jenkins.scylladb.com/job/scylla-5.2/job/gemini-/job/gemini-3h-with-nemesis-test/8
Steps to Reproduce
Actual behavior: dockerd could be restarted or killed.
As a workaround, the loader instance size could be increased.