Open KnifeyMoloko opened 2 years ago
I don't remember exactly which one, but there was a recent issue about the stress tool returning upon termination. IIUC, the best thing we can do in this case is to ignore (filter out, or decrease the severity of) any event, especially from the loaders, from the moment we have a critical event.
I would start by adding enough debugging information about the Java processes before and after the kill. I'm guessing we are missing some of the processes, or maybe using the wrong signal.
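One way to get that visibility is to capture the process table (e.g. `pgrep -af java` output) on the loader before and after the kill and diff the two snapshots. The helper below is a sketch; its name and the capture mechanism are illustrative, not SCT's actual API:

```python
# Hypothetical helper: diff two process snapshots taken before and after
# the kill, so the SCT log shows exactly which PIDs survived the signal.
# Each snapshot is `pgrep -af`-style output: one "PID CMDLINE" per line.

def diff_process_snapshots(before: str, after: str) -> dict:
    """Return which PIDs were killed and which survived between snapshots."""
    def parse(snapshot: str) -> dict:
        procs = {}
        for line in snapshot.strip().splitlines():
            pid, _, cmdline = line.partition(' ')
            if pid.isdigit():
                procs[int(pid)] = cmdline
        return procs

    before_procs, after_procs = parse(before), parse(after)
    return {
        'killed': {pid: cmd for pid, cmd in before_procs.items()
                   if pid not in after_procs},
        'survived': {pid: cmd for pid, cmd in before_procs.items()
                     if pid in after_procs},
    }

# Example with fabricated snapshot data:
before = "101 java -jar cassandra-stress.jar\n202 java -version"
after = "202 java -version"
result = diff_process_snapshots(before, after)
```

Logging `result['survived']` right after the kill would immediately show whether the TERM signal was delivered to the wrong set of processes.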
Worth checking whether we get a warning like this:
- failed to kill stress-command on ...

We used three pgrep commands to search for the cassandra-stress process IDs; we need to check whether they are sufficient:

def kill_cassandra_stress_thread(self):
    search_cmds = [
        'pgrep -f .*cassandra.*',
        'pgrep -f cassandra.stress',
        'pgrep -f cassandra-stress'
    ]
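Note that `pgrep -f` matches a regex against the full command line, so the three patterns overlap, and the first one is much broader than the other two. A quick local check with Python's `re` (the command lines below are fabricated examples) shows `.*cassandra.*` also matching processes that merely mention "cassandra", such as a tail on a log file:

```python
import re

# The three pgrep patterns from the snippet above. `pgrep -f` matches the
# pattern as a regex against the whole command line, which `re.search`
# approximates here.
patterns = ['.*cassandra.*', 'cassandra.stress', 'cassandra-stress']

def matching_patterns(cmdline: str) -> list:
    """Return the patterns that would match the given command line."""
    return [p for p in patterns if re.search(p, cmdline)]

# The actual stress process matches all three patterns:
stress_cmd = 'java -jar cassandra-stress.jar write'
# But an unrelated process is caught by the broad first pattern:
tail_cmd = 'tail -f /var/log/cassandra/system.log'
```

So the three commands are redundant for the stress process itself, while the broad pattern risks signalling unrelated processes; that is worth keeping in mind when reviewing whether the search is "perfect".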
loader.remoter.run(cmd=f'{filter_cmd} | xargs -I{{}} kill -TERM {{}}', verbose=True, ignore_status=True)
- the status is ignored
- we assume the output of filter_cmd is standardized; this can be confirmed in the verbose log
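Because the status is ignored, a TERM that fails to stop the process goes unnoticed. A local sketch of an escalation approach, using a plain `subprocess` rather than the remoter-based code above: send SIGTERM, wait a grace period, and fall back to SIGKILL only if the process is still alive:

```python
import subprocess

def kill_with_escalation(proc: subprocess.Popen, grace: float = 2.0) -> str:
    """Send SIGTERM; if the process survives `grace` seconds, send SIGKILL."""
    proc.terminate()                      # SIGTERM: polite shutdown request
    try:
        proc.wait(timeout=grace)
        return 'terminated'
    except subprocess.TimeoutExpired:
        proc.kill()                       # SIGKILL: cannot be ignored
        proc.wait()
        return 'killed'

# A cooperative process exits on SIGTERM within the grace period:
proc = subprocess.Popen(['sleep', '60'])
outcome = kill_with_escalation(proc)
```

Something equivalent on the loader side (check the exit status of the kill pipeline, re-check `pgrep`, then escalate) would have surfaced the surviving stress processes instead of silently moving on with teardown.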
Thank you @amoskong for the feedback.
Installation details

- Kernel Version: 5.13.0-1031-aws
- Scylla version (or git commit hash): 5.0.0-20220711.1ad59d6a7 with build-id 3b91e26d78f1705b7066f9deb1cb73a3c03956ff
- Cluster size: 6 nodes (i3.4xlarge)
- Scylla Nodes used in this run:
- OS / Image: ami-085f2a90a0f67349a (aws: us-east-1)
- Test: longevity-10gb-3h-test
- Test id: 2378acf8-7be9-430c-a8df-9e3f8f8131a6
- Test name: scylla-5.0/longevity/longevity-10gb-3h-test
- Test config file(s):

Issue description
During the test run one of our db nodes went down due to a SpotTerminationError and started the teardown process for the test. Part of that process is the kill_stress_thread method, which should kill the cassandra-stress processes on the loaders (sct.log). Six minutes after that the loaders were still running the stress load (sct.log).
$ hydra investigate show-monitor 2378acf8-7be9-430c-a8df-9e3f8f8131a6
$ hydra investigate show-logs 2378acf8-7be9-430c-a8df-9e3f8f8131a6
Logs:
Jenkins job URL