Closed. fruch closed this issue 6 months ago.
Looking into those, it seems like we are aborting only one of the repair sessions:
< t:2024-03-03 03:48:04,350 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > [2024-03-03 03:45:50,906] Starting repair command #1570, repairing 1 ranges for keyspace system_auth (parallelism=SEQUENTIAL, full=true)
< t:2024-03-03 03:48:04,350 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > [2024-03-03 03:45:50,906] Repair session 1570
< t:2024-03-03 03:48:04,350 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > [2024-03-03 03:45:52,118] Repair session 1570 failed
< t:2024-03-03 03:48:04,350 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > [2024-03-03 03:45:52,120] Starting repair command #1571, repairing 1 ranges for keyspace system_distributed_everywhere (parallelism=SEQUENTIAL, full=true)
and the repair just keeps going, taking more than 2 minutes.
I suspect that the new `scylla nodetool` is now being used, and it behaves a bit differently than the original nodetool command. My guess is that the original nodetool command stopped everything on the first failure, while the new version keeps going and reports the failure at the end.
So now when we call http://127.0.0.1:10000/storage_service/force_terminate_repair, only one repair session is stopped.

@tchaikov @denesb, do we have a way to abort all of the repair sessions in the currently running `scylla nodetool repair`?
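For context, a minimal sketch of what such an abort call could look like from the test side; the endpoint path is the one quoted above, while the helper name and the use of `requests` are just illustrative assumptions, not SCT code:

```python
# Hypothetical sketch (not SCT code): aborting an ongoing repair through the
# Scylla REST API. The endpoint path comes from the comment above; the node
# address, port and helper name are assumptions for illustration.
import requests

def force_terminate_repair(node_ip: str = "127.0.0.1", api_port: int = 10000) -> None:
    """Ask the node to terminate its ongoing repair."""
    url = f"http://{node_ip}:{api_port}/storage_service/force_terminate_repair"
    resp = requests.post(url, timeout=30)
    resp.raise_for_status()
```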
> I suspect that the new `scylla nodetool` is now being used, and it behaves a bit differently than the original nodetool command.

Yes, the repair command was recently implemented and it is possible that SCT started using it. I tried to mimic the behaviour of Origin's `nodetool repair` as closely as possible, but I may have missed some details around failure handling.
> My guess is that the original nodetool command stopped everything on the first failure, while the new version keeps going and reports the failure at the end.

I can confirm that the new nodetool does continue; it just collects the results of all repairs and reports them at the end.
> So now when we call http://127.0.0.1:10000/storage_service/force_terminate_repair, only one repair session is stopped. @tchaikov @denesb, do we have a way to abort all of the repair sessions in the currently running `scylla nodetool repair`?

I will adapt the new implementation to follow the behaviour of Origin and stop on the first failure.
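To make the difference concrete, here is an illustrative Python-style sketch of the two failure-handling policies being discussed; `run_session` is a hypothetical callable standing in for whatever actually repairs one keyspace, not a real nodetool function:

```python
# Illustrative pseudocode only, not the actual nodetool implementation.
# `run_session(keyspace)` is a stand-in that returns True on success.

def repair_stop_on_first_failure(keyspaces, run_session):
    # Origin-style behaviour: abort the whole run as soon as one session fails.
    for ks in keyspaces:
        if not run_session(ks):
            raise RuntimeError(f"repair of {ks} failed, aborting remaining sessions")

def repair_collect_and_report(keyspaces, run_session):
    # Behaviour of the new `scylla nodetool repair` described above:
    # keep going and report all failures together at the end.
    failed = [ks for ks in keyspaces if not run_session(ks)]
    if failed:
        raise RuntimeError("repair failed for keyspaces: " + ", ".join(failed))
```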
Hard to tell which path is correct, but if we have a "requirement" that the force_terminate_repair API stops a repair completely, then yes, we should stop. But maybe abort only when the error is this user-requested termination, while other possible failures don't stop it? Continuing sounds like a desirable behavior in general when hitting any other error.
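A rough sketch of that suggested policy, purely for illustration; the exception types and session objects here are hypothetical, not Scylla code:

```python
# Purely illustrative: keep repairing through "ordinary" per-session failures,
# but stop everything if the failure was an explicit user-requested abort.

class RepairError(Exception):
    """A repair session failed for an ordinary reason (e.g. a node was down)."""

class RepairAborted(RepairError):
    """A repair session was terminated by an explicit user request."""

def run_all_sessions(sessions):
    failures = []
    for session in sessions:
        try:
            session.run()
        except RepairAborted:
            # force_terminate_repair was called: stop the whole run at once.
            raise
        except RepairError as err:
            # Any other failure: remember it and keep repairing the rest.
            failures.append((session, err))
    return failures
```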
The API endpoint just returns one of 3 statuses: RUNNING, SUCCESSFUL, FAILED. There is no distinction as to why it failed. So for now I will just change `scylla nodetool repair` to follow Origin, and we can revisit this in the future if required.
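For illustration, a hedged sketch of polling such a status until the repair finishes; only the three status values come from the comment above, while the URL layout and helper name are assumptions:

```python
# Sketch of waiting for a repair's final state, assuming an endpoint that
# returns one of RUNNING / SUCCESSFUL / FAILED as described above.
# The status URL and helper name are assumptions for illustration.
import time
import requests

def wait_for_repair(status_url: str, poll_interval: float = 1.0) -> str:
    """Poll until the repair leaves the RUNNING state, then return the final status."""
    while True:
        status = requests.get(status_url, timeout=30).text.strip().strip('"')
        if status != "RUNNING":
            return status  # SUCCESSFUL or FAILED
        time.sleep(poll_interval)
```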
Fix is here: https://github.com/scylladb/scylladb/pull/17678
But it will need some time to get in, as it depends on another PR, which is not queued yet.
Should be fixed by https://github.com/scylladb/scylladb/commit/566223c34a161b08c666f3658b41fafff0f69271. @fruch please verify and close.
This is fixed.

In several cases I've run into this nemesis timing out on the repair thread:
Packages
Scylla version: 5.5.0~dev-20240218.9d666f7d29cb with build-id 5ab8f82c4cd898fdf3fcdbe277727ef9fd1f554b
Kernel Version: 5.15.0-1053-aws
Issue description
Impact: Nemesis is failing with not enough information.
How frequently does it reproduce? Seen twice.
Installation details
Cluster size: 4 nodes (i3en.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-04a20d0ee501653e6 (aws: undefined_region)
Test: longevity-twcs-48h-test
Test id: dbe99a50-2f1b-4353-8219-9923581e8cff
Test name: scylla-master/teir1/longevity-twcs-48h-test
Test config file(s):

Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor dbe99a50-2f1b-4353-8219-9923581e8cff`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=dbe99a50-2f1b-4353-8219-9923581e8cff)
- Show all stored logs command: `$ hydra investigate show-logs dbe99a50-2f1b-4353-8219-9923581e8cff`

## Logs:
- **db-cluster-dbe99a50.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/dbe99a50-2f1b-4353-8219-9923581e8cff/20240220_162815/db-cluster-dbe99a50.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/dbe99a50-2f1b-4353-8219-9923581e8cff/20240220_162815/db-cluster-dbe99a50.tar.gz)
- **sct-runner-events-dbe99a50.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/dbe99a50-2f1b-4353-8219-9923581e8cff/20240220_162815/sct-runner-events-dbe99a50.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/dbe99a50-2f1b-4353-8219-9923581e8cff/20240220_162815/sct-runner-events-dbe99a50.tar.gz)
- **sct-dbe99a50.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/dbe99a50-2f1b-4353-8219-9923581e8cff/20240220_162815/sct-dbe99a50.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/dbe99a50-2f1b-4353-8219-9923581e8cff/20240220_162815/sct-dbe99a50.log.tar.gz)
- **loader-set-dbe99a50.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/dbe99a50-2f1b-4353-8219-9923581e8cff/20240220_162815/loader-set-dbe99a50.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/dbe99a50-2f1b-4353-8219-9923581e8cff/20240220_162815/loader-set-dbe99a50.tar.gz)
- **monitor-set-dbe99a50.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/dbe99a50-2f1b-4353-8219-9923581e8cff/20240220_162815/monitor-set-dbe99a50.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/dbe99a50-2f1b-4353-8219-9923581e8cff/20240220_162815/monitor-set-dbe99a50.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/teir1/job/longevity-twcs-48h-test/2/)
[Argus](https://argus.scylladb.com/test/e5978242-3a13-42e0-ad20-6be393f00d5d/runs?additionalRuns[]=dbe99a50-2f1b-4353-8219-9923581e8cff)

2nd time seen:
Packages
Scylla version: 5.5.0~dev-20240218.9d666f7d29cb with build-id 5ab8f82c4cd898fdf3fcdbe277727ef9fd1f554b
Kernel Version: 5.15.0-1053-aws
Installation details
Cluster size: 4 nodes (i4i.4xlarge)
Scylla Nodes used in this run:
OS / Image: ami-04a20d0ee501653e6 (aws: undefined_region)
Test: longevity-150gb-asymmetric-cluster-12h-test
Test id: 1ed02fa2-1294-471b-bf94-f43fb10b018d
Test name: scylla-master/teir1/longevity-150gb-asymmetric-cluster-12h-test
Test config file(s):

Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor 1ed02fa2-1294-471b-bf94-f43fb10b018d`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=1ed02fa2-1294-471b-bf94-f43fb10b018d)
- Show all stored logs command: `$ hydra investigate show-logs 1ed02fa2-1294-471b-bf94-f43fb10b018d`

## Logs:
- **db-cluster-1ed02fa2.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/1ed02fa2-1294-471b-bf94-f43fb10b018d/20240220_164928/db-cluster-1ed02fa2.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/1ed02fa2-1294-471b-bf94-f43fb10b018d/20240220_164928/db-cluster-1ed02fa2.tar.gz)
- **sct-runner-events-1ed02fa2.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/1ed02fa2-1294-471b-bf94-f43fb10b018d/20240220_164928/sct-runner-events-1ed02fa2.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/1ed02fa2-1294-471b-bf94-f43fb10b018d/20240220_164928/sct-runner-events-1ed02fa2.tar.gz)
- **sct-1ed02fa2.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/1ed02fa2-1294-471b-bf94-f43fb10b018d/20240220_164928/sct-1ed02fa2.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/1ed02fa2-1294-471b-bf94-f43fb10b018d/20240220_164928/sct-1ed02fa2.log.tar.gz)
- **loader-set-1ed02fa2.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/1ed02fa2-1294-471b-bf94-f43fb10b018d/20240220_164928/loader-set-1ed02fa2.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/1ed02fa2-1294-471b-bf94-f43fb10b018d/20240220_164928/loader-set-1ed02fa2.tar.gz)
- **monitor-set-1ed02fa2.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/1ed02fa2-1294-471b-bf94-f43fb10b018d/20240220_164928/monitor-set-1ed02fa2.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/1ed02fa2-1294-471b-bf94-f43fb10b018d/20240220_164928/monitor-set-1ed02fa2.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/teir1/job/longevity-150gb-asymmetric-cluster-12h-test/2/)
[Argus](https://argus.scylladb.com/test/a1ca282f-6bbe-4e16-bec1-c296dc90fc0b/runs?additionalRuns[]=1ed02fa2-1294-471b-bf94-f43fb10b018d)