scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

AbortRepair getting stuck for 2min on the thread running the repair #7226

Closed: fruch closed this issue 6 months ago

fruch commented 8 months ago

In several runs I've hit a case where this nemesis times out waiting on the thread running the repair:

2024-02-20 14:13:18.062: (DisruptionEvent Severity.ERROR) period_type=end event_id=04d585aa-9cbc-47f3-8200-2c8ce7329df7 duration=45m27s: nemesis_name=AbortRepair target_node=Node longevity-twcs-48h-master-db-node-dbe99a50-7 [3.250.99.117 | 10.4.11.215] (seed: True) errors=
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5113, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3057, in disrupt_abort_repair
    thread.result(timeout=120)
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 460, in result
    raise TimeoutError()
concurrent.futures._base.TimeoutError
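
A minimal sketch of the pattern that times out (helper names are hypothetical; the real `disrupt_abort_repair` in `sdcm/nemesis.py` does more):

```python
# A minimal, self-contained sketch of the failing pattern above
# (helper names are hypothetical; the real disrupt_abort_repair in
# sdcm/nemesis.py runs `nodetool repair` on the target node).
import concurrent.futures


def run_nodetool_repair():
    # placeholder for running `nodetool repair` on the target node
    pass


def abort_repair_via_rest_api():
    # placeholder for POSTing to /storage_service/force_terminate_repair
    pass


with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
    thread = executor.submit(run_nodetool_repair)
    abort_repair_via_rest_api()
    # Raises concurrent.futures.TimeoutError if the repair command keeps
    # running past 120 seconds, which is the failure reported above.
    thread.result(timeout=120)
```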

Packages

Scylla version: 5.5.0~dev-20240218.9d666f7d29cb with build-id 5ab8f82c4cd898fdf3fcdbe277727ef9fd1f554b

Kernel Version: 5.15.0-1053-aws

Issue description

Impact

The nemesis fails without providing enough information.

How frequently does it reproduce?

Seen twice so far.

Installation details

Cluster size: 4 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-04a20d0ee501653e6 (aws: undefined_region)

Test: longevity-twcs-48h-test Test id: dbe99a50-2f1b-4353-8219-9923581e8cff Test name: scylla-master/teir1/longevity-twcs-48h-test Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor dbe99a50-2f1b-4353-8219-9923581e8cff`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=dbe99a50-2f1b-4353-8219-9923581e8cff)
- Show all stored logs command: `$ hydra investigate show-logs dbe99a50-2f1b-4353-8219-9923581e8cff`

Logs:

- **db-cluster-dbe99a50.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/dbe99a50-2f1b-4353-8219-9923581e8cff/20240220_162815/db-cluster-dbe99a50.tar.gz
- **sct-runner-events-dbe99a50.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/dbe99a50-2f1b-4353-8219-9923581e8cff/20240220_162815/sct-runner-events-dbe99a50.tar.gz
- **sct-dbe99a50.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/dbe99a50-2f1b-4353-8219-9923581e8cff/20240220_162815/sct-dbe99a50.log.tar.gz
- **loader-set-dbe99a50.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/dbe99a50-2f1b-4353-8219-9923581e8cff/20240220_162815/loader-set-dbe99a50.tar.gz
- **monitor-set-dbe99a50.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/dbe99a50-2f1b-4353-8219-9923581e8cff/20240220_162815/monitor-set-dbe99a50.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/teir1/job/longevity-twcs-48h-test/2/)
[Argus](https://argus.scylladb.com/test/e5978242-3a13-42e0-ad20-6be393f00d5d/runs?additionalRuns[]=dbe99a50-2f1b-4353-8219-9923581e8cff)

Second occurrence:

Packages

Scylla version: 5.5.0~dev-20240218.9d666f7d29cb with build-id 5ab8f82c4cd898fdf3fcdbe277727ef9fd1f554b

Kernel Version: 5.15.0-1053-aws

Installation details

Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-04a20d0ee501653e6 (aws: undefined_region)

Test: longevity-150gb-asymmetric-cluster-12h-test Test id: 1ed02fa2-1294-471b-bf94-f43fb10b018d Test name: scylla-master/teir1/longevity-150gb-asymmetric-cluster-12h-test Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 1ed02fa2-1294-471b-bf94-f43fb10b018d`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=1ed02fa2-1294-471b-bf94-f43fb10b018d)
- Show all stored logs command: `$ hydra investigate show-logs 1ed02fa2-1294-471b-bf94-f43fb10b018d`

Logs:

- **db-cluster-1ed02fa2.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1ed02fa2-1294-471b-bf94-f43fb10b018d/20240220_164928/db-cluster-1ed02fa2.tar.gz
- **sct-runner-events-1ed02fa2.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1ed02fa2-1294-471b-bf94-f43fb10b018d/20240220_164928/sct-runner-events-1ed02fa2.tar.gz
- **sct-1ed02fa2.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1ed02fa2-1294-471b-bf94-f43fb10b018d/20240220_164928/sct-1ed02fa2.log.tar.gz
- **loader-set-1ed02fa2.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1ed02fa2-1294-471b-bf94-f43fb10b018d/20240220_164928/loader-set-1ed02fa2.tar.gz
- **monitor-set-1ed02fa2.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1ed02fa2-1294-471b-bf94-f43fb10b018d/20240220_164928/monitor-set-1ed02fa2.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/teir1/job/longevity-150gb-asymmetric-cluster-12h-test/2/)
[Argus](https://argus.scylladb.com/test/a1ca282f-6bbe-4e16-bec1-c296dc90fc0b/runs?additionalRuns[]=1ed02fa2-1294-471b-bf94-f43fb10b018d)
fruch commented 8 months ago

Looking into those, it seems we are aborting only one of the repair sessions:

< t:2024-03-03 03:48:04,350 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > [2024-03-03 03:45:50,906] Starting repair command #1570, repairing 1 ranges for keyspace system_auth (parallelism=SEQUENTIAL, full=true)
< t:2024-03-03 03:48:04,350 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > [2024-03-03 03:45:50,906] Repair session 1570
< t:2024-03-03 03:48:04,350 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > [2024-03-03 03:45:52,118] Repair session 1570 failed
< t:2024-03-03 03:48:04,350 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > [2024-03-03 03:45:52,120] Starting repair command #1571, repairing 1 ranges for keyspace system_distributed_everywhere (parallelism=SEQUENTIAL, full=true)

and the repair just keeps going, taking more than 2 minutes.

I suspect that the new scylla nodetool is now being used, and it behaves a bit differently than the original nodetool command. My guess is that the original nodetool command stopped everything on the first failure, while the new version keeps going and reports the failure at the end.

So now when we call http://127.0.0.1:10000/storage_service/force_terminate_repair, only one repair session is stopped.
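
For illustration, the abort call looks roughly like the following (a minimal sketch, assuming the endpoint accepts a plain POST with no body; the actual SCT helper may differ):

```python
# Sketch: terminate the ongoing repair through the local Scylla REST API
# (assumes the API listens on the default 127.0.0.1:10000 and that the
# endpoint takes a plain POST with no body).
import requests

resp = requests.post(
    "http://127.0.0.1:10000/storage_service/force_terminate_repair",
    timeout=30,
)
resp.raise_for_status()
```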

@tchaikov @denesb, do we have a way to abort all repair sessions in the currently running scylla nodetool repair?

denesb commented 7 months ago

> I suspect that the new scylla nodetool is now being used, and it behaves a bit differently than the original nodetool command.

Yes, the repair command was recently implemented and it is possible that SCT started using it. I tried to mimic the behaviour of Origin's nodetool repair as closely as possible, but I may have missed some details around failure handling.

> My guess is that the original nodetool command stopped everything on the first failure, while the new version keeps going and reports the failure at the end.

I can confirm that the new nodetool does continue: it just collects the results of all repairs and reports them at the end.

> So now when we call http://127.0.0.1:10000/storage_service/force_terminate_repair, only one repair session is stopped.

> @tchaikov @denesb, do we have a way to abort all repair sessions in the currently running scylla nodetool repair?

I will adapt the new implementation to follow the behaviour of Origin and stop on the first failure.
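
To illustrate the difference in plain Python (a sketch only; scylla-nodetool itself is implemented in C++, and these function names are made up):

```python
# Illustration of the two failure-handling strategies discussed here
# (Python sketch only; scylla-nodetool itself is implemented in C++).

def repair_stop_on_first_failure(sessions):
    """Origin-style: abort the remaining sessions on the first failure."""
    for index, run_session in enumerate(sessions):
        if not run_session():  # each session callable returns True on success
            raise RuntimeError(f"repair session {index} failed, aborting the rest")


def repair_collect_failures(sessions):
    """New scylla-nodetool style: run everything, report failures at the end."""
    failures = [index for index, run_session in enumerate(sessions) if not run_session()]
    if failures:
        raise RuntimeError(f"repair sessions {failures} failed")
```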

fruch commented 7 months ago

> > I suspect that the new scylla nodetool is now being used, and it behaves a bit differently than the original nodetool command.
>
> Yes, the repair command was recently implemented and it is possible that SCT started using it. I tried to mimic the behaviour of Origin's nodetool repair as closely as possible, but I may have missed some details around failure handling.
>
> > My guess is that the original nodetool command stopped everything on the first failure, while the new version keeps going and reports the failure at the end.
>
> I can confirm that the new nodetool does continue: it just collects the results of all repairs and reports them at the end.
>
> > So now when we call http://127.0.0.1:10000/storage_service/force_terminate_repair, only one repair session is stopped. @tchaikov @denesb, do we have a way to abort all repair sessions in the currently running scylla nodetool repair?
>
> I will adapt the new implementation to follow the behaviour of Origin and stop on the first failure.

Hard to tell which path is correct, but if we have a "requirement" that the force_terminate_repair API stops a repair completely, then yes, we should stop. But maybe only when the error is this "user-requested" abort, while other possible failures would not stop it?

Because keeping going sounds like a desirable behavior in general for any other error.
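
A purely hypothetical sketch of this suggestion (the `ok` and `aborted_by_user` fields are made up; as noted in the following comment, the current status API does not expose why a session failed):

```python
# Hypothetical variant of the proposal: keep going on ordinary failures,
# but stop the whole run when a session was aborted by user request
# (e.g. via force_terminate_repair). The `ok` / `aborted_by_user` fields
# are made up; the current API does not report why a session failed.
def repair_stop_only_on_user_abort(sessions):
    failures = []
    for index, run_session in enumerate(sessions):
        result = run_session()  # hypothetical result object
        if result.aborted_by_user:
            raise RuntimeError("repair aborted by user request")
        if not result.ok:
            failures.append(index)  # other failures do not stop the run
    if failures:
        raise RuntimeError(f"repair sessions {failures} failed")
```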

denesb commented 7 months ago

> Hard to tell which path is correct, but if we have a "requirement" that the force_terminate_repair API stops a repair completely, then yes, we should stop. But maybe only when the error is this "user-requested" abort, while other possible failures would not stop it?
>
> Because keeping going sounds like a desirable behavior in general for any other error.

The API endpoint just returns one of 3 statuses: RUNNING, SUCCESSFUL, FAILED. There is no distinction on why it failed. So for now I will just change scylla-nodetool repair to follow Origin, and we can revisit this in the future if required.
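
For context, a sketch of how that status is typically polled over the REST API (assuming the commonly used `/storage_service/repair_async/{keyspace}` endpoint; parameters simplified):

```python
# Sketch: start a repair over the REST API and poll its status.
# The reported status is one of RUNNING, SUCCESSFUL, FAILED, with no
# indication of why a FAILED repair failed.
import time

import requests

API = "http://127.0.0.1:10000"
KEYSPACE = "system_auth"  # example keyspace from the log snippet above

seq = requests.post(f"{API}/storage_service/repair_async/{KEYSPACE}", timeout=30).json()
while True:
    status = requests.get(
        f"{API}/storage_service/repair_async/{KEYSPACE}",
        params={"id": seq},
        timeout=30,
    ).json()
    if status != "RUNNING":
        print(f"repair #{seq} finished with status {status}")
        break
    time.sleep(5)
```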

denesb commented 7 months ago

Fix is here: https://github.com/scylladb/scylladb/pull/17678

But it will need some time to get in, as it depends on another PR, which is not queued yet.

denesb commented 7 months ago

Should be fixed by https://github.com/scylladb/scylladb/commit/566223c34a161b08c666f3658b41fafff0f69271. @fruch please verify and close.

fruch commented 6 months ago

This is fixed.