scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

AbortRepairMonkey failed due to exceeding the 120 seconds timeout #4485

Closed KnifeyMoloko closed 1 year ago

KnifeyMoloko commented 2 years ago

Installation details
Kernel version: 5.11.0-1028-aws
Scylla version (or git commit hash): 5.1.dev-0.20220217.69fcc053b with build-id b8415b1ebbffff2b4183734680f4afab3bfed86d
Cluster size: 6 nodes (i3.4xlarge)

Scylla running with shards number (live nodes):
longevity-tls-50gb-3d-master-db-node-2a70ca0f-1 (3.250.23.34 | 10.0.2.108): 14 shards
longevity-tls-50gb-3d-master-db-node-2a70ca0f-9 (34.243.236.177 | 10.0.1.205): 14 shards
longevity-tls-50gb-3d-master-db-node-2a70ca0f-14 (3.250.223.137 | 10.0.1.30): 14 shards
longevity-tls-50gb-3d-master-db-node-2a70ca0f-17 (34.243.248.150 | 10.0.1.26): 14 shards
longevity-tls-50gb-3d-master-db-node-2a70ca0f-21 (34.251.180.151 | 10.0.1.159): 14 shards
longevity-tls-50gb-3d-master-db-node-2a70ca0f-22 (34.253.70.67 | 10.0.3.42): 14 shards
longevity-tls-50gb-3d-master-db-node-2a70ca0f-23 (18.203.254.31 | 10.0.1.198): 14 shards
longevity-tls-50gb-3d-master-db-node-2a70ca0f-24 (63.35.203.93 | 10.0.2.113): 14 shards

Scylla running with shards number (terminated nodes):
longevity-tls-50gb-3d-master-db-node-2a70ca0f-3 (54.170.189.138 | 10.0.0.69): 14 shards

List of nodes is larger than 10. See sct log for a full list of nodes.

OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-041d8500e7cf30167 (aws: eu-west-1)

Test: longevity-50gb-3days
Test name: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Issue description

==================================== The AbortRepairMonkey nemesis failed after hitting the 120-second timeout while waiting for the silenced_nodetool_repair_to_fail() method to finish. It looks like the repair command completed regardless, though:

-9 !    INFO |  [shard 0] repair - repair[0e1a64d8-abde-461c-908a-e1330a5f485f]: Finished to shutdown off-strategy compaction updater
< t:2022-02-20 03:44:23,524 f:events_processes.py l:146  c:sdcm.sct_events.events_processes p:DEBUG > Get process `MainDevice' from EventsProcessesRegistry[lod_dir=/home/ubuntu/sct-results/20220218-040646-652694,id=0x7f45be94faf0,default=True]
< t:2022-02-20 03:44:23,524 f:events_processes.py l:146  c:sdcm.sct_events.events_processes p:DEBUG > Get process `MainDevice' from EventsProcessesRegistry[lod_dir=/home/ubuntu/sct-results/20220218-040646-652694,id=0x7f45be94faf0,default=True]
< t:2022-02-20 03:44:23,524 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-02-20T03:44:23+00:00 longevity-tls-50gb-3d-master-db-node-2a70ca0f-9 !    INFO |  [shard 0] repair - repair[0e1a64d8-abde-461c-908a-e1330a5f485f]: Started to shutdown off-strategy compaction updater
< t:2022-02-20 03:44:23,525 f:db_log_reader.py l:113  c:sdcm.db_log_reader   p:DEBUG > 2022-02-20T03:44:23+00:00 longevity-tls-50gb-3d-master-db-node-2a70ca0f-9 !    INFO |  [shard 0] repair - repair[0e1a64d8-abde-461c-908a-e1330a5f485f]: Finished to shutdown off-strategy compaction updater
< t:2022-02-20 03:44:23,797 f:cluster.py      l:2613 c:sdcm.cluster_aws     p:DEBUG > Node longevity-tls-50gb-3d-master-db-node-2a70ca0f-9 [34.243.236.177 | 10.0.1.205] (seed: False): Command '/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair ' duration -> 139.8710241850058 s
< t:2022-02-20 03:44:23,798 f:nemesis.py      l:3314 c:sdcm.nemesis         p:ERROR > sdcm.nemesis.SisyphusMonkey: Unhandled exception in method <function Nemesis.disrupt_abort_repair at 0x7f45b2815240>
< t:2022-02-20 03:44:23,798 f:nemesis.py      l:3314 c:sdcm.nemesis         p:ERROR > Traceback (most recent call last):
< t:2022-02-20 03:44:23,798 f:nemesis.py      l:3314 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3304, in wrapper
< t:2022-02-20 03:44:23,798 f:nemesis.py      l:3314 c:sdcm.nemesis         p:ERROR >     result = method(*args, **kwargs)
< t:2022-02-20 03:44:23,798 f:nemesis.py      l:3314 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2127, in disrupt_abort_repair
< t:2022-02-20 03:44:23,798 f:nemesis.py      l:3314 c:sdcm.nemesis         p:ERROR >     thread.result(timeout=120)
< t:2022-02-20 03:44:23,798 f:nemesis.py      l:3314 c:sdcm.nemesis         p:ERROR >   File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 447, in result
< t:2022-02-20 03:44:23,798 f:nemesis.py      l:3314 c:sdcm.nemesis         p:ERROR >     raise TimeoutError()
< t:2022-02-20 03:44:23,798 f:nemesis.py      l:3314 c:sdcm.nemesis         p:ERROR > concurrent.futures._base.TimeoutError
< t:2022-02-20 03:44:23,799 f:events_processes.py l:146  c:sdcm.sct_events.events_processes p:DEBUG > Get process `MainDevice' from EventsProcessesRegistry[lod_dir=/home/ubuntu/sct-results/20220218-040646-652694,id=0x7f45be94faf0,default=True]
< t:2022-02-20 03:44:23,799 f:nemesis.py      l:3326 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: AbortRepair Node longevity-tls-50gb-3d-master-db-node-2a70ca0f-9 [34.243.236.177 | 10.0.1.205] (seed: False) duration -> 140 s 
< t:2022-02-20 03:44:23,799 f:nemesis.py      l:241  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: Update nemesis info with: {'start': 1645328523, 'end': 1645328663, 'duration': 140, 'node': 'Node longevity-tls-50gb-3d-master-db-node-2a70ca0f-9 [34.243.236.177 | 10.0.1.205] (seed: False)', 'subtype': 'end', 'error': '', 'full_traceback': 'Traceback (most recent call last):\n  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3304, in wrapper\n    result = method(*args, **kwargs)\n  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2127, in disrupt_abort_repair\n    thread.result(timeout=120)\n  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 447, in result\n    raise TimeoutError()\nconcurrent.futures._base.TimeoutError\n'}

====================================
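The traceback boils down to `concurrent.futures` semantics: `Future.result(timeout=...)` raises `TimeoutError` as soon as the deadline passes, but the worker thread keeps running, so the repair can still finish on its own afterwards (here at ~140 s against a 120 s timeout). A minimal sketch of that race, with hypothetical names (`slow_repair` stands in for the repair call; the durations are scaled down for illustration):

```python
import concurrent.futures
import time

def slow_repair():
    # Stand-in for the nodetool repair call; it takes slightly
    # longer than the caller is willing to wait.
    time.sleep(0.3)
    return "repair finished"

with concurrent.futures.ThreadPoolExecutor() as executor:
    future = executor.submit(slow_repair)
    try:
        # Analogous to thread.result(timeout=120) in disrupt_abort_repair:
        # raises TimeoutError without cancelling the running worker.
        future.result(timeout=0.1)
    except concurrent.futures.TimeoutError:
        print("TimeoutError raised by result()")
    # The worker was never stopped; waiting again shows it completed anyway.
    print(future.result())
```

This matches the log above: the nemesis reports an unhandled `TimeoutError` at 03:44:23 while the same second's db log shows the repair shutting down cleanly after 139.87 s.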

Restore Monitor Stack command: $ hydra investigate show-monitor 2a70ca0f-b6da-4b25-a7c3-d3ce9923a65e
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 2a70ca0f-b6da-4b25-a7c3-d3ce9923a65e

Test id: 2a70ca0f-b6da-4b25-a7c3-d3ce9923a65e

Logs:
- grafana - https://cloudius-jenkins-test.s3.amazonaws.com/2a70ca0f-b6da-4b25-a7c3-d3ce9923a65e/20220221_061344/grafana-screenshot-longevity-50gb-3days-scylla-per-server-metrics-nemesis-20220221_061638-longevity-tls-50gb-3d-master-monitor-node-2a70ca0f-1.png
- grafana - https://cloudius-jenkins-test.s3.amazonaws.com/2a70ca0f-b6da-4b25-a7c3-d3ce9923a65e/20220221_061344/grafana-screenshot-overview-20220221_061344-longevity-tls-50gb-3d-master-monitor-node-2a70ca0f-1.png
- critical - https://cloudius-jenkins-test.s3.amazonaws.com/2a70ca0f-b6da-4b25-a7c3-d3ce9923a65e/20220221_063735/critical-2a70ca0f.log.tar.gz
- db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/2a70ca0f-b6da-4b25-a7c3-d3ce9923a65e/20220221_063735/db-cluster-2a70ca0f.tar.gz
- debug - https://cloudius-jenkins-test.s3.amazonaws.com/2a70ca0f-b6da-4b25-a7c3-d3ce9923a65e/20220221_063735/debug-2a70ca0f.log.tar.gz
- email_data - https://cloudius-jenkins-test.s3.amazonaws.com/2a70ca0f-b6da-4b25-a7c3-d3ce9923a65e/20220221_063735/email_data-2a70ca0f.json.tar.gz
- error - https://cloudius-jenkins-test.s3.amazonaws.com/2a70ca0f-b6da-4b25-a7c3-d3ce9923a65e/20220221_063735/error-2a70ca0f.log.tar.gz
- event - https://cloudius-jenkins-test.s3.amazonaws.com/2a70ca0f-b6da-4b25-a7c3-d3ce9923a65e/20220221_063735/events-2a70ca0f.log.tar.gz
- left_processes - https://cloudius-jenkins-test.s3.amazonaws.com/2a70ca0f-b6da-4b25-a7c3-d3ce9923a65e/20220221_063735/left_processes-2a70ca0f.log.tar.gz
- loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/2a70ca0f-b6da-4b25-a7c3-d3ce9923a65e/20220221_063735/loader-set-2a70ca0f.tar.gz
- monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/2a70ca0f-b6da-4b25-a7c3-d3ce9923a65e/20220221_063735/monitor-set-2a70ca0f.tar.gz
- normal - https://cloudius-jenkins-test.s3.amazonaws.com/2a70ca0f-b6da-4b25-a7c3-d3ce9923a65e/20220221_063735/normal-2a70ca0f.log.tar.gz
- output - https://cloudius-jenkins-test.s3.amazonaws.com/2a70ca0f-b6da-4b25-a7c3-d3ce9923a65e/20220221_063735/output-2a70ca0f.log.tar.gz
- event - https://cloudius-jenkins-test.s3.amazonaws.com/2a70ca0f-b6da-4b25-a7c3-d3ce9923a65e/20220221_063735/raw_events-2a70ca0f.log.tar.gz
- sct - https://cloudius-jenkins-test.s3.amazonaws.com/2a70ca0f-b6da-4b25-a7c3-d3ce9923a65e/20220221_063735/sct-2a70ca0f.log.tar.gz
- summary - https://cloudius-jenkins-test.s3.amazonaws.com/2a70ca0f-b6da-4b25-a7c3-d3ce9923a65e/20220221_063735/summary-2a70ca0f.log.tar.gz
- warning - https://cloudius-jenkins-test.s3.amazonaws.com/2a70ca0f-b6da-4b25-a7c3-d3ce9923a65e/20220221_063735/warning-2a70ca0f.log.tar.gz

Jenkins job URL

fgelcer commented 2 years ago

seeing it on 5.0:

Installation details

Kernel version: 5.13.0-1029-aws
Scylla version (or git commit hash): 5.0~rc8-20220612.f28542a71 with build-id 85cf87619b93155a574647ec252ce5a043c7fe77
Cluster size: 4 nodes (i3.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-06d93b63348d73505 (aws: us-east-1)

Test: longevity-lwt-3h-test
Test id: 9a717cba-379e-4b3a-9ce2-2c26254c08cc
Test name: scylla-5.0/longevity/longevity-lwt-3h-test
Test config file(s):

Issue description

>>>>>>> Your description here... <<<<<<<

Logs:

Jenkins job URL

KnifeyMoloko commented 1 year ago

@fgelcer We don't see this issue come up very often (twice in a year). Extending the timeout from 120 s to 180 s would make it less likely to fail, or we can simply ignore it, since it's such a rare occurrence. WDYT?

fgelcer commented 1 year ago

If it is not occurring frequently, we should skip it and, if we encounter it again, report a new issue for it. Increasing the timeout may hide issues from us, given that most runs pass within 120 s.