scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

DecommissionStreamingErr times out when we interrupt the decommission process #4890

Open · KnifeyMoloko opened this issue 2 years ago

KnifeyMoloko commented 2 years ago

Installation details

Kernel Version: 5.13.0-1031-gcp
Scylla version (or git commit hash): 5.1.dev-20220612.e87ca733f0d3 with build-id 2fcb16ce2dcf313a6d86fce8631e18e3055ab359
Cluster size: 6 nodes (n1-highmem-16)

Scylla Nodes used in this run:

OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/6210294087766559177 (gce: us-east1)

Test: longevity-10gb-3h-gce-test
Test id: 43a559b8-f978-4c69-8590-068f82e64df4
Test name: scylla-master/longevity/longevity-10gb-3h-gce-test
Test config file(s):

Issue description

During DecommissionStreamingErr we use ParallelObject to start a node decommission and, in parallel, watch the logs for the decommission starting, then trigger a hard reboot to interrupt it. While this does what it is intended to do, we get ParallelObject timeouts, most likely because the first command (nodetool decommission, run via the remoter) never returns.

The solution would be to ignore the return value of the first command.
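
For context, a minimal sketch of the trigger/watcher pattern described above, with the trigger's result deliberately not waited on. The helper names and the executor wiring are illustrative assumptions, not the actual SCT ParallelObject implementation:

```python
from concurrent.futures import ThreadPoolExecutor


def run_decommission(node):
    """Placeholder for node.run_nodetool(sub_cmd='decommission', ...)."""


def reboot_when_streaming_starts(node):
    """Placeholder for the watcher that hard-reboots the node once
    'DECOMMISSIONING: unbootstrap starts' appears in its log."""


def start_and_interrupt_decommission(node, timeout=600):
    pool = ThreadPoolExecutor(max_workers=2)
    trigger = pool.submit(run_decommission, node)             # may never return
    watcher = pool.submit(reboot_when_streaming_starts, node)
    # Only the watcher has to complete: the hard reboot kills the SSH session
    # behind the trigger, so its return value is intentionally ignored.
    watcher.result(timeout=timeout)
    pool.shutdown(wait=False)  # do not block on the possibly hung trigger
    return trigger
```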

Error in question in sct.log:

2022-06-12 10:06:16.261: (NodetoolEvent Severity.NORMAL) period_type=begin event_id=c0b72657-f7a1-45a1-a4f9-c1a3ac9b3b7c: nodetool_command=decommission node=longevity-10gb-3h-master-db-node-43a559b8-0-5
...
...
...
< t:2022-06-12 10:16:16,263 f:nemesis.py      l:3410 c:sdcm.nemesis         p:ERROR > sdcm.nemesis.SisyphusMonkey: Unhandled exception in method <function Nemesis.disrupt_decommission_streaming_err at 0x7f277011e170>
< t:2022-06-12 10:16:16,263 f:nemesis.py      l:3410 c:sdcm.nemesis         p:ERROR > Traceback (most recent call last):
< t:2022-06-12 10:16:16,263 f:nemesis.py      l:3410 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3400, in wrapper
< t:2022-06-12 10:16:16,263 f:nemesis.py      l:3410 c:sdcm.nemesis         p:ERROR >     result = method(*args, **kwargs)
< t:2022-06-12 10:16:16,263 f:nemesis.py      l:3410 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2924, in disrupt_decommission_streaming_err
< t:2022-06-12 10:16:16,263 f:nemesis.py      l:3410 c:sdcm.nemesis         p:ERROR >     self.start_and_interrupt_decommission_streaming()
< t:2022-06-12 10:16:16,263 f:nemesis.py      l:3410 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2864, in start_and_interrupt_decommission_streaming
< t:2022-06-12 10:16:16,263 f:nemesis.py      l:3410 c:sdcm.nemesis         p:ERROR >     ParallelObject(objects=[trigger, watcher], timeout=600).call_objects()
< t:2022-06-12 10:16:16,263 f:nemesis.py      l:3410 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 460, in call_objects
< t:2022-06-12 10:16:16,263 f:nemesis.py      l:3410 c:sdcm.nemesis         p:ERROR >     return self.run(lambda x: x())
< t:2022-06-12 10:16:16,263 f:nemesis.py      l:3410 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 436, in run
< t:2022-06-12 10:16:16,263 f:nemesis.py      l:3410 c:sdcm.nemesis         p:ERROR >     raise FuturesTimeoutError("when running on: %s" % [r.obj for r in results])
< t:2022-06-12 10:16:16,263 f:nemesis.py      l:3410 c:sdcm.nemesis         p:ERROR > concurrent.futures._base.TimeoutError: when running on: [functools.partial(<bound method BaseNode.run_nodetool of <sdcm.cluster_gce.GCENode object at 0x7f2758743580>>, sub_cmd='decommission', warning_event_on_exception=(<class 'Exception'>,)), functools.partial(<bound method Nemesis._call_disrupt_func_after_expression_logged of <sdcm.nemesis.SisyphusMonkey object at 0x7f27682af430>>, expression='DECOMMISSIONING: unbootstrap starts', disrupt_func=<bound method BaseNode.reboot of <sdcm.cluster_gce.GCENode object at 0x7f2758743580>>, disrupt_func_kwargs={'hard': True, 'verify_ssh': True}, delay=0)]


Logs:

Jenkins job URL

roydahan commented 2 years ago

@KnifeyMoloko was something changed here? Since when have we been using ParallelObject?

If we changed the nemesis, please mention the commit/PR so we can revert.

ShlomiBalalis commented 2 years ago

The issue was reproduced in the longevity-50gb-3days-test run of Scylla 5.0:

Installation details

Kernel Version: 5.13.0-1029-aws
Scylla version (or git commit hash): 5.0~rc8-20220612.f28542a71 with build-id 85cf87619b93155a574647ec252ce5a043c7fe77
Cluster size: 6 nodes (i3.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-088626a20ac6084a7 (aws: eu-west-1)

Test: longevity-50gb-3days-test
Test id: 6d274d5f-0f67-4fff-9978-1e7b3f6cce0c
Test name: scylla-5.0/longevity/longevity-50gb-3days-test
Test config file(s):

Issue description

There were 7 runs of the disrupt_*_streaming_err nemesis variety in this run, and the error occurred in 3 of them:

In node longevity-tls-50gb-3d-5-0-db-node-6d274d5f-3 at 2022-06-21 11:23:31.372, in node longevity-tls-50gb-3d-5-0-db-node-6d274d5f-11 at 2022-06-22 06:28:45.786 and in node longevity-tls-50gb-3d-5-0-db-node-6d274d5f-2 at 2022-06-24 07:06:40.000:

(DisruptionEvent Severity.ERROR) period_type=end event_id=74dc8eb9-c6d8-458d-8b26-4ae624c5a863 duration=8m39s: nemesis_name=RebuildStreamingErr target_node=Node longevity-tls-50gb-3d-5-0-db-node-6d274d5f-3 [34.249.223.159 | 10.0.2.187] (seed: False) errors=when running on: [functools.partial(<bound method BaseNode.run_nodetool of <sdcm.cluster_aws.AWSNode object at 0x7f78541d79d0>>, sub_cmd='rebuild', warning_event_on_exception=(<class 'Exception'>,)), functools.partial(<bound method Nemesis._call_disrupt_func_after_expression_logged of <sdcm.nemesis.SisyphusMonkey object at 0x7f7814637fd0>>, expression='Rebuild starts', disrupt_func=<bound method BaseNode.reboot of <sdcm.cluster_aws.AWSNode object at 0x7f78541d79d0>>, disrupt_func_kwargs={'hard': True, 'verify_ssh': True}, timeout=300, delay=1)]
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3369, in wrapper
result = method(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2900, in disrupt_rebuild_streaming_err
self.start_and_interrupt_rebuild_streaming()
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2882, in start_and_interrupt_rebuild_streaming
ParallelObject(objects=[trigger, watcher], timeout=timeout + 60).call_objects()
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 460, in call_objects
return self.run(lambda x: x())
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 436, in run
raise FuturesTimeoutError("when running on: %s" % [r.obj for r in results])
concurrent.futures._base.TimeoutError: when running on: [functools.partial(<bound method BaseNode.run_nodetool of <sdcm.cluster_aws.AWSNode object at 0x7f78541d79d0>>, sub_cmd='rebuild', warning_event_on_exception=(<class 'Exception'>,)), functools.partial(<bound method Nemesis._call_disrupt_func_after_expression_logged of <sdcm.nemesis.SisyphusMonkey object at 0x7f7814637fd0>>, expression='Rebuild starts', disrupt_func=<bound method BaseNode.reboot of <sdcm.cluster_aws.AWSNode object at 0x7f78541d79d0>>, disrupt_func_kwargs={'hard': True, 'verify_ssh': True}, timeout=300, delay=1)]

Logs:

Jenkins job URL

roydahan commented 2 years ago

@KnifeyMoloko are you working on a fix?

(@fgelcer )

fruch commented 2 years ago

@fgelcer can you get someone to run those fixes, to verify they are working?

fgelcer commented 2 years ago

> @fgelcer can you get someone to run those fixes, to verify they are working?

My comment above (and the link to a PR) is not related to this one, hence I'm removing it.

fgelcer commented 2 years ago

@fruch, there were a few commits related to it merged to master (and one wasn't backported to 5.0). I will give it a run on a staging job to check whether this was fixed and what its status is.

fgelcer commented 2 years ago

@fruch, running it here

fgelcer commented 2 years ago

Seems it is still not passing.

enaydanov commented 2 years ago

Got the same for RepairStreamingErr nemesis in 5.0:

Installation details

Kernel Version: 5.13.0-1029-aws
Scylla version (or git commit hash): 5.0~rc8-20220612.f28542a71 with build-id 85cf87619b93155a574647ec252ce5a043c7fe77
Cluster size: 4 nodes (i3.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-088626a20ac6084a7 (aws: eu-west-1)

Test: longevity-lwt-parallel-schema-changes-with-disruptive-24h-test
Test id: 9a35c8d5-3278-4a93-9a33-e2670db5eddf
Test name: scylla-5.0/longevity/longevity-lwt-parallel-schema-changes-with-disruptive-24h-test
Test config file(s):

Issue description

Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3384, in wrapper
    result = method(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2919, in disrupt_repair_streaming_err
    self.start_and_interrupt_repair_streaming()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2870, in start_and_interrupt_repair_streaming
    ParallelObject(objects=[trigger, watcher], timeout=600).call_objects()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 460, in call_objects
    return self.run(lambda x: x())
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 436, in run
    raise FuturesTimeoutError("when running on: %s" % [r.obj for r in results])
concurrent.futures._base.TimeoutError: when running on: [functools.partial(<bound method BaseNode.run_nodetool of <sdcm.cluster_aws.AWSNode object at 0x7f8bab9f5b70>>, sub_cmd='repair', warning_event_on_exception=(<class 'Exception'>,)), functools.partial(<bound method Nemesis._call_disrupt_func_after_expression_logged of <sdcm.nemesis.DisruptiveMonkey object at 0x7f8dc07e66b0>>, expression='Repair 1 out of', disrupt_func=<function Nemesis.reboot_node at 0x7f8e0305d7e0>, disrupt_func_kwargs={'target_node': <sdcm.cluster_aws.AWSNode object at 0x7f8bab9f5b70>, 'hard': True, 'verify_ssh': True}, delay=1)]

Logs:

Jenkins job URL

fgelcer commented 2 years ago

@KnifeyMoloko, if this issue is not reproducing after the last fix, please close it.

KnifeyMoloko commented 2 years ago

I saw it passing in some tests and no new occurrences in the last 2 weeks, so closing this one.

yarongilor commented 1 year ago

Looks like this reproduced in 2022.1. Error:

disrupt_decommission_streaming_err | longevity-10gb-3h-2022-1-db-node-e05ad6be-0-3 | Failed | 2023-05-15 07:07:35 | 2023-05-15 07:30:13

Nemesis Information
Class: Sisyphus
Name: disrupt_decommission_streaming_err
Status: Failed
Failure reason:

Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3413, in wrapper
    result = method(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2933, in disrupt_decommission_streaming_err
    self.start_and_interrupt_decommission_streaming()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2868, in start_and_interrupt_decommission_streaming
    ParallelObject(objects=[trigger, watcher], timeout=1200).call_objects()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 461, in call_objects
    return self.run(lambda x: x())
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 437, in run
    raise FuturesTimeoutError("when running on: %s" % [r.obj for r in results])
concurrent.futures._base.TimeoutError: when running on: [functools.partial(<bound method BaseNode.run_nodetool of <sdcm.cluster_gce.GCENode object at 0x7faac95520b0>>, sub_cmd='decommission', warning_event_on_exception=(<class 'Exception'>,), retry=0), functools.partial(<bound method Nemesis._call_disrupt_func_after_expression_logged of <sdcm.nemesis.SisyphusMonkey object at 0x7faac935bd00>>, expression='DECOMMISSIONING: unbootstrap starts', disrupt_func=<function Nemesis.reboot_node at 0x7faac9399fc0>, disrupt_func_kwargs={'target_node': <sdcm.cluster_gce.GCENode object at 0x7faac95520b0>, 'hard': True, 'verify_ssh': True}, delay=0)]

Installation details

Kernel Version: 5.15.0-1032-gcp
Scylla version (or git commit hash): 2022.1.7-20230514.f5c8060f16 with build-id 847ca2704d89d74e2f44409d8f3719c20eeb876c

Cluster size: 6 nodes (n1-highmem-16)

Scylla Nodes used in this run:

OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/2495699258528504586 (gce: us-east1)

Test: longevity-10gb-3h-gce-test
Test id: e05ad6be-27ff-48bf-aff6-aebccfb692c9
Test name: enterprise-2022.1/longevity/longevity-10gb-3h-gce-test
Test config file(s):

Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor e05ad6be-27ff-48bf-aff6-aebccfb692c9`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=e05ad6be-27ff-48bf-aff6-aebccfb692c9)
- Show all stored logs command: `$ hydra investigate show-logs e05ad6be-27ff-48bf-aff6-aebccfb692c9`

Logs:
- **db-cluster-e05ad6be.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/e05ad6be-27ff-48bf-aff6-aebccfb692c9/20230515_090642/db-cluster-e05ad6be.tar.gz
- **monitor-set-e05ad6be.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/e05ad6be-27ff-48bf-aff6-aebccfb692c9/20230515_090642/monitor-set-e05ad6be.tar.gz
- **loader-set-e05ad6be.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/e05ad6be-27ff-48bf-aff6-aebccfb692c9/20230515_090642/loader-set-e05ad6be.tar.gz
- **sct-runner-e05ad6be.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/e05ad6be-27ff-48bf-aff6-aebccfb692c9/20230515_090642/sct-runner-e05ad6be.tar.gz
- **parallel-timelines-report-e05ad6be.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/e05ad6be-27ff-48bf-aff6-aebccfb692c9/20230515_090642/parallel-timelines-report-e05ad6be.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2022.1/job/longevity/job/longevity-10gb-3h-gce-test/59/)

fruch commented 1 year ago

This was never backported: https://github.com/scylladb/scylla-cluster-tests/pull/4937

ShlomiBalalis commented 1 year ago

Happened again in 2022.1:

Issue description

The decommission was executed:

< t:2023-06-06 05:18:42,059 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/usr/bin/nodetool  decommission "...

The node started decommissioning:

2023-06-06T05:18:44+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 !    INFO |  [shard 0] api - decommission
2023-06-06T05:18:44+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 !    INFO |  [shard 0] storage_service - DECOMMISSIONING: starts
2023-06-06T05:18:44+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 !    INFO |  [shard 0] storage_service - decommission[821ccb77-429f-4b9a-b5ae-02fb19fdcc96]: Started decommission operation, removing node=10.142.0.63, sync_nodes=[10.142.0.26, 10.142.0.63, 10.142.0.53,
 10.142.0.84, 10.142.0.56, 10.142.0.67], ignore_nodes=[]
2023-06-06T05:18:44+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 !    INFO |  [shard 0] storage_service - decommission[821ccb77-429f-4b9a-b5ae-02fb19fdcc96]: Added node=10.142.0.63 as leaving node, coordinator=10.142.0.63
2023-06-06T05:18:45+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 !    INFO |  [shard 0] storage_service - decommission[821ccb77-429f-4b9a-b5ae-02fb19fdcc96]: Started heartbeat_updater
2023-06-06T05:18:45+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 !    INFO |  [shard 0] storage_service - DECOMMISSIONING: unbootstrap starts

A reboot was executed:

< t:2023-06-06 05:18:47,575 f:cluster.py      l:1050 c:sdcm.cluster_gce     p:DEBUG > Node longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 [35.229.119.15 | 10.142.0.63] (seed: False): Hardly rebooting node

The node started up fine:

2023-06-06T05:19:11+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 !  NOTICE | syslog-ng starting up; version='3.25.1'
2023-06-06T05:19:11+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 !  NOTICE | Syslog connection established; fd='19', server='AF_INET(10.142.0.49:49153)', local='AF_INET(0.0.0.0:0)'
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 !  NOTICE | Linux version 5.15.0-1035-gcp (buildd@lcy02-amd64-086) (gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #43~20.04.1-Ubuntu SMP Mon May 22 16:49:11 UTC 2023 
(Ubuntu 5.15.0-1035.43~20.04.1-gcp 5.15.98)
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 !    INFO | Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-1035-gcp root=PARTUUID=90958ddd-9b82-4dd9-b6c9-4380b837b198 ro console=ttyS0 net.ifnames=0 clocksource=tsc tsc=reliable panic=-1
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 !    INFO | KERNEL supported cpus:
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 !    INFO |   Intel GenuineIntel
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 !    INFO |   AMD AuthenticAMD
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 !    INFO |   Hygon HygonGenuine
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 !    INFO |   Centaur CentaurHauls
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 !    INFO |   zhaoxin   Shanghai  
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 !    INFO | x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 !    INFO | x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 !    INFO | x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 !    INFO | x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 !    INFO | x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 !    INFO | signal: max sigframe size: 1776
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 !    INFO | BIOS-provided physical RAM map:

And even Scylla completed its initialization:

2023-06-06T05:20:06+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! INFO | [shard 0] init - Scylla version 2022.1.7-0.20230605.c375188572 initialization completed.

The uptime changed. Before:

< t:2023-06-06 05:18:46,075 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "uptime -s"...
< t:2023-06-06 05:18:47,074 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 2023-06-06 04:01:16
< t:2023-06-06 05:18:47,575 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "uptime -s" finished with status 0

after:

< t:2023-06-06 05:19:33,424 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "uptime -s"...
< t:2023-06-06 05:19:33,430 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 2023-06-06 05:19:03
< t:2023-06-06 05:19:33,930 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "uptime -s" finished with status 0
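
The `uptime -s` comparison above is how the hard reboot is confirmed: the boot time jumps from 2023-06-06 04:01:16 to 2023-06-06 05:19:03. A minimal sketch of such a check, assuming an SCT-style remoter whose `run()` result exposes `stdout` (the helper itself is illustrative, not an existing SCT function):

```python
def node_was_rebooted(remoter, boot_time_before: str) -> bool:
    """Return True if the node's boot time changed after the disruption.

    `boot_time_before` is the `uptime -s` output captured before the hard
    reboot (e.g. '2023-06-06 04:01:16'); a different value afterwards means
    the node really went through a reboot.
    """
    boot_time_after = remoter.run("uptime -s").stdout.strip()
    return boot_time_after != boot_time_before
```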

But the decommission command never ended:

< t:2023-06-06 05:38:42,068 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2023-06-06 05:38:42.066: (DisruptionEvent Severity.ERROR) period_type=end event_id=f4143b4a-ca71-48af-bee9-296c0cbaa62a duration=20m1s: nemesis_name=DecommissionStreamingErr target_node=Node longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 [35.229.119.15 | 10.142.0.63] (seed: False) errors=when running on: [functools.partial(<bound method BaseNode.run_nodetool of <sdcm.cluster_gce.GCENode object at 0x7fd8d20e64a0>>, sub_cmd='decommission', warning_event_on_exception=(<class 'Exception'>,), retry=0), functools.partial(<bound method Nemesis._call_disrupt_func_after_expression_logged of <sdcm.nemesis.SisyphusMonkey object at 0x7fd8d00e19f0>>, expression='DECOMMISSIONING: unbootstrap starts', disrupt_func=<function Nemesis.reboot_node at 0x7fd8d20e9fc0>, disrupt_func_kwargs={'target_node': <sdcm.cluster_gce.GCENode object at 0x7fd8d20e64a0>, 'hard': True, 'verify_ssh': True}, delay=0)]
< t:2023-06-06 05:38:42,068 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Traceback (most recent call last):
< t:2023-06-06 05:38:42,068 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3425, in wrapper
< t:2023-06-06 05:38:42,068 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     result = method(*args, **kwargs)
< t:2023-06-06 05:38:42,068 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2945, in disrupt_decommission_streaming_err
< t:2023-06-06 05:38:42,068 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     self.start_and_interrupt_decommission_streaming()
< t:2023-06-06 05:38:42,068 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2875, in start_and_interrupt_decommission_streaming
< t:2023-06-06 05:38:42,068 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     ParallelObject(objects=[trigger, watcher], timeout=1200).call_objects()
< t:2023-06-06 05:38:42,068 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 461, in call_objects
< t:2023-06-06 05:38:42,068 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     return self.run(lambda x: x())
< t:2023-06-06 05:38:42,068 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 437, in run
< t:2023-06-06 05:38:42,068 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     raise FuturesTimeoutError("when running on: %s" % [r.obj for r in results])
< t:2023-06-06 05:38:42,068 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > concurrent.futures._base.TimeoutError: when running on: [functools.partial(<bound method BaseNode.run_nodetool of <sdcm.cluster_gce.GCENode object at 0x7fd8d20e64a0>>, sub_cmd='decommission', warning_event_on_exception=(<class 'Exception'>,), retry=0), functools.partial(<bound method Nemesis._call_disrupt_func_after_expression_logged of <sdcm.nemesis.SisyphusMonkey object at 0x7fd8d00e19f0>>, expression='DECOMMISSIONING: unbootstrap starts', disrupt_func=<function Nemesis.reboot_node at 0x7fd8d20e9fc0>, disrupt_func_kwargs={'target_node': <sdcm.cluster_gce.GCENode object at 0x7fd8d20e64a0>, 'hard': True, 'verify_ssh': True}, delay=0)]

Installation details

Kernel Version: 5.15.0-1035-gcp
Scylla version (or git commit hash): 2022.1.7-20230605.c375188572 with build-id aea84bca04ec4ada913a63fa52b9660160601039

Cluster size: 6 nodes (n1-highmem-16)

Scylla Nodes used in this run:

OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/5215688002542750665 (gce: us-east1)

Test: longevity-10gb-3h-gce-test
Test id: 3bd52af4-015a-4f30-8615-466b87c5409d
Test name: enterprise-2022.1/longevity/longevity-10gb-3h-gce-test
Test config file(s):

Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor 3bd52af4-015a-4f30-8615-466b87c5409d`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=3bd52af4-015a-4f30-8615-466b87c5409d)
- Show all stored logs command: `$ hydra investigate show-logs 3bd52af4-015a-4f30-8615-466b87c5409d`

Logs:
- **db-cluster-3bd52af4.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3bd52af4-015a-4f30-8615-466b87c5409d/20230606_065453/db-cluster-3bd52af4.tar.gz
- **monitor-set-3bd52af4.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3bd52af4-015a-4f30-8615-466b87c5409d/20230606_065453/monitor-set-3bd52af4.tar.gz
- **loader-set-3bd52af4.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3bd52af4-015a-4f30-8615-466b87c5409d/20230606_065453/loader-set-3bd52af4.tar.gz
- **sct-runner-3bd52af4.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3bd52af4-015a-4f30-8615-466b87c5409d/20230606_065453/sct-runner-3bd52af4.tar.gz
- **parallel-timelines-report-3bd52af4.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/3bd52af4-015a-4f30-8615-466b87c5409d/20230606_065453/parallel-timelines-report-3bd52af4.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2022.1/job/longevity/job/longevity-10gb-3h-gce-test/62/)
[Argus](https://argus.scylladb.com/test/ed42ccd8-29a1-44bf-83bb-912b252d5d83/runs?additionalRuns[]=3bd52af4-015a-4f30-8615-466b87c5409d)

fgelcer commented 1 year ago

This was never backported: #4937

@fruch , can we backport it to 2022.1 and 2022.2?

fruch commented 1 year ago

It's already backported:

commits:  ce9313e61473753b9ba408f3a3e7e8768e9f7e5f
Already exists in branch-2022.1
commits:  ce9313e61473753b9ba408f3a3e7e8768e9f7e5f
Already exists in branch-2022.2

Sounds like a different issue, if the decommission command really gets stuck.

ShlomiBalalis commented 8 months ago

Issue description

It's been happening again (exactly as I described in https://github.com/scylladb/scylla-cluster-tests/issues/4890#issuecomment-1581055473) and I can't see what else the issue could be.

Nemesis starts:

< t:2024-01-24 17:14:43,783 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-01-24 17:14:43.780: (DisruptionEvent Severity.NORMAL) period_type=begin event_id=1b7d936c-2381-4cff-ab9d-712df80210f3: nemesis_name=DecommissionStreamingErr target_node=Node longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 [35.231.230.139 | 10.142.15.212] (seed: False)

Decommission command executed:

< t:2024-01-24 17:14:44,800 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/usr/bin/nodetool  decommission "...

Decommission starts on target node (6):

2024-01-24T17:14:47+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 !    INFO |  [shard 0] storage_service - DECOMMISSIONING: unbootstrap starts
2024-01-24T17:14:47+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 !    INFO |  [shard 0] storage_service - enable_repair_based_node_ops=true, allowed_repair_based_node_ops={replace}
2024-01-24T17:14:47+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 !    INFO |  [shard 0] storage_service - LEAVING: replaying batch log and streaming data to other nodes

Reboot command executed:

< t:2024-01-24 17:14:49,364 f:cluster.py      l:1050 c:sdcm.cluster_gce     p:DEBUG > Node longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 [35.231.230.139 | 10.142.15.212] (seed: False): Hardly rebooting node

Node came back up:

2024-01-24T17:15:12+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 !  NOTICE | Linux version 5.15.0-1048-gcp (buildd@lcy02-amd64-117) (gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #56~20.04.1-Ubuntu SMP Fri Nov 24 16:52:37 UTC 2023 (Ubuntu>
2024-01-24T17:15:12+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 !    INFO | Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-1048-gcp root=PARTUUID=d7b5660e-7486-4bb8-9b57-1242a39279da ro console=ttyS0 net.ifnames=0 clocksource=tsc tsc=reliable panic=-1
2024-01-24T17:15:12+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 !    INFO | KERNEL supported cpus:
2024-01-24T17:15:12+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 !    INFO |   Intel GenuineIntel
2024-01-24T17:15:12+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 !    INFO |   AMD AuthenticAMD
2024-01-24T17:15:12+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 !    INFO |   Hygon HygonGenuine
2024-01-24T17:15:12+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 !    INFO |   Centaur CentaurHauls
2024-01-24T17:15:12+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 !    INFO |   zhaoxin   Shanghai  
2024-01-24T17:15:12+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 !    INFO | BIOS-provided physical RAM map:

uptime changed:

< t:2024-01-24 17:14:47,812 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "uptime -s"...
< t:2024-01-24 17:14:48,863 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 2024-01-24 14:24:45

...

< t:2024-01-24 17:15:27,327 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "uptime -s"...
< t:2024-01-24 17:15:27,337 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 2024-01-24 17:15:10

But the decommission command never ended, and timed out:

< t:2024-01-24 17:34:44,813 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-01-24 17:34:44.811: (DisruptionEvent Severity.ERROR) period_type=end event_id=1b7d936c-2381-4cff-ab9d-712df80210f3 duration=20m1s: nemesis_name=DecommissionStreamingErr target_node=Node longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 [35.231.230.139 | 10.142.15.212] (seed: False) errors=when running on: [functools.partial(<bound method BaseNode.run_nodetool of <sdcm.cluster_gce.GCENode object at 0x7fa7f0196800>>, sub_cmd='decommission', warning_event_on_exception=(<class 'Exception'>,), retry=0), functools.partial(<bound method Nemesis._call_disrupt_func_after_expression_logged of <sdcm.nemesis.SisyphusMonkey object at 0x7fa7d87c77f0>>, expression='DECOMMISSIONING: unbootstrap starts', disrupt_func=<function Nemesis.reboot_node at 0x7fa7f01caf80>, disrupt_func_kwargs={'target_node': <sdcm.cluster_gce.GCENode object at 0x7fa7f0196800>, 'hard': True, 'verify_ssh': True}, delay=0)]
< t:2024-01-24 17:34:44,813 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Traceback (most recent call last):
< t:2024-01-24 17:34:44,813 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3579, in wrapper
< t:2024-01-24 17:34:44,813 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     result = method(*args, **kwargs)
< t:2024-01-24 17:34:44,813 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3100, in disrupt_decommission_streaming_err
< t:2024-01-24 17:34:44,813 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     self.start_and_interrupt_decommission_streaming()
< t:2024-01-24 17:34:44,813 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3030, in start_and_interrupt_decommission_streaming
< t:2024-01-24 17:34:44,813 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     ParallelObject(objects=[trigger, watcher], timeout=1200).call_objects()
< t:2024-01-24 17:34:44,813 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 461, in call_objects
< t:2024-01-24 17:34:44,813 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     return self.run(lambda x: x(), ignore_exceptions=ignore_exceptions)
< t:2024-01-24 17:34:44,813 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 437, in run
< t:2024-01-24 17:34:44,813 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     raise FuturesTimeoutError("when running on: %s" % [r.obj for r in results])
< t:2024-01-24 17:34:44,813 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > concurrent.futures._base.TimeoutError: when running on: [functools.partial(<bound method BaseNode.run_nodetool of <sdcm.cluster_gce.GCENode object at 0x7fa7f0196800>>, sub_cmd='decommission', warning_event_on_exception=(<class 'Exception'>,), retry=0), functools.partial(<bound method Nemesis._call_disrupt_func_after_expression_logged of <sdcm.nemesis.SisyphusMonkey object at 0x7fa7d87c77f0>>, expression='DECOMMISSIONING: unbootstrap starts', disrupt_func=<function Nemesis.reboot_node at 0x7fa7f01caf80>, disrupt_func_kwargs={'target_node': <sdcm.cluster_gce.GCENode object at 0x7fa7f0196800>, 'hard': True, 'verify_ssh': True}, delay=0)]
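
Since the target node restarts cleanly and Scylla finishes initialization while the client-side `nodetool decommission` never returns, a follow-up check one could add is to ask another node whether the target is back in the ring as Up/Normal, i.e. the decommission was aborted rather than left half-done. Illustrative sketch only; the `run_nodetool` return shape and the UN-line parsing are assumptions, not an existing SCT helper:

```python
def target_back_in_ring(verification_node, target_ip: str) -> bool:
    """Illustrative check: after the interrupted decommission and the hard
    reboot, run `nodetool status` from another node and look for the target
    listed as UN (Up/Normal), meaning the decommission did not complete."""
    status_output = verification_node.run_nodetool(sub_cmd="status").stdout
    return any(line.startswith("UN") and target_ip in line
               for line in status_output.splitlines())
```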

Installation details

Kernel Version: 5.15.0-1048-gcp
Scylla version (or git commit hash): 2022.1.14-20240123.e90d107d48 with build-id 2c8bf367b4aedcb4b3d360f16ae02749880c24b7

Cluster size: 6 nodes (n1-highmem-16)

Scylla Nodes used in this run:

OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/1281536115900986284 (gce: us-east1)

Test: longevity-10gb-3h-gce-test
Test id: a32da17d-ae7c-44b0-96a5-63feca467fc5
Test name: enterprise-2022.1/longevity/longevity-10gb-3h-gce-test
Test config file(s):

Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor a32da17d-ae7c-44b0-96a5-63feca467fc5`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=a32da17d-ae7c-44b0-96a5-63feca467fc5)
- Show all stored logs command: `$ hydra investigate show-logs a32da17d-ae7c-44b0-96a5-63feca467fc5`

Logs:
- **db-cluster-a32da17d.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/a32da17d-ae7c-44b0-96a5-63feca467fc5/20240124_184849/db-cluster-a32da17d.tar.gz
- **monitor-set-a32da17d.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/a32da17d-ae7c-44b0-96a5-63feca467fc5/20240124_184849/monitor-set-a32da17d.tar.gz
- **loader-set-a32da17d.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/a32da17d-ae7c-44b0-96a5-63feca467fc5/20240124_184849/loader-set-a32da17d.tar.gz
- **sct-runner-a32da17d.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/a32da17d-ae7c-44b0-96a5-63feca467fc5/20240124_184849/sct-runner-a32da17d.tar.gz
- **parallel-timelines-report-a32da17d.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/a32da17d-ae7c-44b0-96a5-63feca467fc5/20240124_184849/parallel-timelines-report-a32da17d.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2022.1/job/longevity/job/longevity-10gb-3h-gce-test/80/)
[Argus](https://argus.scylladb.com/test/ed42ccd8-29a1-44bf-83bb-912b252d5d83/runs?additionalRuns[]=a32da17d-ae7c-44b0-96a5-63feca467fc5)