KnifeyMoloko opened this issue 2 years ago
@KnifeyMoloko was something changed here? Since when did we start using ParallelObject?
If we changed the nemesis, please mention the commit/PR so we can revert.
The issue was reproduced in the longevity-50gb-3days-test run of scylla 5.0:
Kernel Version: 5.13.0-1029-aws
Scylla version (or git commit hash): 5.0~rc8-20220612.f28542a71
with build-id 85cf87619b93155a574647ec252ce5a043c7fe77
Cluster size: 6 nodes (i3.4xlarge)
Scylla Nodes used in this run:
OS / Image: ami-088626a20ac6084a7
(aws: eu-west-1)
Test: longevity-50gb-3days-test
Test id: 6d274d5f-0f67-4fff-9978-1e7b3f6cce0c
Test name: scylla-5.0/longevity/longevity-50gb-3days-test
Test config file(s):
There were 7 runs of the disrupt_*_streaming_err nemesis variety in this run, and the error occurred in 3 of them:
in node longevity-tls-50gb-3d-5-0-db-node-6d274d5f-3 at 2022-06-21 11:23:31.372,
in node longevity-tls-50gb-3d-5-0-db-node-6d274d5f-11 at 2022-06-22 06:28:45.786,
and in node longevity-tls-50gb-3d-5-0-db-node-6d274d5f-2 at 2022-06-24 07:06:40.000:
(DisruptionEvent Severity.ERROR) period_type=end event_id=74dc8eb9-c6d8-458d-8b26-4ae624c5a863 duration=8m39s: nemesis_name=RebuildStreamingErr target_node=Node longevity-tls-50gb-3d-5-0-db-node-6d274d5f-3 [34.249.223.159 | 10.0.2.187] (seed: False) errors=when running on: [functools.partial(<bound method BaseNode.run_nodetool of <sdcm.cluster_aws.AWSNode object at 0x7f78541d79d0>>, sub_cmd='rebuild', warning_event_on_exception=(<class 'Exception'>,)), functools.partial(<bound method Nemesis._call_disrupt_func_after_expression_logged of <sdcm.nemesis.SisyphusMonkey object at 0x7f7814637fd0>>, expression='Rebuild starts', disrupt_func=<bound method BaseNode.reboot of <sdcm.cluster_aws.AWSNode object at 0x7f78541d79d0>>, disrupt_func_kwargs={'hard': True, 'verify_ssh': True}, timeout=300, delay=1)]
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3369, in wrapper
result = method(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2900, in disrupt_rebuild_streaming_err
self.start_and_interrupt_rebuild_streaming()
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2882, in start_and_interrupt_rebuild_streaming
ParallelObject(objects=[trigger, watcher], timeout=timeout + 60).call_objects()
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 460, in call_objects
return self.run(lambda x: x())
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 436, in run
raise FuturesTimeoutError("when running on: %s" % [r.obj for r in results])
concurrent.futures._base.TimeoutError: when running on: [functools.partial(<bound method BaseNode.run_nodetool of <sdcm.cluster_aws.AWSNode object at 0x7f78541d79d0>>, sub_cmd='rebuild', warning_event_on_exception=(<class 'Exception'>,)), functools.partial(<bound method Nemesis._call_disrupt_func_after_expression_logged of <sdcm.nemesis.SisyphusMonkey object at 0x7f7814637fd0>>, expression='Rebuild starts', disrupt_func=<bound method BaseNode.reboot of <sdcm.cluster_aws.AWSNode object at 0x7f78541d79d0>>, disrupt_func_kwargs={'hard': True, 'verify_ssh': True}, timeout=300, delay=1)]
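The mechanism behind this traceback can be sketched with a minimal, hypothetical simplification of the trigger/watcher pattern (standard library only; `run_in_parallel`, `trigger`, and `watcher` are illustrative names, not SCT's actual ParallelObject API): two callables run in a thread pool, and if either is still running at the deadline, a TimeoutError listing both is raised, which is exactly what happens when the `nodetool` command never returns.

```python
# Minimal sketch of the trigger/watcher pattern that is timing out here.
# Simplified illustration only -- not SCT's actual ParallelObject implementation.
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeoutError


def run_in_parallel(callables, timeout):
    """Run all callables concurrently; fail if any is still running at the deadline."""
    with ThreadPoolExecutor(max_workers=len(callables)) as pool:
        futures = [pool.submit(c) for c in callables]
        deadline = time.monotonic() + timeout
        for future in futures:
            remaining = max(0.0, deadline - time.monotonic())
            try:
                future.result(timeout=remaining)
            except FuturesTimeoutError:
                raise FuturesTimeoutError("when running on: %s" % callables)


trigger = lambda: time.sleep(2)   # stands in for `nodetool decommission` never returning
watcher = lambda: None            # stands in for the log watcher + hard reboot

try:
    run_in_parallel([trigger, watcher], timeout=0.5)
except FuturesTimeoutError as exc:
    print("timed out:", exc)
```

Note that the timeout applies to both callables equally, so a trigger command whose SSH session is killed by the reboot will always push the pair past the deadline, even when the disruption itself succeeded.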
$ hydra investigate show-monitor 6d274d5f-0f67-4fff-9978-1e7b3f6cce0c
$ hydra investigate show-logs 6d274d5f-0f67-4fff-9978-1e7b3f6cce0c
@KnifeyMoloko are you working on a fix?
(@fgelcer )
@fgelcer can you get someone to run those fixes, to verify they are working ?
My comment above (and the link to a PR) is not related to this one, hence I'm removing it.
@fruch, there were a few commits related to it merged to master (and one wasn't backported to 5.0). I will give it a run on a staging job to check whether this was fixed and what its status is.
Seems it's still not passing.
Got the same for RepairStreamingErr nemesis in 5.0:
Kernel Version: 5.13.0-1029-aws
Scylla version (or git commit hash): 5.0~rc8-20220612.f28542a71
with build-id 85cf87619b93155a574647ec252ce5a043c7fe77
Cluster size: 4 nodes (i3.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-088626a20ac6084a7
(aws: eu-west-1)
Test: longevity-lwt-parallel-schema-changes-with-disruptive-24h-test
Test id: 9a35c8d5-3278-4a93-9a33-e2670db5eddf
Test name: scylla-5.0/longevity/longevity-lwt-parallel-schema-changes-with-disruptive-24h-test
Test config file(s):
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3384, in wrapper
result = method(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2919, in disrupt_repair_streaming_err
self.start_and_interrupt_repair_streaming()
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2870, in start_and_interrupt_repair_streaming
ParallelObject(objects=[trigger, watcher], timeout=600).call_objects()
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 460, in call_objects
return self.run(lambda x: x())
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 436, in run
raise FuturesTimeoutError("when running on: %s" % [r.obj for r in results])
concurrent.futures._base.TimeoutError: when running on: [functools.partial(<bound method BaseNode.run_nodetool of <sdcm.cluster_aws.AWSNode object at 0x7f8bab9f5b70>>, sub_cmd='repair', warning_event_on_exception=(<class 'Exception'>,)), functools.partial(<bound method Nemesis._call_disrupt_func_after_expression_logged of <sdcm.nemesis.DisruptiveMonkey object at 0x7f8dc07e66b0>>, expression='Repair 1 out of', disrupt_func=<function Nemesis.reboot_node at 0x7f8e0305d7e0>, disrupt_func_kwargs={'target_node': <sdcm.cluster_aws.AWSNode object at 0x7f8bab9f5b70>, 'hard': True, 'verify_ssh': True}, delay=1)]
$ hydra investigate show-monitor 9a35c8d5-3278-4a93-9a33-e2670db5eddf
$ hydra investigate show-logs 9a35c8d5-3278-4a93-9a33-e2670db5eddf
@KnifeyMoloko, if this issue is not reproducing after the last fix, please close it.
I saw it passing in some tests and no new occurrences in the last 2 weeks, so closing this one.
Looks like this reproduced in 2022.1; error:
--
disrupt_decommission_streaming_err | longevity-10gb-3h-2022-1-db-node-e05ad6be-0-3 | Failed | 2023-05-15 07:07:35 | 2023-05-15 07:30:13
Nemesis InformationClass: SisyphusName: disrupt_decommission_streaming_errStatus: FailedFailure reasonTraceback (most recent call last): File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3413, in wrapper result = method(*args, **kwargs) File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2933, in disrupt_decommission_streaming_err self.start_and_interrupt_decommission_streaming() File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2868, in start_and_interrupt_decommission_streaming ParallelObject(objects=[trigger, watcher], timeout=1200).call_objects() File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 461, in call_objects return self.run(lambda x: x()) File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 437, in run raise FuturesTimeoutError("when running on: %s" % [r.obj for r in results]) concurrent.futures._base.TimeoutError: when running on: [functools.partial(<bound method BaseNode.run_nodetool of <sdcm.cluster_gce.GCENode object at 0x7faac95520b0>>, sub_cmd='decommission', warning_event_on_exception=(<class 'Exception'>,), retry=0), functools.partial(<bound method Nemesis._call_disrupt_func_after_expression_logged of <sdcm.nemesis.SisyphusMonkey object at 0x7faac935bd00>>, expression='DECOMMISSIONING: unbootstrap starts', disrupt_func=<function Nemesis.reboot_node at 0x7faac9399fc0>, disrupt_func_kwargs={'target_node': <sdcm.cluster_gce.GCENode object at 0x7faac95520b0>, 'hard': True, 'verify_ssh': True}, delay=0)]
Kernel Version: 5.15.0-1032-gcp
Scylla version (or git commit hash): 2022.1.7-20230514.f5c8060f16
with build-id 847ca2704d89d74e2f44409d8f3719c20eeb876c
Cluster size: 6 nodes (n1-highmem-16)
Scylla Nodes used in this run:
OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/2495699258528504586
(gce: us-east1)
Test: longevity-10gb-3h-gce-test
Test id: e05ad6be-27ff-48bf-aff6-aebccfb692c9
Test name: enterprise-2022.1/longevity/longevity-10gb-3h-gce-test
Test config file(s):
This was never backported: https://github.com/scylladb/scylla-cluster-tests/pull/4937
Happened again in 2022.1:
The decommission was executed:
< t:2023-06-06 05:18:42,059 f:remote_base.py l:520 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/usr/bin/nodetool decommission "...
The node started decommissioning:
2023-06-06T05:18:44+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! INFO | [shard 0] api - decommission
2023-06-06T05:18:44+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! INFO | [shard 0] storage_service - DECOMMISSIONING: starts
2023-06-06T05:18:44+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! INFO | [shard 0] storage_service - decommission[821ccb77-429f-4b9a-b5ae-02fb19fdcc96]: Started decommission operation, removing node=10.142.0.63, sync_nodes=[10.142.0.26, 10.142.0.63, 10.142.0.53,
10.142.0.84, 10.142.0.56, 10.142.0.67], ignore_nodes=[]
2023-06-06T05:18:44+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! INFO | [shard 0] storage_service - decommission[821ccb77-429f-4b9a-b5ae-02fb19fdcc96]: Added node=10.142.0.63 as leaving node, coordinator=10.142.0.63
2023-06-06T05:18:45+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! INFO | [shard 0] storage_service - decommission[821ccb77-429f-4b9a-b5ae-02fb19fdcc96]: Started heartbeat_updater
2023-06-06T05:18:45+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! INFO | [shard 0] storage_service - DECOMMISSIONING: unbootstrap starts
A reboot was executed:
< t:2023-06-06 05:18:47,575 f:cluster.py l:1050 c:sdcm.cluster_gce p:DEBUG > Node longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 [35.229.119.15 | 10.142.0.63] (seed: False): Hardly rebooting node
The node started up fine:
2023-06-06T05:19:11+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! NOTICE | syslog-ng starting up; version='3.25.1'
2023-06-06T05:19:11+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! NOTICE | Syslog connection established; fd='19', server='AF_INET(10.142.0.49:49153)', local='AF_INET(0.0.0.0:0)'
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! NOTICE | Linux version 5.15.0-1035-gcp (buildd@lcy02-amd64-086) (gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #43~20.04.1-Ubuntu SMP Mon May 22 16:49:11 UTC 2023
(Ubuntu 5.15.0-1035.43~20.04.1-gcp 5.15.98)
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! INFO | Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-1035-gcp root=PARTUUID=90958ddd-9b82-4dd9-b6c9-4380b837b198 ro console=ttyS0 net.ifnames=0 clocksource=tsc tsc=reliable panic=-1
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! INFO | KERNEL supported cpus:
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! INFO | Intel GenuineIntel
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! INFO | AMD AuthenticAMD
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! INFO | Hygon HygonGenuine
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! INFO | Centaur CentaurHauls
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! INFO | zhaoxin Shanghai
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! INFO | x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! INFO | x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! INFO | x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! INFO | x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! INFO | x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! INFO | signal: max sigframe size: 1776
2023-06-06T05:19:05+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! INFO | BIOS-provided physical RAM map:
(and even Scylla started fine)
2023-06-06T05:20:06+00:00 longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 ! INFO | [shard 0] init - Scylla version 2022.1.7-0.20230605.c375188572 initialization completed.
The uptime changed. Before:
< t:2023-06-06 05:18:46,075 f:remote_base.py l:520 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "uptime -s"...
< t:2023-06-06 05:18:47,074 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > 2023-06-06 04:01:16
< t:2023-06-06 05:18:47,575 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "uptime -s" finished with status 0
after:
< t:2023-06-06 05:19:33,424 f:remote_base.py l:520 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "uptime -s"...
< t:2023-06-06 05:19:33,430 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > 2023-06-06 05:19:03
< t:2023-06-06 05:19:33,930 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "uptime -s" finished with status 0
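The reboot check visible in these logs boils down to comparing `uptime -s` (which prints the system boot timestamp) before and after the disruption: a later value proves the node really rebooted. A sketch of that comparison, with illustrative helper names rather than SCT's actual API:

```python
# Sketch of the reboot verification implied by the log excerpt above.
# `uptime -s` prints the boot timestamp, so a later value after the
# disruption proves the node rebooted. Helper names are illustrative.
from datetime import datetime


def parse_boot_time(uptime_s_output: str) -> datetime:
    # `uptime -s` output looks like: 2023-06-06 04:01:16
    return datetime.strptime(uptime_s_output.strip(), "%Y-%m-%d %H:%M:%S")


def node_was_rebooted(boot_time_before: str, boot_time_after: str) -> bool:
    return parse_boot_time(boot_time_after) > parse_boot_time(boot_time_before)


# Values taken from the log excerpt above:
print(node_was_rebooted("2023-06-06 04:01:16", "2023-06-06 05:19:03"))  # True
```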
But the decommission command never ended:
< t:2023-06-06 05:38:42,068 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > 2023-06-06 05:38:42.066: (DisruptionEvent Severity.ERROR) period_type=end event_id=f4143b4a-ca71-48af-bee9-296c0cbaa62a duration=20m1s: nemesis_name=DecommissionStreamingErr target_node=Node longevity-10gb-3h-2022-1-db-node-3bd52af4-0-2 [35.229.119.15 | 10.142.0.63] (seed: False) errors=when running on: [functools.partial(<bound method BaseNode.run_nodetool of <sdcm.cluster_gce.GCENode object at 0x7fd8d20e64a0>>, sub_cmd='decommission', warning_event_on_exception=(<class 'Exception'>,), retry=0), functools.partial(<bound method Nemesis._call_disrupt_func_after_expression_logged of <sdcm.nemesis.SisyphusMonkey object at 0x7fd8d00e19f0>>, expression='DECOMMISSIONING: unbootstrap starts', disrupt_func=<function Nemesis.reboot_node at 0x7fd8d20e9fc0>, disrupt_func_kwargs={'target_node': <sdcm.cluster_gce.GCENode object at 0x7fd8d20e64a0>, 'hard': True, 'verify_ssh': True}, delay=0)]
< t:2023-06-06 05:38:42,068 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > Traceback (most recent call last):
< t:2023-06-06 05:38:42,068 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3425, in wrapper
< t:2023-06-06 05:38:42,068 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > result = method(*args, **kwargs)
< t:2023-06-06 05:38:42,068 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2945, in disrupt_decommission_streaming_err
< t:2023-06-06 05:38:42,068 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > self.start_and_interrupt_decommission_streaming()
< t:2023-06-06 05:38:42,068 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2875, in start_and_interrupt_decommission_streaming
< t:2023-06-06 05:38:42,068 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > ParallelObject(objects=[trigger, watcher], timeout=1200).call_objects()
< t:2023-06-06 05:38:42,068 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 461, in call_objects
< t:2023-06-06 05:38:42,068 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > return self.run(lambda x: x())
< t:2023-06-06 05:38:42,068 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 437, in run
< t:2023-06-06 05:38:42,068 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > raise FuturesTimeoutError("when running on: %s" % [r.obj for r in results])
< t:2023-06-06 05:38:42,068 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > concurrent.futures._base.TimeoutError: when running on: [functools.partial(<bound method BaseNode.run_nodetool of <sdcm.cluster_gce.GCENode object at 0x7fd8d20e64a0>>, sub_cmd='decommission', warning_event_on_exception=(<class 'Exception'>,), retry=0), functools.partial(<bound method Nemesis._call_disrupt_func_after_expression_logged of <sdcm.nemesis.SisyphusMonkey object at 0x7fd8d00e19f0>>, expression='DECOMMISSIONING: unbootstrap starts', disrupt_func=<function Nemesis.reboot_node at 0x7fd8d20e9fc0>, disrupt_func_kwargs={'target_node': <sdcm.cluster_gce.GCENode object at 0x7fd8d20e64a0>, 'hard': True, 'verify_ssh': True}, delay=0)]
Kernel Version: 5.15.0-1035-gcp
Scylla version (or git commit hash): 2022.1.7-20230605.c375188572
with build-id aea84bca04ec4ada913a63fa52b9660160601039
Cluster size: 6 nodes (n1-highmem-16)
Scylla Nodes used in this run:
OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/5215688002542750665
(gce: us-east1)
Test: longevity-10gb-3h-gce-test
Test id: 3bd52af4-015a-4f30-8615-466b87c5409d
Test name: enterprise-2022.1/longevity/longevity-10gb-3h-gce-test
Test config file(s):
This was never backported: #4937
@fruch , can we backport it to 2022.1 and 2022.2?
it's already backported, i.e.
commits: ce9313e61473753b9ba408f3a3e7e8768e9f7e5f
Already exists in branch-2022.1
commits: ce9313e61473753b9ba408f3a3e7e8768e9f7e5f
Already exists in branch-2022.2
Sounds like a different issue, if the decommission command really gets stuck.
It's been happening again (exactly as I described it in https://github.com/scylladb/scylla-cluster-tests/issues/4890#issuecomment-1581055473) and I can't see what else the issue could be.
Nemesis starts:
< t:2024-01-24 17:14:43,783 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > 2024-01-24 17:14:43.780: (DisruptionEvent Severity.NORMAL) period_type=begin event_id=1b7d936c-2381-4cff-ab9d-712df80210f3: nemesis_name=DecommissionStreamingErr target_node=Node longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 [35.231.230.139 | 10.142.15.212] (seed: False)
Decommission command executed:
< t:2024-01-24 17:14:44,800 f:remote_base.py l:520 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/usr/bin/nodetool decommission "...
Decommission starts on target node (6):
2024-01-24T17:14:47+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 ! INFO | [shard 0] storage_service - DECOMMISSIONING: unbootstrap starts
2024-01-24T17:14:47+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 ! INFO | [shard 0] storage_service - enable_repair_based_node_ops=true, allowed_repair_based_node_ops={replace}
2024-01-24T17:14:47+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 ! INFO | [shard 0] storage_service - LEAVING: replaying batch log and streaming data to other nodes
Reboot command executed:
< t:2024-01-24 17:14:49,364 f:cluster.py l:1050 c:sdcm.cluster_gce p:DEBUG > Node longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 [35.231.230.139 | 10.142.15.212] (seed: False): Hardly rebooting node
Node came back up:
2024-01-24T17:15:12+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 ! NOTICE | Linux version 5.15.0-1048-gcp (buildd@lcy02-amd64-117) (gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #56~20.04.1-Ubuntu SMP Fri Nov 24 16:52:37 UTC 2023 (Ubuntu>
2024-01-24T17:15:12+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 ! INFO | Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-1048-gcp root=PARTUUID=d7b5660e-7486-4bb8-9b57-1242a39279da ro console=ttyS0 net.ifnames=0 clocksource=tsc tsc=reliable panic=-1
2024-01-24T17:15:12+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 ! INFO | KERNEL supported cpus:
2024-01-24T17:15:12+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 ! INFO | Intel GenuineIntel
2024-01-24T17:15:12+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 ! INFO | AMD AuthenticAMD
2024-01-24T17:15:12+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 ! INFO | Hygon HygonGenuine
2024-01-24T17:15:12+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 ! INFO | Centaur CentaurHauls
2024-01-24T17:15:12+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 ! INFO | zhaoxin Shanghai
2024-01-24T17:15:12+00:00 longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 ! INFO | BIOS-provided physical RAM map:
uptime changed:
< t:2024-01-24 17:14:47,812 f:remote_base.py l:520 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "uptime -s"...
< t:2024-01-24 17:14:48,863 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > 2024-01-24 14:24:45
...
< t:2024-01-24 17:15:27,327 f:remote_base.py l:520 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "uptime -s"...
< t:2024-01-24 17:15:27,337 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > 2024-01-24 17:15:10
But the decommission command never ended, and timed out:
< t:2024-01-24 17:34:44,813 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > 2024-01-24 17:34:44.811: (DisruptionEvent Severity.ERROR) period_type=end event_id=1b7d936c-2381-4cff-ab9d-712df80210f3 duration=20m1s: nemesis_name=DecommissionStreamingErr target_node=Node longevity-10gb-3h-2022-1-db-node-a32da17d-0-6 [35.231.230.139 | 10.142.15.212] (seed: False) errors=when running on: [functools.partial(<bound method BaseNode.run_nodetool of <sdcm.cluster_gce.GCENode object at 0x7fa7f0196800>>, sub_cmd='decommission', warning_event_on_exception=(<class 'Exception'>,), retry=0), functools.partial(<bound method Nemesis._call_disrupt_func_after_expression_logged of <sdcm.nemesis.SisyphusMonkey object at 0x7fa7d87c77f0>>, expression='DECOMMISSIONING: unbootstrap starts', disrupt_func=<function Nemesis.reboot_node at 0x7fa7f01caf80>, disrupt_func_kwargs={'target_node': <sdcm.cluster_gce.GCENode object at 0x7fa7f0196800>, 'hard': True, 'verify_ssh': True}, delay=0)]
< t:2024-01-24 17:34:44,813 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > Traceback (most recent call last):
< t:2024-01-24 17:34:44,813 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3579, in wrapper
< t:2024-01-24 17:34:44,813 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > result = method(*args, **kwargs)
< t:2024-01-24 17:34:44,813 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3100, in disrupt_decommission_streaming_err
< t:2024-01-24 17:34:44,813 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > self.start_and_interrupt_decommission_streaming()
< t:2024-01-24 17:34:44,813 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3030, in start_and_interrupt_decommission_streaming
< t:2024-01-24 17:34:44,813 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > ParallelObject(objects=[trigger, watcher], timeout=1200).call_objects()
< t:2024-01-24 17:34:44,813 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 461, in call_objects
< t:2024-01-24 17:34:44,813 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > return self.run(lambda x: x(), ignore_exceptions=ignore_exceptions)
< t:2024-01-24 17:34:44,813 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 437, in run
< t:2024-01-24 17:34:44,813 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > raise FuturesTimeoutError("when running on: %s" % [r.obj for r in results])
< t:2024-01-24 17:34:44,813 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > concurrent.futures._base.TimeoutError: when running on: [functools.partial(<bound method BaseNode.run_nodetool of <sdcm.cluster_gce.GCENode object at 0x7fa7f0196800>>, sub_cmd='decommission', warning_event_on_exception=(<class 'Exception'>,), retry=0), functools.partial(<bound method Nemesis._call_disrupt_func_after_expression_logged of <sdcm.nemesis.SisyphusMonkey object at 0x7fa7d87c77f0>>, expression='DECOMMISSIONING: unbootstrap starts', disrupt_func=<function Nemesis.reboot_node at 0x7fa7f01caf80>, disrupt_func_kwargs={'target_node': <sdcm.cluster_gce.GCENode object at 0x7fa7f0196800>, 'hard': True, 'verify_ssh': True}, delay=0)]
Kernel Version: 5.15.0-1048-gcp
Scylla version (or git commit hash): 2022.1.14-20240123.e90d107d48
with build-id 2c8bf367b4aedcb4b3d360f16ae02749880c24b7
Cluster size: 6 nodes (n1-highmem-16)
Scylla Nodes used in this run:
OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/1281536115900986284
(gce: us-east1)
Test: longevity-10gb-3h-gce-test
Test id: a32da17d-ae7c-44b0-96a5-63feca467fc5
Test name: enterprise-2022.1/longevity/longevity-10gb-3h-gce-test
Test config file(s):
Installation details
Kernel Version: 5.13.0-1031-gcp
Scylla version (or git commit hash): 5.1.dev-20220612.e87ca733f0d3
with build-id 2fcb16ce2dcf313a6d86fce8631e18e3055ab359
Cluster size: 6 nodes (n1-highmem-16)
Scylla Nodes used in this run:
OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/6210294087766559177
(gce: us-east1)
Test: longevity-10gb-3h-gce-test
Test id: 43a559b8-f978-4c69-8590-068f82e64df4
Test name: scylla-master/longevity/longevity-10gb-3h-gce-test
Test config file(s):
Issue description
During DecommissionStreamingErr we use ParallelObject to start a node decommission and, in parallel, watch the logs for the decommission starting and trigger a hard reboot to interrupt it. While this does what it was intended to do, we do get ParallelObject timeouts, most likely because the first command (nodetool decommission, run using the remoter) does not return. The solution would be to ignore the return value of the first command.
Error in question in sct.log:
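The proposed fix (ignoring the return value of the first command) can be sketched under the same simplified thread-pool model as before; `interrupt_streaming` and the lambda stand-ins are illustrative names, not the actual SCT nemesis code:

```python
# Sketch of the proposed fix: submit the trigger (nodetool decommission) without
# waiting on its result, and enforce the timeout only on the watcher.
# Simplified illustration -- not the actual SCT ParallelObject/nemesis code.
import time
from concurrent.futures import ThreadPoolExecutor


def interrupt_streaming(trigger, watcher, timeout):
    pool = ThreadPoolExecutor(max_workers=2)
    # Fire and forget: the hard reboot kills the SSH session, so the
    # decommission command may never return -- its result is ignored.
    pool.submit(trigger)
    watcher_future = pool.submit(watcher)
    try:
        # Only the watcher (log match + hard reboot) must finish in time.
        watcher_future.result(timeout=timeout)
    finally:
        pool.shutdown(wait=False, cancel_futures=True)


events = []
interrupt_streaming(
    trigger=lambda: time.sleep(2),            # stands in for the stuck decommission
    watcher=lambda: events.append("rebooted"),
    timeout=5,
)
print(events)  # ['rebooted']
```

With this shape, a decommission command that hangs after the reboot no longer fails the nemesis; whether the disruption succeeded is judged solely by the watcher's outcome.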
$ hydra investigate show-monitor 43a559b8-f978-4c69-8590-068f82e64df4
$ hydra investigate show-logs 43a559b8-f978-4c69-8590-068f82e64df4
Logs:
Jenkins job URL