Open vponomaryov opened 2 years ago
The problem with the hanging is that the wait_for_init method waits, by default, 6 hours for a new node to initialize. So, the solution here is to introduce an additional check that catches problems on the go...
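A minimal sketch of such an on-the-go check, polling the node's log for a fatal startup error instead of blocking for the full timeout. Note that node.read_system_log and node.is_initialized are hypothetical stand-ins here, not the real SCT API:

```python
import re
import time

# Fatal startup marker, taken from the "init - Startup failed" log line below.
STARTUP_FAILED = re.compile(r"init - Startup failed")

def wait_for_init_with_check(node, timeout=6 * 3600, poll=30):
    """Wait for node init, but abort early if the node logged a startup failure."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        log = node.read_system_log()  # hypothetical accessor for the node's log
        if STARTUP_FAILED.search(log):
            # Fail fast instead of silently waiting out the remaining hours.
            raise RuntimeError("node failed to start; aborting wait early")
        if node.is_initialized():     # hypothetical readiness check
            return
        time.sleep(poll)
    raise TimeoutError("node did not initialize within %ss" % timeout)
```

The key design point is that a startup failure is terminal: no amount of further waiting will bring the node up, so the waiter should raise immediately rather than run down the clock.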
According to the logs, I see that there is this error at node start:
ERR | [shard 0] init - Startup failed: exceptions::mutation_write_timeout_exception (Operation timed out for system_distributed_everywhere.cdc_generation_descriptions_v2 - received only 5 responses from 6 CL=ALL.)
But this error is converted to a WARNING by the logic from https://github.com/scylladb/scylla-cluster-tests/pull/3521/files:
06:38:38 < t:2022-02-20 05:38:38,562 f:file_logger.py l:89 c:sdcm.sct_events.file_logger p:INFO > 2022-02-20 05:38:38.558 <2022-02-20 05:38:38.000>: (DatabaseLogEvent Severity.WARNING) period_type=one-time event_id=3b16a216-68f6-4470-8cc2-0addbd2ccb35: type=SYSTEM_PAXOS_TIMEOUT regex=(mutation_write_|Operation timed out for system.paxos|Operation failed for system.paxos) line_number=14762 node=longevity-100gb-4h-master-db-node-490f8141-7
After this error the test just hangs for 8h (MAX_TIME_WAIT_FOR_NEW_NODE_UP: int = HOUR_IN_SEC * 8).
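The downgrade happens because the SYSTEM_PAXOS_TIMEOUT regex shown in the event above is broad enough to match this startup failure. A small check demonstrates it, using the regex and the log line quoted above verbatim:

```python
import re

# The SYSTEM_PAXOS_TIMEOUT downgrade pattern from the DatabaseLogEvent above.
pattern = re.compile(
    r"(mutation_write_|Operation timed out for system.paxos|"
    r"Operation failed for system.paxos)"
)

# The fatal startup error from the node's log.
startup_failure = (
    "ERR | [shard 0] init - Startup failed: "
    "exceptions::mutation_write_timeout_exception "
    "(Operation timed out for system_distributed_everywhere."
    "cdc_generation_descriptions_v2 - received only 5 responses from 6 CL=ALL.)"
)

# The bare "mutation_write_" alternative matches the startup failure too,
# so the fatal error is demoted to a WARNING and never stops the test.
assert pattern.search(startup_failure) is not None
```

It is the first alternative, mutation_write_, that fires here; the error has nothing to do with system.paxos at all.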
@juliayakovlev, could you please remind me why this change was introduced? What problem did it solve? Can you suggest a solution for the current issue?
Even if this was at error level, we wouldn't stop the test because of it, and it would still wait for the whole 8h.
It was downgraded to a warning because a lot of LWT-related code paths ended up with timeouts in system.paxos-related writes, and it was agreed to ignore them. .*mutation_write_* is a bit wide, though, and I think it should be reverted.
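One possible narrowing, purely as an illustration of the point (not a proposed patch): drop the broad mutation_write_ branch and keep only the system.paxos alternatives, so LWT timeouts are still downgraded while startup failures in other tables keep their ERROR severity:

```python
import re

# Narrowed downgrade rule: match only system.paxos timeouts, not every
# occurrence of "mutation_write_".
narrowed = re.compile(
    r"(Operation timed out for system.paxos|Operation failed for system.paxos)"
)

# The fatal startup error from the issue (quoted from the node's log).
startup_failure = (
    "ERR | [shard 0] init - Startup failed: "
    "exceptions::mutation_write_timeout_exception "
    "(Operation timed out for system_distributed_everywhere."
    "cdc_generation_descriptions_v2 - received only 5 responses from 6 CL=ALL.)"
)
# Hypothetical example of an LWT timeout line, for illustration only.
paxos_timeout = "Operation timed out for system.paxos - received only 0 responses"

assert narrowed.search(paxos_timeout) is not None   # still downgraded
assert narrowed.search(startup_failure) is None     # stays an ERROR
```

With this shape of rule, the LWT noise the original PR wanted to silence is still silenced, but the cdc_generation_descriptions_v2 startup failure would surface as a real error instead of a warning.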
Description
If we run the disrupt_nodetool_decommission nemesis and it fails to add a node, then this nemesis never ends. That blocks every nemesis scheduled after it, so we do not get any errors, and the CI job may finish successfully, which is a false positive.
Running the longevity-100gb-4h-ebs-gp3-test job and starting the mentioned nemesis, we see the following picture: node-2 successfully gets decommissioned in 15 minutes, then we start provisioning a new node.
Logs on the new node:
The last kind of message repeats for several minutes, and after that neither the node nor the nemesis makes any progress.
Steps to Reproduce
Run the disrupt_nodetool_decommission nemesis as part of the longevity-100gb-4h-ebs-gp3-test job.
Expected behavior: the nemesis should not hang; it must finish, either successfully or not, but finish.
Actual behavior: the nemesis hangs until the end of the test; only the load keeps running. The hung nemesis is absent from the email report, so the run looks like a false positive even though an issue happened.