YCSB Alternator "failed: Connection refused" message event is not filtered out on node reboot

yarongilor commented 2 years ago

Installation details

Kernel Version: 5.13.0-1021-aws
Scylla version (or git commit hash): 5.0~rc3-20220406.f92622e0d with build-id 2b79c4744216b294fdbd2f277940044c899156ea
Cluster size: 4 nodes (i3.4xlarge)

Scylla Nodes used in this run:

alternator-48h-5-0-db-node-d918f269-7 (54.216.88.199 | 10.0.2.225) (shards: 14)
alternator-48h-5-0-db-node-d918f269-6 (34.253.233.182 | 10.0.3.56) (shards: 14)
alternator-48h-5-0-db-node-d918f269-5 (52.214.229.36 | 10.0.3.33) (shards: 14)
alternator-48h-5-0-db-node-d918f269-4 (34.249.215.199 | 10.0.3.166) (shards: 14)
alternator-48h-5-0-db-node-d918f269-3 (54.75.115.47 | 10.0.1.108) (shards: 14)
alternator-48h-5-0-db-node-d918f269-2 (54.246.21.14 | 10.0.2.166) (shards: 14)
alternator-48h-5-0-db-node-d918f269-1 (34.240.216.60 | 10.0.2.103) (shards: 14)

OS / Image: ami-07835983d717b1ea3 (aws: eu-west-1)

Test: longevity-alternator-200gb-48h-test
Test id: d918f269-fa5e-4057-b74f-88062e9d5d0e
Test name: scylla-5.0/longevity/longevity-alternator-200gb-48h-test
Test config file(s):

longevity-alternator-200GB-48h.yaml

Issue description

scenario: running nemesis rolling_config_change_internode_compression

**>>>>>>>**
yarongilor@yaron-pc:~/Downloads/logs$ grep -v -i email sct-d918f269.log | egrep -i 'failed: Connection refused|inter node compression to|Restarting node |>>>>>>>' 
...
< t:2022-05-03 05:04:50,146 f:nemesis.py      l:1367 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: >>>>>>>>>>>>>Started random_disrupt_method rolling_config_change_internode_compression

Node-1 was rebooted successfully, then node-2:

< t:2022-05-03 05:05:35,368 f:nemesis.py      l:703  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: Changing Node alternator-48h-5-0-db-node-d918f269-1 [34.240.216.60 | 10.0.2.103] (seed: True) inter node compression to dc
< t:2022-05-03 05:05:40,927 f:nemesis.py      l:706  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Restarting node Node alternator-48h-5-0-db-node-d918f269-1 [34.240.216.60 | 10.0.2.103] (seed: True)
< t:2022-05-03 05:08:42,152 f:nemesis.py      l:703  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: Changing Node alternator-48h-5-0-db-node-d918f269-2 [54.246.21.14 | 10.0.2.166] (seed: False) inter node compression to dc
< t:2022-05-03 05:08:49,314 f:nemesis.py      l:706  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Restarting node Node alternator-48h-5-0-db-node-d918f269-2 [54.246.21.14 | 10.0.2.166] (seed: False)

Got "Connection refused" for the rebooted node-2:

< t:2022-05-03 05:08:59,982 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 17035028 [Thread-8] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to alternator:8080 [alternator/10.0.2.166] failed: Connection refused (Connection refused)
< t:2022-05-03 05:08:59,983 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 17035028 [Thread-8] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to alternator:8080 [alternator/10.0.2.166] failed: Connection refused (Connection refused)

The message is not filtered out, and the test got:

2022-05-03 05:08:59.982: (YcsbStressEvent Severity.ERROR) period_type=not-set event_id=236ce6a2-eed4-4a6a-81bb-260b094f34b1: type=error node=Node alternator-48h-5-0-loader-node-d918f269-4 [54.217.26.135 | 10.0.3.196] (seed: False)
stress_cmd=bin/ycsb run dynamodb -P workloads/workloadc -threads 20 -p readproportion=0.5 -p updateproportion=0.5 -p recordcount=200200300 -p fieldcount=8 -p fieldlength=128 -p operationcount=2140000000 -p dataintegrity=true -p maxexecutiontime=147600 -s  -P /tmp/dynamodb.properties
errors:
17035028 [Thread-8] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to alternator:8080 [alternator/10.0.2.166] failed: Connection refused (Connection refused)

<<<<<<<

Restore Monitor Stack command: $ hydra investigate show-monitor d918f269-fa5e-4057-b74f-88062e9d5d0e
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs d918f269-fa5e-4057-b74f-88062e9d5d0e

Logs:

db-cluster-d918f269.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d918f269-fa5e-4057-b74f-88062e9d5d0e/20220504_182348/db-cluster-d918f269.tar.gz
monitor-set-d918f269.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d918f269-fa5e-4057-b74f-88062e9d5d0e/20220504_182348/monitor-set-d918f269.tar.gz
loader-set-d918f269.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d918f269-fa5e-4057-b74f-88062e9d5d0e/20220504_182348/loader-set-d918f269.tar.gz
normal-d918f269.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d918f269-fa5e-4057-b74f-88062e9d5d0e/20220504_182348/normal-d918f269.log.tar.gz
summary-d918f269.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d918f269-fa5e-4057-b74f-88062e9d5d0e/20220504_182348/summary-d918f269.log.tar.gz
events-d918f269.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d918f269-fa5e-4057-b74f-88062e9d5d0e/20220504_182348/events-d918f269.log.tar.gz
output-d918f269.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d918f269-fa5e-4057-b74f-88062e9d5d0e/20220504_182348/output-d918f269.log.tar.gz
debug-d918f269.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d918f269-fa5e-4057-b74f-88062e9d5d0e/20220504_182348/debug-d918f269.log.tar.gz
sct-d918f269.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d918f269-fa5e-4057-b74f-88062e9d5d0e/20220504_182348/sct-d918f269.log.tar.gz
error-d918f269.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d918f269-fa5e-4057-b74f-88062e9d5d0e/20220504_182348/error-d918f269.log.tar.gz
critical-d918f269.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d918f269-fa5e-4057-b74f-88062e9d5d0e/20220504_182348/critical-d918f269.log.tar.gz
raw_events-d918f269.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d918f269-fa5e-4057-b74f-88062e9d5d0e/20220504_182348/raw_events-d918f269.log.tar.gz
warning-d918f269.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d918f269-fa5e-4057-b74f-88062e9d5d0e/20220504_182348/warning-d918f269.log.tar.gz
email_data-d918f269.json.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d918f269-fa5e-4057-b74f-88062e9d5d0e/20220504_182348/email_data-d918f269.json.tar.gz
argus-d918f269.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d918f269-fa5e-4057-b74f-88062e9d5d0e/20220504_182348/argus-d918f269.log.tar.gz
left_processes-d918f269.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/d918f269-fa5e-4057-b74f-88062e9d5d0e/20220504_182348/left_processes-d918f269.log.tar.gz

Jenkins job URL

yarongilor commented 2 years ago

@fgelcer , can you advise - it looks like this event should have been filtered out by: https://github.com/scylladb/scylla-cluster-tests/pull/4220

fruch commented 2 years ago

@yarongilor the ignore_ycsb_connection_refused filter that was fixed in #4220 is only used in upgrade_tests.

Since YCSB reaches the cluster through DNS, there are cases where it will pick a node that is down.

Filtering it might be problematic, since we'd need to do so in every place where we take a node down.

yarongilor commented 2 years ago

@fruch, why not apply this filter in some generic way to all nemeses that include a reboot? Otherwise, what's the alternative: changing this error's severity to "warning"?

KnifeyMoloko commented 2 years ago

Bumped into the same in:

Installation details

Kernel Version: 5.13.0-1025-aws
Scylla version (or git commit hash): 2022.1~rc7-20220602.7abea3aad with build-id 57fb7e7c94bbac6498149648f3818be3c1322ef9
Cluster size: 6 nodes (i3.4xlarge)

Scylla Nodes used in this run:

alternator-3h-2022-1-db-node-c0719d2a-8 (3.222.245.150 | 10.0.0.15) (shards: 14)
alternator-3h-2022-1-db-node-c0719d2a-7 (44.201.15.50 | 10.0.1.223) (shards: 14)
alternator-3h-2022-1-db-node-c0719d2a-6 (44.201.21.174 | 10.0.3.168) (shards: 14)
alternator-3h-2022-1-db-node-c0719d2a-5 (3.80.161.195 | 10.0.2.102) (shards: 14)
alternator-3h-2022-1-db-node-c0719d2a-4 (44.204.96.57 | 10.0.1.13) (shards: 14)
alternator-3h-2022-1-db-node-c0719d2a-3 (54.160.24.50 | 10.0.3.145) (shards: 14)
alternator-3h-2022-1-db-node-c0719d2a-2 (3.238.57.91 | 10.0.0.52) (shards: 14)
alternator-3h-2022-1-db-node-c0719d2a-1 (54.236.243.133 | 10.0.0.226) (shards: 14)

OS / Image: ami-0c0c4f759c88cd17d (aws: us-east-1)

Test: longevity-alternator-3h-test
Test id: c0719d2a-85bb-4e6c-b228-4479afd09a0a
Test name: enterprise-2022.1/longevity/longevity-alternator-3h-test
Test config file(s):

longevity-alternator-3h.yaml

Restore Monitor Stack command: $ hydra investigate show-monitor c0719d2a-85bb-4e6c-b228-4479afd09a0a
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs c0719d2a-85bb-4e6c-b228-4479afd09a0a

Logs:

db-cluster-c0719d2a.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c0719d2a-85bb-4e6c-b228-4479afd09a0a/20220603_092419/db-cluster-c0719d2a.tar.gz
monitor-set-c0719d2a.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c0719d2a-85bb-4e6c-b228-4479afd09a0a/20220603_092419/monitor-set-c0719d2a.tar.gz
loader-set-c0719d2a.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c0719d2a-85bb-4e6c-b228-4479afd09a0a/20220603_092419/loader-set-c0719d2a.tar.gz
sct-runner-c0719d2a.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c0719d2a-85bb-4e6c-b228-4479afd09a0a/20220603_092419/sct-runner-c0719d2a.tar.gz
parallel-timelines-report-c0719d2a.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c0719d2a-85bb-4e6c-b228-4479afd09a0a/20220603_092419/parallel-timelines-report-c0719d2a.tar.gz

Jenkins job URL

roydahan commented 2 years ago

@fruch is there a general solution we can apply here? If not, and a filter is needed for every nemesis that may reboot the node we provide to YCSB (like rolling restart), let's do it.

fruch commented 2 years ago

I can't think of anything general, except always filtering it (i.e. ignoring it completely).

Since we already have a context manager for those, we can apply it to the nemeses where we encountered the issue.

roydahan commented 2 years ago

Yes, that’s the simple solution, but I thought maybe there is a way to let YCSB know more hosts or something like that.

fruch commented 2 years ago

We are using the DynamoDB client, which only knows one DNS name; AWS clients aren't aware of individual nodes.

This is why we are using a DNS server to do the "balancing". A proper load balancer is not implemented (neither in SCT nor in scylla-cloud).
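
To illustrate why DNS-based "balancing" exposes the client to a restarting node, here is a small simulation. It assumes, purely for illustration, that the addresses behind the single name are handed out in strict round-robin order and that nothing checks node health; `simulate_requests` is a hypothetical helper, not part of SCT or YCSB.

```python
from itertools import cycle


def simulate_requests(addresses, down, n_requests):
    """Hand out addresses round-robin and collect the refused connections.

    `addresses` are the node IPs behind one DNS name, `down` is the set of
    nodes currently restarting. No health check is performed, so a client
    that is given a down node gets "Connection refused".
    """
    rotation = cycle(addresses)
    refused = []
    for _ in range(n_requests):
        addr = next(rotation)
        if addr in down:
            refused.append(addr)
    return refused
```

With four nodes and one of them restarting, every fourth request in this model lands on the node that is down, which matches the kind of intermittent error reported above.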

yarongilor commented 1 year ago

Reproduced in restart-with-resharding nemesis:

(YcsbStressEvent Severity.ERROR) period_type=not-set event_id=566796be-31ea-47a1-85fe-030cfbf88357: type=error node=Node alternator-ttl-4-loaders-no-lwt-sis-loader-node-7da36ba4-3 [3.252.127.132 | 10.4.1.47] (seed: False)
stress_cmd=bin/ycsb load dynamodb -P workloads/workloadc -threads 13 -p recordcount=8589934401 -p fieldcount=2 -p fieldlength=16 -p insertstart=2147483600 -p insertcount=2147483600 -p table=usertable_no_lwt -p dynamodb.ttlKey=ttl -p dynamodb.ttlDuration=43200 -s -P /tmp/dynamodb.properties -p maxexecutiontime=180600
errors:

1265545 [Thread-12] ERROR site.ycsb.db.DynamoDBClient -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to alternator:8080 [alternator/10.4.0.41] failed: Connection refused (Connection refused)

Installation details

Kernel Version: 5.15.0-1019-aws
Scylla version (or git commit hash): 5.1.0~rc1-20220902.d10aee15e7e9 with build-id c127c717ecffa082ce97b94100d62b2549abe486
Relocatable Package: http://downloads.scylladb.com/unstable/scylla/branch-5.1/relocatable/2022-09-03T00:52:08Z/scylla-x86_64-package.tar.gz
Cluster size: 4 nodes (i3.4xlarge)

Scylla Nodes used in this run:

alternator-ttl-4-loaders-no-lwt-sis-db-node-1670b377-4 (44.203.62.146 | 10.12.3.55) (shards: 14)
alternator-ttl-4-loaders-no-lwt-sis-db-node-1670b377-3 (3.230.3.160 | 10.12.3.95) (shards: 14)
alternator-ttl-4-loaders-no-lwt-sis-db-node-1670b377-2 (3.235.52.173 | 10.12.1.56) (shards: 14)
alternator-ttl-4-loaders-no-lwt-sis-db-node-1670b377-1 (3.219.33.213 | 10.12.0.108) (shards: 14)

OS / Image: ami-0437de2d7a582f47e (aws: us-east-1)

Test: longevity-alternator-1h-scan-12h-ttl-no-lwt-2h-grace-4loaders-nemesis
Test id: 1670b377-7689-4fce-9ea5-27d154c7c954
Test name: scylla-staging/yarongilor/longevity-alternator-1h-scan-12h-ttl-no-lwt-2h-grace-4loaders-nemesis
Test config file(s):

longevity-alternator-1h-scan-12h-ttl-no-lwt-2h-grace-4loaders-sisyphus.yaml

Restore Monitor Stack command: $ hydra investigate show-monitor 1670b377-7689-4fce-9ea5-27d154c7c954
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 1670b377-7689-4fce-9ea5-27d154c7c954

Logs:

db-cluster-1670b377.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/1670b377-7689-4fce-9ea5-27d154c7c954/20221022_162921/db-cluster-1670b377.tar.gz
monitor-set-1670b377.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/1670b377-7689-4fce-9ea5-27d154c7c954/20221022_162921/monitor-set-1670b377.tar.gz
loader-set-1670b377.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/1670b377-7689-4fce-9ea5-27d154c7c954/20221022_162921/loader-set-1670b377.tar.gz
sct-runner-1670b377.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/1670b377-7689-4fce-9ea5-27d154c7c954/20221022_162921/sct-runner-1670b377.tar.gz

Jenkins job URL

fruch commented 1 year ago

@yarongilor instead of continuing to add to this issue: it's a one-line change: https://github.com/scylladb/scylla-cluster-tests/pull/5391

yarongilor commented 1 year ago

@fruch, what about other nemeses? Should it be applied the same way?

fruch commented 1 year ago

We added it where it was obvious there's a reboot/restart. Clearly we missed a few; if you happen to encounter it again during a restart of a node, you now know what to do.
