scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

YCSB Alternator "failed: Connection refused" message event is not filtered out on node reboot #4738

Closed yarongilor closed 1 year ago

yarongilor commented 2 years ago

Installation details

Kernel Version: 5.13.0-1021-aws
Scylla version (or git commit hash): 5.0~rc3-20220406.f92622e0d with build-id 2b79c4744216b294fdbd2f277940044c899156ea
Cluster size: 4 nodes (i3.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-07835983d717b1ea3 (aws: eu-west-1)

Test: longevity-alternator-200gb-48h-test
Test id: d918f269-fa5e-4057-b74f-88062e9d5d0e
Test name: scylla-5.0/longevity/longevity-alternator-200gb-48h-test
Test config file(s):

Issue description

scenario: running nemesis rolling_config_change_internode_compression

yarongilor@yaron-pc:~/Downloads/logs$ grep -v -i email sct-d918f269.log | egrep -i 'failed: Connection refused|inter node compression to|Restarting node |>>>>>>>' 
...
< t:2022-05-03 05:04:50,146 f:nemesis.py      l:1367 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: >>>>>>>>>>>>>Started random_disrupt_method rolling_config_change_internode_compression

node-1 was restarted successfully, then node-2:

< t:2022-05-03 05:05:35,368 f:nemesis.py      l:703  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: Changing Node alternator-48h-5-0-db-node-d918f269-1 [34.240.216.60 | 10.0.2.103] (seed: True) inter node compression to dc
< t:2022-05-03 05:05:40,927 f:nemesis.py      l:706  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Restarting node Node alternator-48h-5-0-db-node-d918f269-1 [34.240.216.60 | 10.0.2.103] (seed: True)
< t:2022-05-03 05:08:42,152 f:nemesis.py      l:703  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: Changing Node alternator-48h-5-0-db-node-d918f269-2 [54.246.21.14 | 10.0.2.166] (seed: False) inter node compression to dc
< t:2022-05-03 05:08:49,314 f:nemesis.py      l:706  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Restarting node Node alternator-48h-5-0-db-node-d918f269-2 [54.246.21.14 | 10.0.2.166] (seed: False)

Got "Connection refused" for the rebooted node-2:

< t:2022-05-03 05:08:59,982 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 17035028 [Thread-8] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to alternator:8080 [alternator/10.0.2.166] failed: Connection refused (Connection refused)
< t:2022-05-03 05:08:59,983 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 17035028 [Thread-8] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to alternator:8080 [alternator/10.0.2.166] failed: Connection refused (Connection refused)

The message is not filtered out, so the test got this error event:

2022-05-03 05:08:59.982: (YcsbStressEvent Severity.ERROR) period_type=not-set event_id=236ce6a2-eed4-4a6a-81bb-260b094f34b1: type=error node=Node alternator-48h-5-0-loader-node-d918f269-4 [54.217.26.135 | 10.0.3.196] (seed: False)
stress_cmd=bin/ycsb run dynamodb -P workloads/workloadc -threads 20 -p readproportion=0.5 -p updateproportion=0.5 -p recordcount=200200300 -p fieldcount=8 -p fieldlength=128 -p operationcount=2140000000 -p dataintegrity=true -p maxexecutiontime=147600 -s  -P /tmp/dynamodb.properties
errors:
17035028 [Thread-8] ERROR site.ycsb.db.DynamoDBClient  -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to alternator:8080 [alternator/10.0.2.166] failed: Connection refused (Connection refused)


Logs:

Jenkins job URL

yarongilor commented 2 years ago

@fgelcer , can you advise - it looks like this event should have been filtered out by: https://github.com/scylladb/scylla-cluster-tests/pull/4220

fruch commented 2 years ago

@yarongilor the ignore_ycsb_connection_refused filter that was fixed in #4220 is only used in upgrade_tests.

When YCSB is used, it reaches the cluster through the DNS, so there are cases where it is pointed at a node that is down.

Filtering it might be problematic, since we'd need to do so in every place where we take a node down.

yarongilor commented 2 years ago

@fruch , why not apply this filter, in some generic way, to every nemesis that includes a reboot? Otherwise, what's the alternative: changing this error's severity to "warning"?
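To make the second option concrete, here is a minimal sketch of downgrading the severity of this one message instead of dropping it. The `Severity` enum, `CONNECTION_REFUSED` pattern and `effective_severity` function are hypothetical names for illustration only; they are not SCT's event classes.

```python
import re
from enum import Enum


class Severity(Enum):
    # Hypothetical stand-in for a test framework's severity levels.
    WARNING = 2
    ERROR = 3


# Matches the message reported in this issue.
CONNECTION_REFUSED = re.compile(r"failed: Connection refused")


def effective_severity(line, default=Severity.ERROR):
    """Return WARNING for 'Connection refused' lines, the default otherwise."""
    if CONNECTION_REFUSED.search(line):
        return Severity.WARNING
    return default
```

The trade-off is that a downgrade like this applies all the time, so a genuine connection failure outside of a node restart would also be reported only as a warning.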

KnifeyMoloko commented 2 years ago

Bumped into the same in:

Installation details

Kernel Version: 5.13.0-1025-aws
Scylla version (or git commit hash): 2022.1~rc7-20220602.7abea3aad with build-id 57fb7e7c94bbac6498149648f3818be3c1322ef9
Cluster size: 6 nodes (i3.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0c0c4f759c88cd17d (aws: us-east-1)

Test: longevity-alternator-3h-test
Test id: c0719d2a-85bb-4e6c-b228-4479afd09a0a
Test name: enterprise-2022.1/longevity/longevity-alternator-3h-test
Test config file(s):

Issue description


Logs:

Jenkins job URL

roydahan commented 2 years ago

@fruch is there a general solution we can apply here? If not, and a filter is needed for every nemesis that may reboot the node we provide to YCSB (like rolling-restart), let's do it.

fruch commented 2 years ago

I can't think of anything general, except always filtering it (i.e. ignoring it completely).

Since we already have a context manager for those, we can apply it to the nemeses where we encountered the issue.

roydahan commented 2 years ago

Yes, that’s the simple solution, but I thought maybe there is a way to let YCSB know more hosts or something like that.

fruch commented 2 years ago

We are using the dynamodb client, and it only knows one DNS name; AWS clients aren't aware of individual nodes.

This is why we are using a DNS server to do the "balancing"; a proper load balancer is not implemented (neither in SCT nor in scylla-cloud).
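A toy model of why this setup produces the error: the client asks for one name and gets back whichever address the DNS hands out, including a node that is restarting. The round-robin rotation below is an assumption made for illustration (the thread does not say how the DNS server picks addresses), and `make_resolver` and `connect` are hypothetical helpers.

```python
from itertools import cycle


def make_resolver(addresses):
    """Return a function that hands out addresses in round-robin order.

    Assumption: a simple rotation with no health checks.
    """
    rotation = cycle(addresses)
    return lambda: next(rotation)


def connect(address, nodes_up):
    """Simulate a connection attempt; fail if the node is not up."""
    if address not in nodes_up:
        raise ConnectionRefusedError(
            f"Connect to alternator:8080 [alternator/{address}] "
            "failed: Connection refused"
        )
    return address
```

With two addresses in rotation and one node down, every second connection attempt fails in this model, because nothing removes the restarting node from the rotation.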

yarongilor commented 1 year ago

Reproduced in restart-with-resharding nemesis:

(YcsbStressEvent Severity.ERROR) period_type=not-set event_id=566796be-31ea-47a1-85fe-030cfbf88357: type=error node=Node alternator-ttl-4-loaders-no-lwt-sis-loader-node-7da36ba4-3 [3.252.127.132 | 10.4.1.47] (seed: False)
stress_cmd=bin/ycsb load dynamodb -P workloads/workloadc -threads 13 -p recordcount=8589934401 -p fieldcount=2 -p fieldlength=16 -p insertstart=2147483600 -p insertcount=2147483600 -p table=usertable_no_lwt -p dynamodb.ttlKey=ttl -p dynamodb.ttlDuration=43200 -s -P /tmp/dynamodb.properties -p maxexecutiontime=180600
errors:

1265545 [Thread-12] ERROR site.ycsb.db.DynamoDBClient -com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to alternator:8080 [alternator/10.4.0.41] failed: Connection refused (Connection refused)

Installation details

Kernel Version: 5.15.0-1019-aws
Scylla version (or git commit hash): 5.1.0~rc1-20220902.d10aee15e7e9 with build-id c127c717ecffa082ce97b94100d62b2549abe486
Relocatable Package: http://downloads.scylladb.com/unstable/scylla/branch-5.1/relocatable/2022-09-03T00:52:08Z/scylla-x86_64-package.tar.gz
Cluster size: 4 nodes (i3.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0437de2d7a582f47e (aws: us-east-1)

Test: longevity-alternator-1h-scan-12h-ttl-no-lwt-2h-grace-4loaders-nemesis
Test id: 1670b377-7689-4fce-9ea5-27d154c7c954
Test name: scylla-staging/yarongilor/longevity-alternator-1h-scan-12h-ttl-no-lwt-2h-grace-4loaders-nemesis
Test config file(s):

Issue description


Logs:

Jenkins job URL

fruch commented 1 year ago

@yarongilor instead of continuing to add to this issue: it's a one-line change: https://github.com/scylladb/scylla-cluster-tests/pull/5391

yarongilor commented 1 year ago

@fruch , what about the other nemeses? Should it be applied to them in the same way?

fruch commented 1 year ago

We added it where it was obvious there's a reboot/restart. Clearly we missed a few; if you happen to encounter it again during a node restart, you now know what to do.