scylladb / scylla-bench


scylla-bench fails to reconnect after altering table #114

Open soyacz opened 1 year ago

soyacz commented 1 year ago

Installation details

Kernel Version: 5.15.0-1026-aws
Scylla version (or git commit hash): 5.2.0~dev-20221209.6075e01312a5 with build-id 0e5d044b8f9e5bdf7f53cc3c1e959fab95bf027c

Cluster size: 9 nodes (i3.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0b85d6f35bddaff65 ami-0a1ff01b931943772 ami-08e5c2ae0089cade3 (aws: eu-west-1)

Test: longevity-counters-6h-multidc-test
Test id: 7785df01-a1fe-483a-beb7-2f63b9044b87
Test name: scylla-master/raft/longevity-counters-6h-multidc-test
Test config file(s):

Issue description

The counters test in the multi-DC scenario fails persistently after altering the table, e.g. after running ALTER TABLE scylla_bench.test_counters WITH bloom_filter_fp_chance = 0.45374057709882093, ALTER TABLE scylla_bench.test_counters WITH read_repair_chance = 0.9;, or even ALTER TABLE scylla_bench.test_counters WITH comment = 'IHQS6RAYS5VQ6CQZYBYEX1GP';. After such changes, scylla-bench fails the test with the following error:

2022/12/09 15:26:29 error: failed to connect to "[HostInfo hostname=\"10.12.0.119\" connectAddress=\"10.12.0.119\" peer=\"<nil>\" rpc_address=\"10.12.0.119\" broadcast_address=\"10.12.0.119\" preferred_ip=\"<nil>\" connect_addr=\"10.12.0.119\" connect_addr_source=\"connect_address\" port=9042 data_centre=\"us-eastscylla_node_east\" rack=\"1a\" host_id=\"ec773dfb-ef87-4ab8-abbf-190e3e082e4c\" version=\"v3.0.8\" state=DOWN num_tokens=256]" due to error: gocql: no response to connection startup within timeout

Later the connection appears to recover, so the connection issues are not permanent, but they are enough to fail the test with a critical error and end it.

Logs:

| 20221209_161654 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_161654/grafana-screenshot-longevity-counters-6h-multidc-test-scylla-per-server-metrics-nemesis-20221209_161803-longevity-counters-multidc-master-monitor-node-7785df01-1.png |
| 20221209_161654 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_161654/grafana-screenshot-overview-20221209_161654-longevity-counters-multidc-master-monitor-node-7785df01-1.png |
| 20221209_162553 | db-cluster | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_162553/db-cluster-7785df01.tar.gz |
| 20221209_162553 | loader-set | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_162553/loader-set-7785df01.tar.gz |
| 20221209_162553 | monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_162553/monitor-set-7785df01.tar.gz |
| 20221209_162553 | sct | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_162553/sct-runner-7785df01.tar.gz |

Jenkins job URL

fruch commented 1 year ago

Maybe

Timeout:         5s

isn't enough for this test case?

soyacz commented 1 year ago

I'm not sure; the disconnections sometimes persisted for 2 minutes. We would need to test it.

soyacz commented 1 year ago

I tried with timeout settings like this: -timeout 15s -retry-interval=80ms,5s -retry-number=20 and it failed anyway.
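
For a rough sense of how long those settings keep retrying, here is a back-of-the-envelope calculation, assuming scylla-bench doubles the retry interval from the minimum up to the maximum cap (an assumption about its backoff policy, not verified against the source):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// -retry-interval=80ms,5s -retry-number=20, assuming the interval
	// doubles from the minimum and is capped at the maximum
	// (hypothetical backoff model; scylla-bench's actual policy may differ).
	minIvl, maxIvl := 80*time.Millisecond, 5*time.Second
	retries := 20

	total := time.Duration(0)
	interval := minIvl
	for i := 0; i < retries; i++ {
		total += interval
		interval *= 2
		if interval > maxIvl {
			interval = maxIvl
		}
	}
	fmt.Println(total) // ~1m15s in total, well short of the ~2 minute disconnections seen above
}
```

Under that assumption the retry window is only about 75 seconds, which would not cover a 2-minute outage.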

KnifeyMoloko commented 1 year ago

While running a large-partitions test I encountered a similar problem. Not sure if it's tied to this, but it's a possibility. After the pre-write workload, when starting one of the stress workloads, we got:

2022-12-08 21:08:42.623: (ScyllaBenchEvent Severity.CRITICAL) period_type=end event_id=a5a01a5d-ef1f-4c96-9836-7b6b23c0d77e duration=10s: node=Node longevity-large-partitions-4d-maste-loader-node-a967ab57-2 [34.249.171.113 | 10.4.2.108] (seed: False)
stress_cmd=scylla-bench -workload=uniform -mode=read -replication-factor=3 -partition-count=60 -clustering-row-count=10000000 -clustering-row-size=2048 -rows-per-request=2000 -timeout=180s -concurrency=700 -max-rate=64000  -duration=5760m -connection-count 500 -error-at-row-limit 1000 -nodes 10.4.1.5,10.4.2.90,10.4.2.71,10.4.1.191
errors:
Stress command completed with bad status 1: 2022/12/08 21:08:42 gocql: unable to create session: unable to fetch peer host info: Operation timed

Running the same job with a pinned version of scylla-bench (0.1.14) did not reproduce this issue. Similarly, a run without Raft did not fail at this point, so there might be some flakiness involved here.
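
For context, the failing step is gocql's initial host discovery: while creating the session, the driver reads system.local and system.peers over its control connection before any user queries run, and that is where "unable to fetch peer host info" comes from. A minimal sketch of the relevant knobs, using hosts from the stress command above; the exact values scylla-bench passes to gocql are an assumption here, not taken from its source:

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Hosts taken from the stress command above.
	cluster := gocql.NewCluster("10.4.1.5", "10.4.2.90", "10.4.2.71", "10.4.1.191")
	// ConnectTimeout bounds establishing each connection; Timeout bounds
	// individual requests. The internal system.local/system.peers reads
	// performed during CreateSession appear to fall under these same
	// limits (assumption about gocql internals, worth verifying).
	cluster.ConnectTimeout = 15 * time.Second
	cluster.Timeout = 15 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		// This is where "unable to create session: unable to fetch peer
		// host info" surfaces when the system.peers read times out.
		log.Fatal(err)
	}
	defer session.Close()
}
```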

Installation details

Kernel Version: 5.15.0-1026-aws
Scylla version (or git commit hash): 5.2.0~dev-20221208.a076ceef97d5 with build-id 020ec076898a692651fd48edfb1920fc190cd81e

Cluster size: 4 nodes (i3en.3xlarge)

Scylla Nodes used in this run:

OS / Image: ami-063cdd564cd2fbe46 (aws: eu-west-1)

Test: longevity-large-partition-4days-test
Test id: a967ab57-4860-4f31-8b0a-d940b857542e
Test name: scylla-master/raft/longevity-large-partition-4days-test
Test config file(s):


Logs:

Jenkins job URL

roydahan commented 1 year ago

@avelanarius, we suspect there is a regression, or at least a behavior change, in how s-b works for us with the later (latest?) gocql driver. We're kind of lost on how to debug it or how to make progress. Can you please help us or advise on how to debug it further?

juliayakovlev commented 1 month ago

scylla-bench failed with unable to create session: unable to fetch peer host info even though all nodes were up and healthy:

< t:2024-07-25 16:14:19,760 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 2024/07/25 16:14:19 gocql: unable to create session: unable to fetch peer host info: Operation timed out for system.peers - received only 0 responses from 1 CL=ONE.
< t:2024-07-25 16:14:19,761 f:base.py         l:146  c:RemoteLibSSH2CmdRunner p:ERROR > Error executing command: "sudo  docker exec 5e5d3d02c589373354dd8ad087985ca17a7db44f6cd5f9a9d115641b82f41fb0 /bin/sh -c 'scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=750 -partition-offset=1251 -clustering-row-count=200000 -clustering-row-size=uniform:100..8192 -concurrency=10 -connection-count=10 -consistency-level=quorum -rows-per-request=10 -timeout=90s -iterations=0 -duration=720m  -error-at-row-limit 1000 -nodes 10.142.0.207,10.142.0.236,10.142.0.240,10.142.0.242,10.142.0.248'"; Exit status: 1
< t:2024-07-25 16:14:19,761 f:base.py         l:150  c:RemoteLibSSH2CmdRunner p:DEBUG > STDERR: 2024/07/25 16:14:19 gocql: unable to create session: unable to fetch peer host info: Operation timed out for system.peers - received only 0 responses from 1 CL=ONE.

< t:2024-07-25 16:14:19,763 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-07-25 16:14:19.761: (ScyllaBenchEvent Severity.ERROR) period_type=end event_id=53d8c0a0-2870-4453-b7e3-7df585f03411 during_nemesis=RunUniqueSequence duration=18s: node=Node longevity-large-partitions-200k-pks-loader-node-53145d7f-0-1 [35.196.217.128 | 10.142.0.250]

Packages

Scylla version: 2023.1.11-20240725.11a2022bd6ed with build-id a0cab71f78c44bb0b694d46800fbcaef02607251

Kernel Version: 5.15.0-1065-gcp


Installation details

Cluster size: 5 nodes (n2-highmem-16)

Scylla Nodes used in this run:

OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/6980420640571389317 (gce: undefined_region)

Test: longevity-large-partition-200k-pks-4days-gce-test
Test id: 53145d7f-6918-4728-acc6-6236916d8d08
Test name: enterprise-2023.1/longevity/longevity-large-partition-200k-pks-4days-gce-test
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 53145d7f-6918-4728-acc6-6236916d8d08`
- Restore monitor on AWS instance using Jenkins job: https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=53145d7f-6918-4728-acc6-6236916d8d08
- Show all stored logs command: `$ hydra investigate show-logs 53145d7f-6918-4728-acc6-6236916d8d08`

Logs:

- db-cluster-53145d7f.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/53145d7f-6918-4728-acc6-6236916d8d08/20240726_041924/db-cluster-53145d7f.tar.gz
- sct-runner-events-53145d7f.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/53145d7f-6918-4728-acc6-6236916d8d08/20240726_041924/sct-runner-events-53145d7f.tar.gz
- sct-53145d7f.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/53145d7f-6918-4728-acc6-6236916d8d08/20240726_041924/sct-53145d7f.log.tar.gz
- loader-set-53145d7f.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/53145d7f-6918-4728-acc6-6236916d8d08/20240726_041924/loader-set-53145d7f.tar.gz
- monitor-set-53145d7f.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/53145d7f-6918-4728-acc6-6236916d8d08/20240726_041924/monitor-set-53145d7f.tar.gz

Jenkins job URL: https://jenkins.scylladb.com/job/enterprise-2023.1/job/longevity/job/longevity-large-partition-200k-pks-4days-gce-test/18/
Argus: https://argus.scylladb.com/test/66ff9e4a-0655-4bba-89f4-e4eb2d78691d/runs?additionalRuns[]=53145d7f-6918-4728-acc6-6236916d8d08

roydahan commented 1 month ago

I'm trying to understand whether it's a scylla-bench issue; it looks like a gocql issue to me. @sylwiaszunejko / @dkropachev can you please take a look at this one?

fruch commented 1 month ago

> I'm trying to understand whether it's a scylla-bench issue; it looks like a gocql issue to me. @sylwiaszunejko / @dkropachev can you please take a look at this one?

It's probably because Scylla is slowing down; the internal queries might not have generous enough timeouts set up.

So, as always, it's a combination of a Scylla issue, how strict we want to be with timeouts, and how configurable those internal queries are.
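
If the problem is indeed internal queries racing a temporarily slow node, these are roughly the gocql knobs that would need to be exposed or tuned. This is a sketch with illustrative values, not what scylla-bench configures today, and whether the retry policy covers gocql's own control/system queries is exactly part of what would need checking:

```go
package main

import (
	"time"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("10.142.0.207", "10.142.0.236")
	// Retry individual queries with backoff instead of failing on the
	// first slow response (whether this applies to gocql's internal
	// control queries is the open question above).
	cluster.RetryPolicy = &gocql.ExponentialBackoffRetryPolicy{
		NumRetries: 5,
		Min:        100 * time.Millisecond,
		Max:        5 * time.Second,
	}
	// Keep trying to re-establish connections to hosts marked DOWN,
	// which is the symptom in the original report after the ALTER TABLE.
	cluster.ReconnectionPolicy = &gocql.ConstantReconnectionPolicy{
		MaxRetries: 15,
		Interval:   2 * time.Second,
	}
	_, _ = cluster.CreateSession()
}
```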