soyacz opened 1 year ago
Maybe `Timeout: 5s` isn't enough for this test case?
I'm not sure; the disconnections were sometimes persisting for 2 minutes. We would need to test it.
I tried with timeout settings like `-timeout 15s -retry-interval=80ms,5s -retry-number=20`, and it failed anyway.
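For reference, here is a minimal sketch (not scylla-bench's actual code; the flag-to-driver mapping is my assumption) of how flags like these would typically translate to gocql's `ClusterConfig`:

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("10.4.1.5", "10.4.2.90") // placeholder node addresses
	cluster.Timeout = 15 * time.Second                   // -timeout
	cluster.RetryPolicy = &gocql.ExponentialBackoffRetryPolicy{
		NumRetries: 20,                    // -retry-number
		Min:        80 * time.Millisecond, // -retry-interval lower bound
		Max:        5 * time.Second,       // -retry-interval upper bound
	}
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("unable to create session: %v", err)
	}
	defer session.Close()
}
```

Note that `Timeout` and `RetryPolicy` govern regular queries; whether they also apply to the driver's internal metadata queries is exactly what is in question in this thread.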
While running a large-partitions test I encountered a similar problem. Not sure if it's tied to this, but it's a possibility. After the pre-write workload, when starting one of the stress workloads, we got:
2022-12-08 21:08:42.623: (ScyllaBenchEvent Severity.CRITICAL) period_type=end event_id=a5a01a5d-ef1f-4c96-9836-7b6b23c0d77e duration=10s: node=Node longevity-large-partitions-4d-maste-loader-node-a967ab57-2 [34.249.171.113 | 10.4.2.108] (seed: False)
stress_cmd=scylla-bench -workload=uniform -mode=read -replication-factor=3 -partition-count=60 -clustering-row-count=10000000 -clustering-row-size=2048 -rows-per-request=2000 -timeout=180s -concurrency=700 -max-rate=64000 -duration=5760m -connection-count 500 -error-at-row-limit 1000 -nodes 10.4.1.5,10.4.2.90,10.4.2.71,10.4.1.191
errors:
Stress command completed with bad status 1: 2022/12/08 21:08:42 gocql: unable to create session: unable to fetch peer host info: Operation timed
Running the same job with a pinned version of scylla-bench (0.1.14) did not reproduce this issue. Similarly, a run without Raft did not fail at this point, so there might be some flakiness involved here.
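For context, the `unable to fetch peer host info` error is raised during gocql session initialization, when the driver's control connection fetches cluster metadata from `system.peers` (that is the query the log shows timing out). A minimal sketch exercising the same code path (node address is a placeholder):

```go
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("10.4.1.5") // placeholder: any live node

	// CreateSession opens a control connection and fetches cluster
	// metadata (system.local / system.peers). The timeout reported in
	// the logs happens inside this call, before any user query runs.
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("gocql: unable to create session: %v", err)
	}
	defer session.Close()
}
```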
Kernel Version: 5.15.0-1026-aws
Scylla version (or git commit hash): 5.2.0~dev-20221208.a076ceef97d5
with build-id 020ec076898a692651fd48edfb1920fc190cd81e
Cluster size: 4 nodes (i3en.3xlarge)
Scylla Nodes used in this run:
OS / Image: ami-063cdd564cd2fbe46
(aws: eu-west-1)
Test: longevity-large-partition-4days-test
Test id: a967ab57-4860-4f31-8b0a-d940b857542e
Test name: scylla-master/raft/longevity-large-partition-4days-test
Test config file(s):
$ hydra investigate show-monitor a967ab57-4860-4f31-8b0a-d940b857542e
$ hydra investigate show-logs a967ab57-4860-4f31-8b0a-d940b857542e
@avelanarius, we suspect there is a regression, or at least a behavior change, in how s-b works for us with a later (latest?) gocql driver. We are somewhat lost on how to debug it or how to make progress. Can you please help us, or advise how to debug it further?
scylla-bench failed with `unable to create session: unable to fetch peer host info` despite all nodes being up and healthy.
< t:2024-07-25 16:14:19,760 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > 2024/07/25 16:14:19 gocql: unable to create session: unable to fetch peer host info: Operation timed out for system.peers - received only 0 responses from 1 CL=ONE.
< t:2024-07-25 16:14:19,761 f:base.py l:146 c:RemoteLibSSH2CmdRunner p:ERROR > Error executing command: "sudo docker exec 5e5d3d02c589373354dd8ad087985ca17a7db44f6cd5f9a9d115641b82f41fb0 /bin/sh -c 'scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=750 -partition-offset=1251 -clustering-row-count=200000 -clustering-row-size=uniform:100..8192 -concurrency=10 -connection-count=10 -consistency-level=quorum -rows-per-request=10 -timeout=90s -iterations=0 -duration=720m -error-at-row-limit 1000 -nodes 10.142.0.207,10.142.0.236,10.142.0.240,10.142.0.242,10.142.0.248'"; Exit status: 1
< t:2024-07-25 16:14:19,761 f:base.py l:150 c:RemoteLibSSH2CmdRunner p:DEBUG > STDERR: 2024/07/25 16:14:19 gocql: unable to create session: unable to fetch peer host info: Operation timed out for system.peers - received only 0 responses from 1 CL=ONE.
< t:2024-07-25 16:14:19,763 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > 2024-07-25 16:14:19.761: (ScyllaBenchEvent Severity.ERROR) period_type=end event_id=53d8c0a0-2870-4453-b7e3-7df585f03411 during_nemesis=RunUniqueSequence duration=18s: node=Node longevity-large-partitions-200k-pks-loader-node-53145d7f-0-1 [35.196.217.128 | 10.142.0.250]
Scylla version: 2023.1.11-20240725.11a2022bd6ed
with build-id a0cab71f78c44bb0b694d46800fbcaef02607251
Kernel Version: 5.15.0-1065-gcp
Cluster size: 5 nodes (n2-highmem-16)
Scylla Nodes used in this run:
OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/6980420640571389317
(gce: undefined_region)
Test: longevity-large-partition-200k-pks-4days-gce-test
Test id: 53145d7f-6918-4728-acc6-6236916d8d08
Test name: enterprise-2023.1/longevity/longevity-large-partition-200k-pks-4days-gce-test
Test config file(s):
I'm trying to understand whether this is a scylla-bench issue; it looks like a gocql issue to me. @sylwiaszunejko / @dkropachev, can you please take a look at this one?
It's probably because Scylla is slowing down; the internal queries might not have large enough timeouts set up.
So, as always, it's a combination of a Scylla issue, how strict we want to be with timeouts, and how configurable those internal queries are.
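For what it's worth, a hedged sketch of the driver-level knobs involved (assuming plain gocql; whether the control connection's `system.peers` query honors them is the open question here):

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("10.142.0.207") // placeholder node address
	cluster.ConnectTimeout = 30 * time.Second   // dialing and handshake
	cluster.Timeout = 30 * time.Second          // per-request timeout
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("unable to create session: %v", err)
	}
	defer session.Close()
}
```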
Installation details
Kernel Version: 5.15.0-1026-aws
Scylla version (or git commit hash): 5.2.0~dev-20221209.6075e01312a5
with build-id 0e5d044b8f9e5bdf7f53cc3c1e959fab95bf027c
Cluster size: 9 nodes (i3.2xlarge)
Scylla Nodes used in this run:
OS / Image:
ami-0b85d6f35bddaff65 ami-0a1ff01b931943772 ami-08e5c2ae0089cade3
(aws: eu-west-1)
Test: longevity-counters-6h-multidc-test
Test id: 7785df01-a1fe-483a-beb7-2f63b9044b87
Test name: scylla-master/raft/longevity-counters-6h-multidc-test
Test config file(s):

Issue description
The counters test in the multidc scenario is failing persistently after altering the table, e.g. after running
ALTER TABLE scylla_bench.test_counters WITH bloom_filter_fp_chance = 0.45374057709882093
or ALTER TABLE scylla_bench.test_counters WITH read_repair_chance = 0.9;
or even ALTER TABLE scylla_bench.test_counters WITH comment = 'IHQS6RAYS5VQ6CQZYBYEX1GP';
After such changes, scylla-bench fails due to this error. The connection later looks recovered, so the connection issues are not permanent, but it is enough to fail the test critically and end the run.
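A hypothetical repro helper (my own sketch, not part of the test suite): issue the same kind of ALTER TABLE statements against the cluster while scylla-bench is running, to check whether the schema change alone triggers the disconnections:

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("10.4.1.5") // placeholder: any live node
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// The same kinds of ALTER statements the nemesis runs in this test.
	alters := []string{
		"ALTER TABLE scylla_bench.test_counters WITH bloom_filter_fp_chance = 0.45",
		"ALTER TABLE scylla_bench.test_counters WITH comment = 'repro'",
	}
	for _, stmt := range alters {
		if err := session.Query(stmt).Exec(); err != nil {
			log.Printf("alter failed: %v", err)
		}
		time.Sleep(5 * time.Second) // give the schema change time to propagate
	}
}
```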
$ hydra investigate show-monitor 7785df01-a1fe-483a-beb7-2f63b9044b87
$ hydra investigate show-logs 7785df01-a1fe-483a-beb7-2f63b9044b87
Logs:
| Date | Type | Link |
| --- | --- | --- |
| 20221209_161654 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_161654/grafana-screenshot-longevity-counters-6h-multidc-test-scylla-per-server-metrics-nemesis-20221209_161803-longevity-counters-multidc-master-monitor-node-7785df01-1.png |
| 20221209_161654 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_161654/grafana-screenshot-overview-20221209_161654-longevity-counters-multidc-master-monitor-node-7785df01-1.png |
| 20221209_162553 | db-cluster | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_162553/db-cluster-7785df01.tar.gz |
| 20221209_162553 | loader-set | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_162553/loader-set-7785df01.tar.gz |
| 20221209_162553 | monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_162553/monitor-set-7785df01.tar.gz |
| 20221209_162553 | sct | https://cloudius-jenkins-test.s3.amazonaws.com/7785df01-a1fe-483a-beb7-2f63b9044b87/20221209_162553/sct-runner-7785df01.tar.gz |
Jenkins job URL