Timeout errors - received only 0 responses from 2 CL=QUORUM.

soyacz commented 2 years ago

When running the long longevity test (TWCS), the s-b load drops and eventually fails due to too many errors.

2022/02/04 10:26:10 Operation timed out for scylla_bench.test - received only 0 responses from 2 CL=QUORUM.
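To see when the timeouts start and how dense they are, a loader log can be scanned for lines like the one above. The following is only a sketch: the regular expression is modeled on that single quoted line, and the helper names `parse_timeout` and `timeouts_per_minute` are hypothetical, not part of scylla-bench or SCT.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern modeled on the single log line quoted above (assumption: other
# scylla-bench timeout messages follow the same layout).
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"Operation timed out for (?P<table>\S+) - "
    r"received only (?P<received>\d+) responses from (?P<required>\d+) "
    r"CL=(?P<cl>\w+)\.?$"
)


def parse_timeout(line):
    """Return the fields of a timeout line as a dict, or None if it does not match."""
    m = TIMEOUT_RE.match(line.strip())
    if m is None:
        return None
    return {
        "time": datetime.strptime(m["ts"], "%Y/%m/%d %H:%M:%S"),
        "table": m["table"],
        "received": int(m["received"]),
        "required": int(m["required"]),
        "cl": m["cl"],
    }


def timeouts_per_minute(lines):
    """Count matching timeout lines per minute; non-matching lines are ignored."""
    counts = Counter()
    for line in lines:
        parsed = parse_timeout(line)
        if parsed is not None:
            counts[parsed["time"].replace(second=0)] += 1
    return counts
```

Feeding the lines of a loader log file to `timeouts_per_minute` gives a per-minute histogram that can be lined up with the nemesis timeline in Grafana.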

While we know why it fails, it's not clear why we see these errors.

Restore commands:

Restore Monitor Stack command: $ hydra investigate show-monitor cc5e0f99-50a8-4f7a-b81b-64032b2429a0
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs cc5e0f99-50a8-4f7a-b81b-64032b2429a0

Logs:

grafana - https://cloudius-jenkins-test.s3.amazonaws.com/cc5e0f99-50a8-4f7a-b81b-64032b2429a0/20220204_162208/grafana-screenshot-longevity-twcs-48h-test-scylla-per-server-metrics-nemesis-20220204_162427-longevity-twcs-48h-master-monitor-node-cc5e0f99-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/cc5e0f99-50a8-4f7a-b81b-64032b2429a0/20220204_162208/grafana-screenshot-overview-20220204_162208-longevity-twcs-48h-master-monitor-node-cc5e0f99-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/cc5e0f99-50a8-4f7a-b81b-64032b2429a0/20220204_163847/db-cluster-cc5e0f99.tar.gz
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/cc5e0f99-50a8-4f7a-b81b-64032b2429a0/20220204_163847/loader-set-cc5e0f99.tar.gz
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/cc5e0f99-50a8-4f7a-b81b-64032b2429a0/20220204_163847/monitor-set-cc5e0f99.tar.gz
sct - https://cloudius-jenkins-test.s3.amazonaws.com/cc5e0f99-50a8-4f7a-b81b-64032b2429a0/20220204_163847/sct-runner-cc5e0f99.tar.gz

Links:

Build URL
Download "Overview metrics" Grafana Screenshot
Download "Per server metrics nemesis" Grafana Screenshot

enaydanov commented 2 years ago

Got it here too:

Installation details

Kernel Version: 5.11.0-1028-aws

Scylla version (or git commit hash): 5.1.dev-20220303.3b5ba5c1a998 with build-id abc821aa82a471e4414ae67ec146cbffc9c8f194

Cluster size: 4 nodes (i3en.2xlarge)

Scylla running with shards number (live nodes): No resources left at the end of the run

OS / Image: ami-04e93503987e389d7 (aws: eu-west-1)

Test: longevity-twcs-48h-test

Test id: 42f0a3aa-66c6-4358-9f87-0111cbdbeced

Test name: longevity_twcs_test/longevity-twcs-48h-test

Test config file(s):

longevity-twcs-48h.yaml

Commands

Restore Monitor Stack command: $ hydra investigate show-monitor 42f0a3aa-66c6-4358-9f87-0111cbdbeced
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs 42f0a3aa-66c6-4358-9f87-0111cbdbeced

Logs:

db-cluster-42f0a3aa.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/42f0a3aa-66c6-4358-9f87-0111cbdbeced/20220304_090518/db-cluster-42f0a3aa.tar.gz
monitor-set-42f0a3aa.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/42f0a3aa-66c6-4358-9f87-0111cbdbeced/20220304_090518/monitor-set-42f0a3aa.tar.gz
loader-set-42f0a3aa.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/42f0a3aa-66c6-4358-9f87-0111cbdbeced/20220304_090518/loader-set-42f0a3aa.tar.gz
sct-runner-42f0a3aa.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/42f0a3aa-66c6-4358-9f87-0111cbdbeced/20220304_090518/sct-runner-42f0a3aa.tar.gz

Jenkins job URL

fgelcer commented 2 years ago

Is it a read or a write command that is timing out?

ShlomiBalalis commented 2 years ago

Installation details

Kernel Version: 5.15.0-1015-aws

Scylla version (or git commit hash): 5.1.0~dev-20220726.29c28dcb0c33 with build-id f0f40bc93cc45be63928bbe9eaf674885347ba58

Cluster size: 4 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

longevity-twcs-48h-master-db-node-ad6fa570-9 (34.244.10.209 | 10.4.2.215) (shards: 7)
longevity-twcs-48h-master-db-node-ad6fa570-8 (54.229.28.162 | 10.4.3.120) (shards: 7)
longevity-twcs-48h-master-db-node-ad6fa570-7 (54.77.23.82 | 10.4.2.44) (shards: 7)
longevity-twcs-48h-master-db-node-ad6fa570-6 (54.77.247.137 | 10.4.3.46) (shards: 7)
longevity-twcs-48h-master-db-node-ad6fa570-5 (34.242.35.233 | 10.4.0.248) (shards: 7)
longevity-twcs-48h-master-db-node-ad6fa570-4 (3.250.226.153 | 10.4.0.201) (shards: 7)
longevity-twcs-48h-master-db-node-ad6fa570-3 (52.214.64.218 | 10.4.2.54) (shards: 7)
longevity-twcs-48h-master-db-node-ad6fa570-2 (52.31.27.25 | 10.4.2.156) (shards: 7)
longevity-twcs-48h-master-db-node-ad6fa570-10 (63.32.91.84 | 10.4.1.184) (shards: 7)
longevity-twcs-48h-master-db-node-ad6fa570-1 (3.250.50.84 | 10.4.2.1) (shards: 7)

OS / Image: ami-0c90fd2f4dcfdd273 (aws: eu-west-1)

Test: longevity-twcs-48h-test

Test id: ad6fa570-f155-48da-b448-ac20822b0e95

Test name: scylla-master/longevity/longevity-twcs-48h-test

Test config file(s):

longevity-twcs-48h.yaml

Issue description

The issue seems to have happened again this week. The error started to appear around 07-29 10:55 and continued throughout the rest of the 48h run. During this whole time the cluster seemed healthy (every node was consistently UN, disruptive nemeses aside). The bench thread is a read thread; its log file is scylla-bench-l0-d0d6ea3e-266f-4e0a-a9eb-d4bbd16a679c.log

Restore Monitor Stack command: $ hydra investigate show-monitor ad6fa570-f155-48da-b448-ac20822b0e95
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs ad6fa570-f155-48da-b448-ac20822b0e95

Logs:

db-cluster-ad6fa570.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/ad6fa570-f155-48da-b448-ac20822b0e95/20220731_041723/db-cluster-ad6fa570.tar.gz
monitor-set-ad6fa570.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/ad6fa570-f155-48da-b448-ac20822b0e95/20220731_041723/monitor-set-ad6fa570.tar.gz
loader-set-ad6fa570.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/ad6fa570-f155-48da-b448-ac20822b0e95/20220731_041723/loader-set-ad6fa570.tar.gz
sct-runner-ad6fa570.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/ad6fa570-f155-48da-b448-ac20822b0e95/20220731_041723/sct-runner-ad6fa570.tar.gz

Jenkins job URL

roydahan commented 2 years ago

This is known and expected behaviour of scylla-bench at the moment. We are trying to add a retry mechanism to scylla-bench that should tell us which timeouts should concern us and which should not.
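As an illustration of the idea only, and not the actual scylla-bench implementation, a bounded retry with exponential backoff could look like the sketch below. The names `OperationTimeout` and `run_with_retries`, and the default retry count and delay, are all hypothetical. A timeout that succeeds on retry is reported through the returned retry count, while one that exhausts every attempt is re-raised as a real failure.

```python
import time


class OperationTimeout(Exception):
    """Hypothetical stand-in for a driver-level timeout error."""


def run_with_retries(operation, max_retries=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on OperationTimeout with exponential backoff.

    Returns (result, retries_used). If the initial call and all retries
    time out, the last OperationTimeout is re-raised to the caller.
    """
    for attempt in range(max_retries + 1):
        try:
            return operation(), attempt
        except OperationTimeout:
            if attempt == max_retries:
                raise  # retries exhausted: this timeout should concern us
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between attempts.
            sleep(base_delay * (2 ** attempt))
```

With such a wrapper, a load tool could count recovered timeouts separately from operations that failed after all retries, and fail the test only on the latter.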

scylladb / scylla-bench