scylladb / scylla-bench


Timeout errors - received only 0 responses from 2 CL=QUORUM. #91

Closed: soyacz closed this issue 2 years ago

soyacz commented 2 years ago

When running the long longevity test (TWCS), the scylla-bench (s-b) load drops and the run eventually fails due to too many errors.

2022/02/04 10:26:10 Operation timed out for scylla_bench.test - received only 0 responses from 2 CL=QUORUM.

While we know why the run fails (too many errors), it's not clear why we see these timeout errors in the first place.
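To tell how often these timeouts occur, the log line quoted above can be broken into its fields. The following is a minimal sketch, not part of scylla-bench: the names `timeoutInfo` and `parseTimeout` are hypothetical, and reading the second number ("from 2") as the count of replica responses the coordinator was waiting for is an assumption based on the wording of the message.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// timeoutRe matches the timeout message as it appears in the log above.
var timeoutRe = regexp.MustCompile(
	`Operation timed out for ([^.\s]+)\.(\S+) - received only (\d+) responses from (\d+) CL=(\w+)`)

// timeoutInfo holds the fields of one timeout line (hypothetical type).
type timeoutInfo struct {
	Keyspace string
	Table    string
	Received int    // responses that arrived before the timeout
	Expected int    // assumed: responses that were being waited for
	CL       string // consistency level, e.g. QUORUM
}

// parseTimeout extracts the fields from a log line.
// It returns false when the line is not a timeout message.
func parseTimeout(line string) (timeoutInfo, bool) {
	m := timeoutRe.FindStringSubmatch(line)
	if m == nil {
		return timeoutInfo{}, false
	}
	received, err := strconv.Atoi(m[3])
	if err != nil {
		return timeoutInfo{}, false
	}
	expected, err := strconv.Atoi(m[4])
	if err != nil {
		return timeoutInfo{}, false
	}
	return timeoutInfo{
		Keyspace: m[1],
		Table:    m[2],
		Received: received,
		Expected: expected,
		CL:       m[5],
	}, true
}

func main() {
	line := "2022/02/04 10:26:10 Operation timed out for scylla_bench.test - " +
		"received only 0 responses from 2 CL=QUORUM."
	info, ok := parseTimeout(line)
	fmt.Printf("%+v matched=%v\n", info, ok)
}
```

Counting parsed lines per minute from the loader logs would show whether the timeouts cluster around specific events or are spread over the whole run.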

Restore commands:

Restore Monitor Stack command: $ hydra investigate show-monitor cc5e0f99-50a8-4f7a-b81b-64032b2429a0

Restore monitor on AWS instance using Jenkins job

Show all stored logs command: $ hydra investigate show-logs cc5e0f99-50a8-4f7a-b81b-64032b2429a0

Logs:

grafana - https://cloudius-jenkins-test.s3.amazonaws.com/cc5e0f99-50a8-4f7a-b81b-64032b2429a0/20220204_162208/grafana-screenshot-longevity-twcs-48h-test-scylla-per-server-metrics-nemesis-20220204_162427-longevity-twcs-48h-master-monitor-node-cc5e0f99-1.png

grafana - https://cloudius-jenkins-test.s3.amazonaws.com/cc5e0f99-50a8-4f7a-b81b-64032b2429a0/20220204_162208/grafana-screenshot-overview-20220204_162208-longevity-twcs-48h-master-monitor-node-cc5e0f99-1.png

db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/cc5e0f99-50a8-4f7a-b81b-64032b2429a0/20220204_163847/db-cluster-cc5e0f99.tar.gz

loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/cc5e0f99-50a8-4f7a-b81b-64032b2429a0/20220204_163847/loader-set-cc5e0f99.tar.gz

monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/cc5e0f99-50a8-4f7a-b81b-64032b2429a0/20220204_163847/monitor-set-cc5e0f99.tar.gz

sct - https://cloudius-jenkins-test.s3.amazonaws.com/cc5e0f99-50a8-4f7a-b81b-64032b2429a0/20220204_163847/sct-runner-cc5e0f99.tar.gz

Links: Build URL Download "Overview metrics" Grafana Screenshot Download "Per server metrics nemesis" Grafana Screenshot

enaydanov commented 2 years ago

Got it here too:

Installation details

Kernel Version: 5.11.0-1028-aws

Scylla version (or git commit hash): 5.1.dev-20220303.3b5ba5c1a998 with build-id abc821aa82a471e4414ae67ec146cbffc9c8f194

Cluster size: 4 nodes (i3en.2xlarge)

Scylla running with shards number (live nodes): No resources left at the end of the run

OS / Image: ami-04e93503987e389d7 (aws: eu-west-1)

Test: longevity-twcs-48h-test

Test id: 42f0a3aa-66c6-4358-9f87-0111cbdbeced

Test name: longevity_twcs_test/longevity-twcs-48h-test

Test config file(s):

Commands

Logs:

Jenkins job URL

fgelcer commented 2 years ago

Is it a read or a write command that is timing out?

ShlomiBalalis commented 2 years ago

Installation details

Kernel Version: 5.15.0-1015-aws

Scylla version (or git commit hash): 5.1.0~dev-20220726.29c28dcb0c33 with build-id f0f40bc93cc45be63928bbe9eaf674885347ba58

Cluster size: 4 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0c90fd2f4dcfdd273 (aws: eu-west-1)

Test: longevity-twcs-48h-test

Test id: ad6fa570-f155-48da-b448-ac20822b0e95

Test name: scylla-master/longevity/longevity-twcs-48h-test

Test config file(s):

Issue description

The issue seems to have happened again this week. The error started to appear around 07-29 10:55 and continued throughout the rest of the 48h run. During this whole time the cluster seemed healthy (every node was consistently UN, disruptive nemeses aside). The affected scylla-bench thread is a read thread; its log is scylla-bench-l0-d0d6ea3e-266f-4e0a-a9eb-d4bbd16a679c.log

Logs:

Jenkins job URL

roydahan commented 2 years ago

This is known and expected behaviour of scylla-bench at the moment. We are trying to add a retry mechanism to scylla-bench that should let us know when the timeouts should concern us and when they should not.