soyacz closed this issue 2 years ago
Got it here too:
Kernel Version: 5.11.0-1028-aws
Scylla version (or git commit hash): 5.1.dev-20220303.3b5ba5c1a998
with build-id abc821aa82a471e4414ae67ec146cbffc9c8f194
Cluster size: 4 nodes (i3en.2xlarge)
Scylla running with shards number (live nodes): not collected (no resources left at the end of the run)
OS / Image: ami-04e93503987e389d7
(aws: eu-west-1)
Test: longevity-twcs-48h-test
Test id: 42f0a3aa-66c6-4358-9f87-0111cbdbeced
Test name: longevity_twcs_test/longevity-twcs-48h-test
Test config file(s):
$ hydra investigate show-monitor 42f0a3aa-66c6-4358-9f87-0111cbdbeced
$ hydra investigate show-logs 42f0a3aa-66c6-4358-9f87-0111cbdbeced
Is it the read or the write command that is timing out?
Kernel Version: 5.15.0-1015-aws
Scylla version (or git commit hash): 5.1.0~dev-20220726.29c28dcb0c33
with build-id f0f40bc93cc45be63928bbe9eaf674885347ba58
Cluster size: 4 nodes (i3en.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0c90fd2f4dcfdd273
(aws: eu-west-1)
Test: longevity-twcs-48h-test
Test id: ad6fa570-f155-48da-b448-ac20822b0e95
Test name: scylla-master/longevity/longevity-twcs-48h-test
Test config file(s):
The issue seems to have happened again this week. The errors started to appear around 07-29 10:55 and continued throughout the rest of the 48h run. During this whole time the cluster seemed healthy (all nodes were consistently UN, disruptive nemeses aside).
The failing bench thread is a read thread; its log file is scylla-bench-l0-d0d6ea3e-266f-4e0a-a9eb-d4bbd16a679c.log
$ hydra investigate show-monitor ad6fa570-f155-48da-b448-ac20822b0e95
$ hydra investigate show-logs ad6fa570-f155-48da-b448-ac20822b0e95
This is a known and expected behaviour of scylla-bench at the moment. We are trying to add a retry mechanism to scylla-bench that should let us know when the timeouts should concern us and when they should not.
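For illustration, a minimal sketch of what such a retry wrapper around the gocql driver could look like. This is not the actual scylla-bench implementation; the function name, attempt count and linear backoff policy are assumptions made for this example only:

```go
package bench

import (
	"time"

	"github.com/gocql/gocql"
)

// executeWithRetry is a hypothetical sketch of a retry mechanism: transient
// coordinator timeouts are retried with backoff, and only timeouts that
// persist across all attempts are surfaced as failures.
func executeWithRetry(q *gocql.Query, maxAttempts int, backoff time.Duration) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = q.Exec(); err == nil {
			return nil // the timeout (if any) was transient and need not concern us
		}
		switch err.(type) {
		case *gocql.RequestErrReadTimeout, *gocql.RequestErrWriteTimeout:
			// Coordinator-side timeout: back off and try again.
			time.Sleep(backoff * time.Duration(attempt))
		default:
			return err // non-timeout errors are reported immediately
		}
	}
	return err
}
```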
When running the long TWCS longevity test, the scylla-bench (s-b) load drops and eventually fails due to too many errors.
2022/02/04 10:26:10 Operation timed out for scylla_bench.test - received only 0 responses from 2 CL=QUORUM.
While we know why it fails (too many errors), it's not clear why these errors appear in the first place.
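The quoted message is the coordinator's read-timeout error as surfaced through the gocql driver: the coordinator needed 2 replica responses for CL=QUORUM but received 0 before its timeout fired. A hedged sketch (not scylla-bench code; the helper name is made up) of how a caller could tell read from write timeouts and log the replica counts behind such a message:

```go
package bench

import (
	"log"

	"github.com/gocql/gocql"
)

// classifyTimeout is a hypothetical helper showing how the gocql error behind
// a message like "received only 0 responses from 2 CL=QUORUM" can be
// inspected: it corresponds to Received=0, BlockFor=2 on a read timeout.
func classifyTimeout(err error) {
	switch e := err.(type) {
	case *gocql.RequestErrReadTimeout:
		// Server-side read timeout: the coordinator waited for BlockFor
		// replicas but only Received answered in time.
		log.Printf("read timeout: %d/%d replicas responded at CL=%s",
			e.Received, e.BlockFor, e.Consistency)
	case *gocql.RequestErrWriteTimeout:
		log.Printf("write timeout: %d/%d replicas acked at CL=%s (write type %s)",
			e.Received, e.BlockFor, e.Consistency, e.WriteType)
	default:
		log.Printf("not a coordinator timeout: %v", err)
	}
}
```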
Restore commands:
Restore Monitor Stack command: $ hydra investigate show-monitor cc5e0f99-50a8-4f7a-b81b-64032b2429a0
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs cc5e0f99-50a8-4f7a-b81b-64032b2429a0
Logs:
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/cc5e0f99-50a8-4f7a-b81b-64032b2429a0/20220204_162208/grafana-screenshot-longevity-twcs-48h-test-scylla-per-server-metrics-nemesis-20220204_162427-longevity-twcs-48h-master-monitor-node-cc5e0f99-1.png
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/cc5e0f99-50a8-4f7a-b81b-64032b2429a0/20220204_162208/grafana-screenshot-overview-20220204_162208-longevity-twcs-48h-master-monitor-node-cc5e0f99-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/cc5e0f99-50a8-4f7a-b81b-64032b2429a0/20220204_163847/db-cluster-cc5e0f99.tar.gz
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/cc5e0f99-50a8-4f7a-b81b-64032b2429a0/20220204_163847/loader-set-cc5e0f99.tar.gz
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/cc5e0f99-50a8-4f7a-b81b-64032b2429a0/20220204_163847/monitor-set-cc5e0f99.tar.gz
sct - https://cloudius-jenkins-test.s3.amazonaws.com/cc5e0f99-50a8-4f7a-b81b-64032b2429a0/20220204_163847/sct-runner-cc5e0f99.tar.gz
Links:
Build URL
Download "Overview metrics" Grafana Screenshot
Download "Per server metrics nemesis" Grafana Screenshot