scylladb / scylla-bench


validate shouldn't be truncating by default #130

Open fruch opened 8 months ago

fruch commented 8 months ago

Issue description

When validate is used from multiple scylla-bench (s-b) processes at once, the truncate can easily time out:

2023-12-31 16:21:30.473: (ScyllaBenchEvent Severity.ERROR) period_type=end event_id=14b729f7-4ba2-42be-a1a8-8e77680417ae during_nemesis=EnableDisableTableEncryptionAwsKmsProviderWithRotation duration=1m0s: node=Node longevity-tls-1tb-7d-2024-1-loader-node-7b483215-2 [34.200.245.124 | 10.12.1.81] (seed: False)
stress_cmd=scylla-bench -mode=write -workload=sequential -consistency-level=all -replication-factor=3 -partition-count=50 -clustering-row-count=100 -clustering-row-size=uniform:75..125 -keyspace keyspace1 -table tmp_encrypted_table -timeout=120s -validate-data -tls  -username cassandra -password cassandra -error-at-row-limit 1000 -nodes 10.12.2.154,10.12.1.123,10.12.0.220,10.12.1.51
errors:

Stress command completed with bad status 1: 2023/12/31 16:21:30 Error during truncate: seastar::rpc::timeout_error (rpc call timed out)
2023-12-31 16:21:30.885: (ScyllaBenchEvent Severity.ERROR) period_type=end event_id=b5a902b6-0a3e-4766-b1b1-56cd7b14ce82 during_nemesis=EnableDisableTableEncryptionAwsKmsProviderWithRotation duration=1m2s: node=Node longevity-tls-1tb-7d-2024-1-loader-node-7b483215-1 [34.205.89.231 | 10.12.0.86] (seed: False)
stress_cmd=scylla-bench -mode=write -workload=sequential -consistency-level=all -replication-factor=3 -partition-count=50 -clustering-row-count=100 -clustering-row-size=uniform:75..125 -keyspace keyspace1 -table tmp_encrypted_table -timeout=120s -validate-data -tls  -username cassandra -password cassandra -error-at-row-limit 1000 -nodes 10.12.2.154,10.12.1.123,10.12.0.220,10.12.1.51
errors:

Stress command completed with bad status 1: 2023/12/31 16:21:30 Error during truncate: seastar::rpc::timeout_error (rpc call timed out)
2023-12-31 16:21:38.450: (ScyllaBenchEvent Severity.ERROR) period_type=end event_id=f9e1a1c7-ce29-4d84-a950-5eadd0dcc4d9 during_nemesis=EnableDisableTableEncryptionAwsKmsProviderWithRotation duration=1m0s: node=Node longevity-tls-1tb-7d-2024-1-loader-node-7b483215-3 [3.93.35.150 | 10.12.1.245] (seed: False)
stress_cmd=scylla-bench -mode=write -workload=sequential -consistency-level=all -replication-factor=3 -partition-count=50 -clustering-row-count=100 -clustering-row-size=uniform:75..125 -keyspace keyspace1 -table tmp_encrypted_table -timeout=120s -validate-data -tls  -username cassandra -password cassandra -error-at-row-limit 1000 -nodes 10.12.2.154,10.12.1.123,10.12.0.220,10.12.1.51
errors:

Stress command completed with bad status 1: 2023/12/31 16:21:38 Error during truncate: seastar::rpc::timeout_error (rpc call timed out)
2023-12-31 16:21:38.904: (ScyllaBenchEvent Severity.ERROR) period_type=end event_id=a0f2d03b-4474-4729-bae3-60b8219d45b8 during_nemesis=EnableDisableTableEncryptionAwsKmsProviderWithRotation duration=1m0s: node=Node longevity-tls-1tb-7d-2024-1-loader-node-7b483215-4 [35.175.115.64 | 10.12.2.93] (seed: False)
stress_cmd=scylla-bench -mode=write -workload=sequential -consistency-level=all -replication-factor=3 -partition-count=50 -clustering-row-count=100 -clustering-row-size=uniform:75..125 -keyspace keyspace1 -table tmp_encrypted_table -timeout=120s -validate-data -tls  -username cassandra -password cassandra -error-at-row-limit 1000 -nodes 10.12.2.154,10.12.1.123,10.12.0.220,10.12.1.51
errors:

Stress command completed with bad status 1: 2023/12/31 16:21:38 Error during truncate: seastar::rpc::timeout_error (rpc call timed out)

Impact

In cases where we have quite a high load, these commands tend to fail, and they fail the nemeses that use them.

How frequently does it reproduce?

It happens quite a lot, in multiple cases (as we use s-b more and more).

Installation details

Kernel Version: 5.15.0-1051-aws
Scylla version (or git commit hash): 2024.1.0~rc2-20231217.f57117d9cfe3 with build-id 3a4d2dfe8ef4eef5454badb34d1710a5f36a859c

Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0514d67e24a1dae78 (aws: undefined_region)

Test: longevity-1tb-5days-test
Test id: 7b483215-4589-4363-930d-8fdc839f6a95
Test name: enterprise-2024.1/longevity/longevity-1tb-5days-test
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 7b483215-4589-4363-930d-8fdc839f6a95`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=7b483215-4589-4363-930d-8fdc839f6a95)
- Show all stored logs command: `$ hydra investigate show-logs 7b483215-4589-4363-930d-8fdc839f6a95`

Logs:

- **db-cluster-7b483215.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7b483215-4589-4363-930d-8fdc839f6a95/20240105_081711/db-cluster-7b483215.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7b483215-4589-4363-930d-8fdc839f6a95/20240105_081711/db-cluster-7b483215.tar.gz)
- **sct-runner-events-7b483215.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7b483215-4589-4363-930d-8fdc839f6a95/20240105_081711/sct-runner-events-7b483215.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7b483215-4589-4363-930d-8fdc839f6a95/20240105_081711/sct-runner-events-7b483215.tar.gz)
- **2023_12_31__12_31_42_712.sct-7b483215.log.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7b483215-4589-4363-930d-8fdc839f6a95/20240105_081711/2023_12_31__12_31_42_712.sct-7b483215.log.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7b483215-4589-4363-930d-8fdc839f6a95/20240105_081711/2023_12_31__12_31_42_712.sct-7b483215.log.gz)
- **2024_01_02__19_22_45_706.sct-7b483215.log.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7b483215-4589-4363-930d-8fdc839f6a95/20240105_081711/2024_01_02__19_22_45_706.sct-7b483215.log.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7b483215-4589-4363-930d-8fdc839f6a95/20240105_081711/2024_01_02__19_22_45_706.sct-7b483215.log.gz)
- **loader-set-7b483215.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7b483215-4589-4363-930d-8fdc839f6a95/20240105_081711/loader-set-7b483215.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7b483215-4589-4363-930d-8fdc839f6a95/20240105_081711/loader-set-7b483215.tar.gz)
- **monitor-set-7b483215.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/7b483215-4589-4363-930d-8fdc839f6a95/20240105_081711/monitor-set-7b483215.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/7b483215-4589-4363-930d-8fdc839f6a95/20240105_081711/monitor-set-7b483215.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2024.1/job/longevity/job/longevity-1tb-5days-test/3/)
[Argus](https://argus.scylladb.com/test/eb44e4c2-8b8b-4365-9efb-245d9fd5c196/runs?additionalRuns[]=7b483215-4589-4363-930d-8fdc839f6a95)
mykaul commented 8 months ago

I assume you refer to https://github.com/scylladb/scylla-bench/blob/12020ff90cacd0dbfabebfe488029978429f3efa/main.go#L125

fruch commented 8 months ago

> I assume you refer to https://github.com/scylladb/scylla-bench/blob/12020ff90cacd0dbfabebfe488029978429f3efa/main.go#L125

Yes. Doing this at the same time from multiple processes, on a cluster that is already under substantial load, is going to be slow and a waste of time.

If a user wants to run validation, it's on them to be aware that this table needs to be truncated beforehand. In any case, truncation should happen only if the user asks for it, and even then it needs a longer timeout.
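
As a rough sketch of that direction, the truncate could be gated behind an explicit opt-in flag and given its own, more generous client-side timeout. The flag names (`-truncate-table`, `-truncate-timeout`) and the `maybeTruncate` helper below are hypothetical, not part of the current scylla-bench CLI:

```go
// Minimal sketch, assuming a new opt-in flag and a dedicated timeout; these
// flag names and this helper do not exist in scylla-bench today.
package main

import (
	"flag"
	"log"
	"time"

	"github.com/gocql/gocql"
)

var (
	truncateTable   = flag.Bool("truncate-table", false, "truncate the table before a validation run (off by default)")
	truncateTimeout = flag.Duration("truncate-timeout", 5*time.Minute, "client-side timeout used only for the TRUNCATE statement")
)

// maybeTruncate runs TRUNCATE only when explicitly requested, on a dedicated
// session whose request timeout is long enough for a cluster-wide truncate
// on a loaded cluster.
func maybeTruncate(cluster *gocql.ClusterConfig, keyspace, table string) {
	if !*truncateTable {
		return
	}
	truncateCluster := *cluster                // copy the shared config
	truncateCluster.Timeout = *truncateTimeout // longer timeout just for this session
	session, err := truncateCluster.CreateSession()
	if err != nil {
		log.Fatal("Error creating session for truncate: ", err)
	}
	defer session.Close()

	if err := session.Query("TRUNCATE TABLE " + keyspace + "." + table).Exec(); err != nil {
		log.Fatal("Error during truncate: ", err)
	}
}

func main() {
	flag.Parse()
	cluster := gocql.NewCluster("127.0.0.1")
	maybeTruncate(cluster, "keyspace1", "test_table")
}
```

Using a separate session for the truncate keeps the longer timeout from leaking into the rest of the run.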

And in general, all the queries in PrepareDatabase run without retries and with the default timeout, which might not be enough.
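
One possible shape for that, assuming a hypothetical `execWithRetry` helper (not an existing scylla-bench function), is to wrap the schema-setup statements in a small retry loop with backoff and a more generous session timeout:

```go
// Rough sketch: retry schema-setup queries with exponential backoff instead of
// failing on the first timeout. execWithRetry is hypothetical; the statement in
// main only illustrates the kind of query PrepareDatabase issues.
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

// execWithRetry retries a statement a few times with exponential backoff,
// smoothing over transient timeouts while the cluster is under load.
func execWithRetry(session *gocql.Session, stmt string, attempts int) error {
	backoff := time.Second
	var err error
	for i := 0; i < attempts; i++ {
		if err = session.Query(stmt).Exec(); err == nil {
			return nil
		}
		log.Printf("query failed (attempt %d/%d): %v", i+1, attempts, err)
		time.Sleep(backoff)
		backoff *= 2
	}
	return err
}

func main() {
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Timeout = 30 * time.Second // more generous than the library default
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	stmt := "CREATE KEYSPACE IF NOT EXISTS keyspace1 WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}"
	if err := execWithRetry(session, stmt, 5); err != nil {
		log.Fatal(err)
	}
}
```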

juliayakovlev commented 2 months ago

https://github.com/scylladb/scylla-bench/issues/30