The issue is reproduced in a subsequent run:
yarongilor@yarongilor:~/Downloads/logs/sct-runner-8685a104$ grep 'Count result:' sct.log | grep 50000 -c
1597
yarongilor@yarongilor:~/Downloads/logs/sct-runner-8685a104$ grep 'Count result:' sct.log | grep -v 50000 -c
1
yarongilor@yarongilor:~/Downloads/logs/sct-runner-8685a104$ grep 'Count result:' sct.log | grep -v 50000
< t:2022-12-01 14:04:11,048 f:tester.py l:2475 c:LargePartitionLongevityTest p:DEBUG > Count result: ['', 'count', '-------', '46000', '', '(1 rows)']
Look at the following logs:
Configuration
Mode: write
Workload: sequential
Timeout: 3m0s
Retries:
number: 10
min interval: 80ms
max interval: 1s
Consistency level: quorum
Partition count: 100
Partition offset: 401
Clustering rows: 50000
Clustering row size: Uniform(min=100, max=5120)
Rows per request: 2000
Page size: 1000
Concurrency: 250
Connections: 150
Maximum rate: unlimited
Client compression: true
Hdr memory consumption: 122948000 bytes
And the following non-50k row-count results:
< t:2022-11-30 10:20:04,699 > Count result: ['', 'count', '-------', '26000', '', '(1 rows)']
< t:2022-11-30 10:21:25,742 > Count result: ['', 'count', '-------', '48000', '', '(1 rows)']
< t:2022-11-30 10:23:37,900 > Count result: ['', 'count', '-------', '46000', '', '(1 rows)']
< t:2022-11-30 10:26:03,296 > Count result: ['', 'count', '-------', '24000', '', '(1 rows)']
< t:2022-11-30 10:27:03,741 > Count result: ['', 'count', '-------', '40000', '', '(1 rows)']
< t:2022-11-30 10:27:13,169 > Count result: ['', 'count', '-------', '38000', '', '(1 rows)']
We can see that the missing data is always a multiple of 2k, which is the number of rows per request. Summing the shortfalls, 39 different requests were not successful.
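To make that count explicit: each partition's shortfall divided by the 2,000 rows per request gives the number of missing requests, and summing over the six partitions above yields 39. A quick sanity check (the numbers are taken from the log lines above):

```python
# Shortfall per partition, in requests: 12, 1, 2, 13, 5, 6.
rows_per_request = 2000
expected_rows = 50_000
observed_rows = [26_000, 48_000, 46_000, 24_000, 40_000, 38_000]

missing_requests = sum(
    (expected_rows - n) // rows_per_request for n in observed_rows
)
print(missing_requests)  # -> 39
```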
As a first step, I propose setting the consistency level to ALL instead of QUORUM for the 'prepare write' load. I expect all counts to then reach 50k.
@vponomaryov right, but if the run_cqlsh queries use quorum there should be no problem.
Perhaps the issue is that the CL is not 'quorum' in the cqlsh `select count(*)` queries.
- Default CL of cqlsh is ONE.
great, this supports the above estimation (see the consistency arithmetic after this exchange)
- Why are you running it with cqlsh and not the cql driver?
It's old code from a long time ago; it probably always worked OK with the s-b versions used in the past.
- I don't think the problem is with the query, but if you have a reproducer and a specific partition number, it would be good to query it specifically when the test reaches this point.
I'll run a reproducer with a QUORUM query using the driver: https://jenkins.scylladb.com/job/scylla-staging/job/Longevity_yaron/job/longevity-large-partition-4days-test-rq2/16/
- Check the timeouts / failed-writes metrics during the prepare to verify @vponomaryov's theory.
(oops, perhaps I forgot to mention) I'm not sure about that because, IIRC, at the end of the test this query did get the right number of rows (that's why the test got the validation error).
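Putting the two consistency levels together, this matches the standard overlap rule: a read is only guaranteed to observe an acknowledged write when the read and write replica counts together exceed the replication factor. A minimal sketch, assuming RF=3 (an assumption; the test's actual RF may differ):

```python
# The standard R + W > RF overlap rule; RF = 3 is an assumption here.
RF = 3
ONE = 1
QUORUM = RF // 2 + 1  # 2 when RF = 3
ALL = RF

# The prepare load wrote at QUORUM, while cqlsh counted at its default CL of ONE:
print(QUORUM + ONE > RF)     # False -> a ONE read may miss acknowledged writes
# Reading back at QUORUM (or writing at ALL) restores the guarantee:
print(QUORUM + QUORUM > RF)  # True
print(ALL + ONE > RF)        # True
```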
The issue is not reproduced using a driver query with a quorum CL:
ubuntu@ip-10-4-0-134:~/sct-results/latest$ grep 'Count result:' sct.log | grep -v 50000
ubuntu@ip-10-4-0-134:~/sct-results/latest$ grep 'Count result:' sct.log | grep -c 50000
799
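For reference, a minimal sketch of that kind of driver-side QUORUM count, using the python cassandra-driver; the contact point, keyspace, and table/column names below are illustrative assumptions, not the test's actual code:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Contact point and keyspace are illustrative assumptions.
cluster = Cluster(["10.4.0.134"])
session = cluster.connect("scylla_bench")

# Count one partition's rows at QUORUM rather than cqlsh's default CL of ONE.
stmt = SimpleStatement(
    "SELECT COUNT(*) FROM test WHERE pk = %s",  # table/column names assumed
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(stmt, (401,)).one()  # 401: the partition offset above
print(row.count)  # expected: 50000
```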
Fixed in https://github.com/scylladb/scylla-cluster-tests/pull/5335:
ubuntu@ip-10-4-1-126:~/sct-results/latest$ grep 'Count result:' sct.log | grep -c 50000
800
ubuntu@ip-10-4-1-126:~/sct-results/latest$ grep 'Count result:' sct.log | grep -v 50000
ubuntu@ip-10-4-1-126:~/sct-results/latest$
ubuntu@ip-10-4-1-126:~/sct-results/latest$ cat test_id
42bbe08b-c76f-4d7a-a019-8aa574535076
Installation details
Kernel Version: 5.15.0-1023-aws
Scylla version (or git commit hash): 2022.2.0~rc5-20221121.feb292600fc4 with build-id 172f9538efb0893c97c86cdf05622925159f4fa2
Cluster size: 4 nodes (i4i.4xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0b77b476432e37d90 (aws: eu-west-1)
Test: longevity-large-partition-4days-test-rq2
Test id: 10079e1e-18ba-4ffc-a6b0-e4a399b7d23c
Test name: scylla-staging/Longevity_yaron/longevity-large-partition-4days-test-rq2
Test config file(s):
Issue description
>>>>>>> scenario: s-b is set to write 50k rows per partition, for 800 partitions. This load ran and completed in the prepare stress step without any nemesis running. Then an SCT longevity validation check counts the rows of each partition and reports them. The result is that some partitions have fewer than 50k rows.
The yaml configuration is:
The count output has:
we can probably try more debugging, like retrying the counts, to see whether, for some reason, the row count is only updated after a while <<<<<<<
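If that route is taken, a hypothetical retry helper could look like the sketch below (reusing a driver session and statement like the one in the earlier snippet; all names are illustrative):

```python
import time

def count_with_retries(session, stmt, pk,
                       expected=50_000, retries=5, delay_s=30):
    """Re-run the per-partition COUNT with a delay between attempts,
    to see whether the result eventually converges to `expected`."""
    for attempt in range(retries):
        count = session.execute(stmt, (pk,)).one().count
        if count == expected:
            return count, attempt  # converged after `attempt` retries
        time.sleep(delay_s)
    return count, retries  # still short of `expected` after all retries
```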
$ hydra investigate show-monitor 10079e1e-18ba-4ffc-a6b0-e4a399b7d23c
$ hydra investigate show-logs 10079e1e-18ba-4ffc-a6b0-e4a399b7d23c
Logs:
Log links for testrun with test id 10079e1e-18ba-4ffc-a6b0-e4a399b7d23c

| Date | Log type | Link |
| --- | --- | --- |
| 20221130_110838 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221130_110838/grafana-screenshot-longevity-large-partition-4days-test-rq2-scylla-per-server-metrics-nemesis-20221130_110948-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png |
| 20221130_110838 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221130_110838/grafana-screenshot-overview-20221130_110839-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png |
| 20221130_195014 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221130_195014/grafana-screenshot-longevity-large-partition-4days-test-rq2-scylla-per-server-metrics-nemesis-20221130_195050-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png |
| 20221130_195014 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221130_195014/grafana-screenshot-overview-20221130_195014-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png |
| 20221130_195821 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221130_195821/grafana-screenshot-longevity-large-partition-4days-test-rq2-scylla-per-server-metrics-nemesis-20221130_195857-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png |
| 20221130_195821 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221130_195821/grafana-screenshot-overview-20221130_195821-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png |
| 20221201_091457 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_091457/grafana-screenshot-longevity-large-partition-4days-test-rq2-scylla-per-server-metrics-nemesis-20221201_091534-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png |
| 20221201_091457 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_091457/grafana-screenshot-overview-20221201_091457-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png |
| 20221201_094657 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_094657/grafana-screenshot-longevity-large-partition-4days-test-rq2-scylla-per-server-metrics-nemesis-20221201_094733-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png |
| 20221201_094657 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_094657/grafana-screenshot-overview-20221201_094657-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png |
| 20221201_125438 | db-cluster | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_125438/db-cluster-10079e1e.tar.gz |
| 20221201_125438 | loader-set | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_125438/loader-set-10079e1e.tar.gz |
| 20221201_125438 | monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_125438/monitor-set-10079e1e.tar.gz |
| 20221201_125438 | sct | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_125438/sct-runner-10079e1e.tar.gz |
Jenkins job URL