scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

[Perf][Tablets] Latency spike up to 86 ms during steady state step, when no cluster operations are running #21766

Open juliayakovlev opened 1 day ago

juliayakovlev commented 1 day ago

Packages

Scylla version: 6.3.0~dev-20241129.65949ce60780 with build-id d0921e78443678667ebaf5d8cdfda19428d03e6c

Kernel Version: 6.8.0-1019-aws

Issue description

The first step of the performance test with nemeses, named "Steady state", runs a 30-minute load without any cluster operations. The test runs with tablets enabled. In most test runs the latency during this step is low (less than 10 ms), but in 2 runs with mixed load (50% read + 50% write) the read latency was above 10 ms.

Compactions and bloom filter rebuilds were running during this time. The yellow line is the elasticity-test-ubuntu-db-node-75894e25-3 node. It looks like this node has a problem:

[6 images]

reader_concurrency_semaphore

Dec 01 03:27:36.797876 elasticity-test-ubuntu-db-node-75894e25-3 scylla[6288]:  [shard 0:stmt] reader_concurrency_semaphore - (rate limiting dropped 38455 similar messages) Semaphore user with 4/100 count and 93444/174986362 memory resources: timed out, dumping permit diagnostics:
                                                                               Trigger permit: count=0, memory=0, table=keyspace1.standard1, operation=data-query, state=waiting_for_admission
                                                                               Identified bottleneck(s): CPU

                                                                               permits        count        memory        table/operation/state
                                                                               3        3        68K        keyspace1.standard1/data-query/active/need_cpu
                                                                               1        1        24K        keyspace1.standard1/data-query/active/await
                                                                               11        0        0B        keyspace1.standard1/mutation-query/waiting_for_admission
                                                                               7629        0        0B        keyspace1.standard1/data-query/waiting_for_admission

                                                                               7644        4        91K        total

                                                                               Stats:
                                                                               permit_based_evictions: 0
                                                                               time_based_evictions: 0
                                                                               inactive_reads: 0
                                                                               total_successful_reads: 13659209
                                                                               total_failed_reads: 333872
                                                                               total_reads_shed_due_to_overload: 0
                                                                               total_reads_killed_due_to_kill_limit: 0
                                                                               reads_admitted: 13662427
                                                                               reads_enqueued_for_admission: 10007492
                                                                               reads_enqueued_for_memory: 0
                                                                               reads_admitted_immediately: 3993235
                                                                               reads_queued_because_ready_list: 1895825
                                                                               reads_queued_because_need_cpu_permits: 8111667
                                                                               reads_queued_because_memory_resources: 0
                                                                               reads_queued_because_count_resources: 0
                                                                               reads_queued_with_eviction: 0
                                                                               total_permits: 14000727
                                                                               current_permits: 7644
                                                                               need_cpu_permits: 4
                                                                               awaits_permits: 1
                                                                               disk_reads: 4
                                                                               sstables_read: 12
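
The counters in the dump above can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of Scylla or SCT: the helper names `parse_stats` and `queue_breakdown` are hypothetical, and reading each `reads_queued_because_*` counter as one reason for a read being enqueued is an assumption based only on the counter names. With the values copied from the log, about 71% of all permits were enqueued for admission, and about 81% of those were queued because of need_cpu permits, which is consistent with the "Identified bottleneck(s): CPU" line.

```python
import re

# Counters copied verbatim from the "Stats:" section of the log above.
STATS_TEXT = """
reads_admitted: 13662427
reads_enqueued_for_admission: 10007492
reads_admitted_immediately: 3993235
reads_queued_because_ready_list: 1895825
reads_queued_because_need_cpu_permits: 8111667
reads_queued_because_memory_resources: 0
reads_queued_because_count_resources: 0
total_permits: 14000727
"""

PREFIX = "reads_queued_because_"


def parse_stats(text):
    """Parse 'name: integer' lines into a dict; other lines are ignored."""
    pattern = re.compile(r"^\s*(\w+):\s*(\d+)\s*$", re.MULTILINE)
    return {m.group(1): int(m.group(2)) for m in pattern.finditer(text)}


def queue_breakdown(stats):
    """Return the enqueued share of permits and the share of each queue reason.

    Assumption: every reads_queued_because_* counter is one reason a read
    was enqueued for admission (inferred from the counter names only).
    """
    enqueued = stats["reads_enqueued_for_admission"]
    reasons = {k[len(PREFIX):]: v for k, v in stats.items() if k.startswith(PREFIX)}
    return {
        "enqueued_fraction": enqueued / stats["total_permits"],
        "reason_fractions": {
            name: (count / enqueued if enqueued else 0.0)
            for name, count in reasons.items()
        },
        "reasons_sum": sum(reasons.values()),
    }


stats = parse_stats(STATS_TEXT)
breakdown = queue_breakdown(stats)

print(f"enqueued share of permits: {breakdown['enqueued_fraction']:.1%}")
for name, fraction in breakdown["reason_fractions"].items():
    print(f"  queued because {name}: {fraction:.1%}")
```

In this dump the per-reason counters add up exactly to `reads_enqueued_for_admission`, and `reads_admitted_immediately` plus `reads_enqueued_for_admission` equals `total_permits`, so the breakdown accounts for every permit.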

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 3 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-09535e6a78b3a7b32 (aws: undefined_region)

Test: scylla-master-perf-regression-latency-650gb-elasticity

Test id: 75894e25-15cb-4919-a57c-78cf6756fdb8

Test name: scylla-master/perf-regression/scylla-master-perf-regression-latency-650gb-elasticity

Test method: performance_regression_test.PerformanceRegressionTest.test_latency_mixed_with_nemesis

Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 75894e25-15cb-4919-a57c-78cf6756fdb8`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=75894e25-15cb-4919-a57c-78cf6756fdb8)
- Show all stored logs command: `$ hydra investigate show-logs 75894e25-15cb-4919-a57c-78cf6756fdb8`

## Logs:

- **elasticity-test-ubuntu-db-node-75894e25-3** - [https://cloudius-jenkins-test.s3.amazonaws.com/75894e25-15cb-4919-a57c-78cf6756fdb8/20241130_221337/elasticity-test-ubuntu-db-node-75894e25-3-75894e25.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/75894e25-15cb-4919-a57c-78cf6756fdb8/20241130_221337/elasticity-test-ubuntu-db-node-75894e25-3-75894e25.tar.gz)
- **elasticity-test-ubuntu-db-node-75894e25-1** - [https://cloudius-jenkins-test.s3.amazonaws.com/75894e25-15cb-4919-a57c-78cf6756fdb8/20241130_221337/elasticity-test-ubuntu-db-node-75894e25-1-75894e25.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/75894e25-15cb-4919-a57c-78cf6756fdb8/20241130_221337/elasticity-test-ubuntu-db-node-75894e25-1-75894e25.tar.gz)
- **elasticity-test-ubuntu-db-node-75894e25-2** - [https://cloudius-jenkins-test.s3.amazonaws.com/75894e25-15cb-4919-a57c-78cf6756fdb8/20241130_221337/elasticity-test-ubuntu-db-node-75894e25-2-75894e25.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/75894e25-15cb-4919-a57c-78cf6756fdb8/20241130_221337/elasticity-test-ubuntu-db-node-75894e25-2-75894e25.tar.gz)
- **db-cluster-75894e25.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/75894e25-15cb-4919-a57c-78cf6756fdb8/20241201_075802/db-cluster-75894e25.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/75894e25-15cb-4919-a57c-78cf6756fdb8/20241201_075802/db-cluster-75894e25.tar.gz)
- **sct-runner-events-75894e25.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/75894e25-15cb-4919-a57c-78cf6756fdb8/20241201_075802/sct-runner-events-75894e25.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/75894e25-15cb-4919-a57c-78cf6756fdb8/20241201_075802/sct-runner-events-75894e25.tar.gz)
- **sct-75894e25.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/75894e25-15cb-4919-a57c-78cf6756fdb8/20241201_075802/sct-75894e25.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/75894e25-15cb-4919-a57c-78cf6756fdb8/20241201_075802/sct-75894e25.log.tar.gz)
- **loader-set-75894e25.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/75894e25-15cb-4919-a57c-78cf6756fdb8/20241201_075802/loader-set-75894e25.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/75894e25-15cb-4919-a57c-78cf6756fdb8/20241201_075802/loader-set-75894e25.tar.gz)
- **monitor-set-75894e25.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/75894e25-15cb-4919-a57c-78cf6756fdb8/20241201_075802/monitor-set-75894e25.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/75894e25-15cb-4919-a57c-78cf6756fdb8/20241201_075802/monitor-set-75894e25.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/perf-regression/job/scylla-master-perf-regression-latency-650gb-elasticity/23/)

[Argus](https://argus.scylladb.com/test/e5b4605c-4796-4e91-95e0-56dff1dfa341/runs?additionalRuns[]=75894e25-15cb-4919-a57c-78cf6756fdb8)

juliayakovlev commented 1 day ago

There is a similar issue in enterprise: https://github.com/scylladb/scylla-enterprise/issues/5013