scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

High latency on mixed workload in steady state - 6ms #8987

Open yarongilor opened 3 years ago

yarongilor commented 3 years ago

The 250GB performance-with-nemesis test showed degraded results for both mixed and read latencies over the last week: latency rose from ~5 ms to 6 ms.

Test: performance_regression_test.PerformanceRegressionTest.test_latency_mixed_with_nemesis_mixed
Test start time: 2021-07-02 05:14:23.614179
Test id: ac4535df-4ba2-49f1-9c5b-f50e1fe53669
Scylla Server Version: 4.6.dev.20210628.c0c1e2601 with build-id be170b54ddb69a5ec11af5b81c1b366ad2b895de
Setup Details:
  ami_id_db_scylla: ami-0fcc4c76ea9901d17
  cluster_backend: aws
  instance_type_db: i3.2xlarge
  instance_type_loader: c5.2xlarge
  instance_type_monitor: t3.large
  region_name: ['eu-west-1']
Amount of reactor stalls: 1424

full performance results per nemesis:

https://docs.google.com/document/d/1TF_hSgoJyMPFqjyNvRJzZUBnPAZ2uL7yJRlKFeqw4vA/edit?usp=sharing

Test details:

Log links for testrun with test id ac4535df-4ba2-49f1-9c5b-f50e1fe53669

| Date | Log type | Link |
|-----------------|-------------|------|
| 20190101_010101 | prometheus | https://cloudius-jenkins-test.s3.amazonaws.com/ac4535df-4ba2-49f1-9c5b-f50e1fe53669/prometheus_snapshot_20210702_050008.tar.gz |
| 20190101_010101 | prometheus | https://cloudius-jenkins-test.s3.amazonaws.com/ac4535df-4ba2-49f1-9c5b-f50e1fe53669/prometheus_snapshot_20210702_111456.tar.gz |
| 20190101_010101 | prometheus | https://cloudius-jenkins-test.s3.amazonaws.com/ac4535df-4ba2-49f1-9c5b-f50e1fe53669/prometheus_snapshot_20210702_112457.tar.gz |
| 20190101_010101 | prometheus | https://cloudius-jenkins-test.s3.amazonaws.com/ac4535df-4ba2-49f1-9c5b-f50e1fe53669/prometheus_snapshot_20210702_120638.tar.gz |
| 20210702_045115 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/ac4535df-4ba2-49f1-9c5b-f50e1fe53669/20210702_045115/grafana-screenshot-overview-20210702_045115-perf-latency-nemesis-perf-v10-monitor-node-ac4535df-1.png |
| 20210702_045115 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/ac4535df-4ba2-49f1-9c5b-f50e1fe53669/20210702_045115/grafana-screenshot-scylla-per-server-metrics-nemesis-20210702_045456-perf-latency-nemesis-perf-v10-monitor-node-ac4535df-1.png |
| 20210702_110536 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/ac4535df-4ba2-49f1-9c5b-f50e1fe53669/20210702_110536/grafana-screenshot-overview-20210702_110536-perf-latency-nemesis-perf-v10-monitor-node-ac4535df-1.png |
| 20210702_110536 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/ac4535df-4ba2-49f1-9c5b-f50e1fe53669/20210702_110536/grafana-screenshot-scylla-per-server-metrics-nemesis-20210702_110848-perf-latency-nemesis-perf-v10-monitor-node-ac4535df-1.png |
| 20210702_111537 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/ac4535df-4ba2-49f1-9c5b-f50e1fe53669/20210702_111537/grafana-screenshot-overview-20210702_111537-perf-latency-nemesis-perf-v10-monitor-node-ac4535df-1.png |
| 20210702_111537 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/ac4535df-4ba2-49f1-9c5b-f50e1fe53669/20210702_111537/grafana-screenshot-scylla-per-server-metrics-nemesis-20210702_111850-perf-latency-nemesis-perf-v10-monitor-node-ac4535df-1.png |
| 20210702_115629 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/ac4535df-4ba2-49f1-9c5b-f50e1fe53669/20210702_115629/grafana-screenshot-overview-20210702_115629-perf-latency-nemesis-perf-v10-monitor-node-ac4535df-1.png |
| 20210702_115629 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/ac4535df-4ba2-49f1-9c5b-f50e1fe53669/20210702_115629/grafana-screenshot-scylla-per-server-metrics-nemesis-20210702_115942-perf-latency-nemesis-perf-v10-monitor-node-ac4535df-1.png |
| 20210702_120717 | db-cluster | https://cloudius-jenkins-test.s3.amazonaws.com/ac4535df-4ba2-49f1-9c5b-f50e1fe53669/20210702_120717/db-cluster-ac4535df.zip |
| 20210702_120717 | loader-set | https://cloudius-jenkins-test.s3.amazonaws.com/ac4535df-4ba2-49f1-9c5b-f50e1fe53669/20210702_120717/loader-set-ac4535df.zip |
| 20210702_120717 | monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/ac4535df-4ba2-49f1-9c5b-f50e1fe53669/20210702_120717/monitor-set-ac4535df.zip |
| 20210702_120717 | sct-runner | https://cloudius-jenkins-test.s3.amazonaws.com/ac4535df-4ba2-49f1-9c5b-f50e1fe53669/20210702_120717/sct-runner-ac4535df.zip |

yarongilor commented 3 years ago

reactor_stalls.log

avikivity commented 3 years ago

Please decode the stalls and paste them here. Also ask @fgelcer about how to avoid duplicate reports.

yarongilor commented 3 years ago

> Please decode the stalls and paste them here. Also ask @fgelcer about how to avoid duplicate reports.

@bhalevy , can you please help decoding and building the tree with your script?

roydahan commented 3 years ago

@fgelcer, please review and tell us whether this is a duplicate, something we already know about, or an issue at all.

bhalevy commented 3 years ago

> > Please decode the stalls and paste them here. Also ask @fgelcer about how to avoid duplicate reports.
>
> @bhalevy , can you please help decoding and building the tree with your script?

@yarongilor I'd like everybody to be able to use this tool. It's not merged yet, but you can find it in https://github.com/bhalevy/seastar/tree/stall-analyser

You'll need the exact binary (with a matching build-id) and the log file containing the stalls. The defaults are usually good enough, so you can simply run scripts/stall-analyser -e path/to/scylla stalls.log. Note that clang produces binaries that cause the ELF parsers to spew errors, so you should usually redirect stderr, e.g. by appending 2> /dev/null.
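For anyone scripting this, here is a minimal Python sketch of the two steps above: check that the binary's build-id matches the one in the report, then run the analyser with stderr discarded. The function names are hypothetical and not part of the tool. The script path and the `-e` flag are taken from the command above. Reading the build-id with `readelf -n` is my assumption about how you would check it, not something the tool requires.

```python
import re
import subprocess

# `readelf -n` prints the GNU build-id note as "Build ID: <hex digits>".
BUILD_ID_RE = re.compile(r"Build ID:\s*([0-9a-fA-F]+)")


def parse_build_id(readelf_notes):
    """Return the build-id found in `readelf -n` output, or None if absent."""
    match = BUILD_ID_RE.search(readelf_notes)
    return match.group(1).lower() if match else None


def binary_build_id(binary_path):
    """Read the build-id of an ELF binary (assumes binutils' readelf is installed)."""
    notes = subprocess.run(
        ["readelf", "-n", binary_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
        check=True,
    ).stdout
    return parse_build_id(notes)


def stall_analyser_cmd(binary_path, stalls_log, script="scripts/stall-analyser"):
    """Build the command line quoted above; `script` is relative to the seastar checkout."""
    return [script, "-e", binary_path, stalls_log]


def analyse_stalls(binary_path, stalls_log, expected_build_id):
    """Refuse to decode with a mismatched binary, then run the analyser."""
    actual = binary_build_id(binary_path)
    if actual != expected_build_id.lower():
        raise ValueError(
            f"build-id mismatch: binary has {actual}, report says {expected_build_id}"
        )
    result = subprocess.run(
        stall_analyser_cmd(binary_path, stalls_log),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # same effect as appending 2> /dev/null
        text=True,
        check=True,
    )
    return result.stdout
```

For this run, `expected_build_id` would be the one from the report, be170b54ddb69a5ec11af5b81c1b366ad2b895de.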

bhalevy commented 3 years ago

The stall-analyser PR is https://github.com/scylladb/seastar/pull/880