scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

Read throughput performance degradation #10061

Closed roydahan closed 1 year ago

roydahan commented 2 years ago

Comparing scylla 4.6.rc5 and previous releases we see a throughput degradation of:

19% less throughput than 4.6.rc2 (86K vs 106K op/s).
22% less throughput than 4.5.3 (86K vs 109K op/s).

Installation details
Scylla version (or git commit hash): Scylla version 4.6.rc5-0.20220203.5694ec189 with build-id f5d85bf5abe6d2f9fd3487e2469ce1c34304cc14
Cluster size: 3
OS (RHEL/CentOS/Ubuntu/AWS AMI): AWS AMI

Test name: perf-regression.test_read

After looking at the results above and comparing 4.6.rc5 and 4.5.3, I noticed that in the 4.6 run there are still compaction leftovers from the write population phase "dragged" into the read phase. This wasn't expected, so in order to make the test tighter, I ran another version of the test, this time waiting for "no compactions" between the population phase and the read phase.

In this case, the degradation was reduced to:

15% less than 4.5.3 (91K vs 107K op/s) (the new result of 4.5.3 dropped a bit).
XX% less than 4.6.rc2 (running now).
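Just to make the arithmetic behind these percentages explicit (a quick check in Python, using the rounded op/s figures quoted above; small differences come from that rounding):

def degradation(good_kops, bad_kops):
    """Relative throughput drop, in percent."""
    return (good_kops - bad_kops) / good_kops * 100

print(f"4.6.rc5 vs 4.6.rc2:    {degradation(106, 86):.1f}%")  # ~18.9%, reported as 19%
print(f"4.6.rc5 vs 4.5.3:      {degradation(109, 86):.1f}%")  # ~21.1%, reported as ~22%
print(f"tighter test vs 4.5.3: {degradation(107, 91):.1f}%")  # ~15.0%, reported as 15%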

While looking at the monitors from the 2 runs, I could see 2 interesting things:

  1. During the "read" phase there were still occasionally some compactions (1 per node) - see compaction messages below.
  2. Even when no compactions were running, the throughput was lower than in the "good run".

Compactions "waking up" during read phase:

< t:2022-02-08 22:15:25,355  perf-regression-perf-v8--db-node-30e43911-2 !INFO    | scylla:  [shard 2] compaction - [Compact keyspace1.standard1 98457d60-892c-11ec-89fc-48032cee3182] Compacting [/var/lib/scylla/data/keyspace1/standard1-5114b6b0892511ec905542c93c7ad1fb/md-458-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-5114b6b0892511ec905542c93c7ad1fb/md-442-big-Data.db:level=0:origin=compaction]
< t:2022-02-08 22:15:40,361  perf-regression-perf-v8--db-node-30e43911-2 !INFO    | scylla:  [shard 6] compaction - [Compact keyspace1.standard1 a130d0a0-892c-11ec-94b4-48042cee3182] Compacting [/var/lib/scylla/data/keyspace1/standard1-5114b6b0892511ec905542c93c7ad1fb/md-374-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-5114b6b0892511ec905542c93c7ad1fb/md-382-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-5114b6b0892511ec905542c93c7ad1fb/md-358-big-Data.db:level=0:origin=compaction]
< t:2022-02-08 22:15:55,375  perf-regression-perf-v8--db-node-30e43911-2 !INFO    | scylla:  [shard 2] compaction - [Compact keyspace1.standard1 98457d60-892c-11ec-89fc-48032cee3182] Compacted 2 sstables to [/var/lib/scylla/data/keyspace1/standard1-5114b6b0892511ec905542c93c7ad1fb/md-466-big-Data.db:level=0]. 418MB to 363MB (~87% of original) in 31218ms = 11MB/s. ~1653120 total partitions merged to 1435824.
< t:2022-02-08 22:16:10,378  perf-regression-perf-v8--db-node-30e43911-2 !INFO    | scylla:  [shard 6] compaction - [Compact keyspace1.standard1 a130d0a0-892c-11ec-94b4-48042cee3182] Compacted 3 sstables to [/var/lib/scylla/data/keyspace1/standard1-5114b6b0892511ec905542c93c7ad1fb/md-390-big-Data.db:level=0]. 574MB to 439MB (~76% of original) in 28919ms = 15MB/s. ~2267648 total partitions merged to 1731541.
< t:2022-02-08 22:17:25,399  perf-regression-perf-v8--db-node-30e43911-2 !INFO    | scylla:  [shard 3] compaction - [Compact keyspace1.standard1 dc291910-892c-11ec-a776-48022cee3182] Compacting [/var/lib/scylla/data/keyspace1/standard1-5114b6b0892511ec905542c93c7ad1fb/md-331-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-5114b6b0892511ec905542c93c7ad1fb/md-339-big-Data.db:level=0:origin=memtable]
< t:2022-02-08 22:22:10,502  perf-regression-perf-v8--db-node-30e43911-2 !INFO    | scylla:  [shard 3] compaction - [Compact keyspace1.standard1 dc291910-892c-11ec-a776-48022cee3182] Compacted 2 sstables to [/var/lib/scylla/data/keyspace1/standard1-5114b6b0892511ec905542c93c7ad1fb/md-347-big-Data.db:level=0]. 333MB to 301MB (~90% of original) in 299121ms = 1MB/s. ~1320704 total partitions merged to 1189518.
< t:2022-02-08 22:24:25,582  perf-regression-perf-v8--db-node-30e43911-2 !INFO    | scylla:  [shard 1] compaction - [Compact keyspace1.standard1 d6b5e020-892d-11ec-8750-48052cee3182] Compacting [/var/lib/scylla/data/keyspace1/standard1-5114b6b0892511ec905542c93c7ad1fb/md-425-big-Data.db:level=0:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-5114b6b0892511ec905542c93c7ad1fb/md-465-big-Data.db:level=0:origin=compaction]
< t:2022-02-08 22:25:43,757  perf-regression-perf-v8--db-node-30e43911-1 !INFO    | scylla:  [shard 4] compaction - [Compact keyspace1.standard1 0c3ba860-892e-11ec-aaf5-4f172f0e0783] Compacting [/var/lib/scylla/data/keyspace1/standard1-5114b6b0892511ec905542c93c7ad1fb/md-404-big-Data.db:level=0:origin=compaction,/var/lib/scylla/data/keyspace1/standard1-5114b6b0892511ec905542c93c7ad1fb/md-380-big-Data.db:level=0:origin=compaction]
< t:2022-02-08 22:28:40,681  perf-regression-perf-v8--db-node-30e43911-2 !INFO    | scylla:  [shard 1] compaction - [Compact keyspace1.standard1 d6b5e020-892d-11ec-8750-48052cee3182] Compacted 2 sstables to [/var/lib/scylla/data/keyspace1/standard1-5114b6b0892511ec905542c93c7ad1fb/md-481-big-Data.db:level=0]. 768MB to 600MB (~78% of original) in 257756ms = 2MB/s. ~3022976 total partitions merged to 2361818.
< t:2022-02-08 22:33:28,978  perf-regression-perf-v8--db-node-30e43911-1 !INFO    | scylla:  [shard 4] compaction - [Compact keyspace1.standard1 0c3ba860-892e-11ec-aaf5-4f172f0e0783] Compacted 2 sstables to [/var/lib/scylla/data/keyspace1/standard1-5114b6b0892511ec905542c93c7ad1fb/md-420-big-Data.db:level=0]. 583MB to 496MB (~85% of original) in 453641ms = 1MB/s. ~2297216 total partitions merged to 1955330.
< t:2022-02-08 22:33:55,805  perf-regression-perf-v8--db-node-30e43911-2 !INFO    | scylla:  [shard 7] compaction - [Compact keyspace1.standard1 2fc8cdc0-892f-11ec-a40e-48072cee3182] Compacting [/var/lib/scylla/data/keyspace1/standard1-5114b6b0892511ec905542c93c7ad1fb/md-359-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-5114b6b0892511ec905542c93c7ad1fb/md-375-big-Data.db:level=0:origin=memtable]
< t:2022-02-08 22:41:40,986  perf-regression-perf-v8--db-node-30e43911-2 !INFO    | scylla:  [shard 7] compaction - [Compact keyspace1.standard1 2fc8cdc0-892f-11ec-a40e-48072cee3182] Compacted 2 sstables to [/var/lib/scylla/data/keyspace1/standard1-5114b6b0892511ec905542c93c7ad1fb/md-383-big-Data.db:level=0]. 339MB to 305MB (~90% of original) in 468038ms = 653kB/s. ~1342720 total partitions merged to 1207870.
< t:2022-02-08 23:01:14,741  perf-regression-perf-v8--db-node-30e43911-1 !INFO    | scylla:  [shard 6] compaction - [Compact keyspace1.standard1 fc643b00-8932-11ec-83bd-4f192f0e0783] Compacting [/var/lib/scylla/data/keyspace1/standard1-5114b6b0892511ec905542c93c7ad1fb/md-350-big-Data.db:level=0:origin=memtable,/var/lib/scylla/data/keyspace1/standard1-5114b6b0892511ec905542c93c7ad1fb/md-366-big-Data.db:level=0:origin=memtable]
< t:2022-02-08 23:06:14,942  perf-regression-perf-v8--db-node-30e43911-1 !INFO    | scylla:  [shard 6] compaction - [Compact keyspace1.standard1 fc643b00-8932-11ec-83bd-4f192f0e0783] Compacted 2 sstables to [/var/lib/scylla/data/keyspace1/standard1-5114b6b0892511ec905542c93c7ad1fb/md-374-big-Data.db:level=0]. 342MB to 305MB (~89% of original) in 302958ms = 1MB/s. ~1354112 total partitions merged to 1207745.
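As an aside, the "Compacted" lines above already show the compaction write rate collapsing as the read phase progresses (11-15 MB/s at first, then ~1 MB/s, finally 653 kB/s). A minimal sketch, assuming only the log format shown above, that pulls those rates out of a log file:

import re

# Matches the trailing "... in <duration>ms = <rate>MB/s" (or kB/s) part of the
# "Compacted" lines in the excerpt above.
RATE_RE = re.compile(r"Compacted .* in (\d+)ms = (\d+)(MB|kB)/s")

def compaction_rates(lines):
    """Yield (duration_ms, rate_in_mb_per_s) for every 'Compacted' log line."""
    for line in lines:
        m = RATE_RE.search(line)
        if m:
            ms, rate, unit = int(m.group(1)), int(m.group(2)), m.group(3)
            yield ms, (rate / 1000 if unit == "kB" else rate)

# For the excerpt above this yields 11, 15, 1, 2, 1, 0.653 and 1 MB/s, in order.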

Monitor of "bad run" - 4.6.rc5: http://34.239.125.222:3000/d/JeAPF0anz/scylla-per-server-metrics-nemesis-master?orgId=1&from=1644355254464&to=1644361742455&var-by=instance&var-cluster=&var-dc=All&var-node=All&var-shard=All

Live screenshot: https://snapshots.raintank.io/dashboard/snapshot/lkRLFBEOK0RdHSOUCiscZdHAbGXAT6P1

Screen Shot 2022-02-10 at 15 34 22

Monitor of "good run" - 4.5.3: http://3.86.108.186:3000/d/0BjFMAank/scylla-per-server-metrics-nemesis-master?orgId=1&from=1644347313998&to=1644353912400&var-by=instance&var-cluster=&var-dc=All&var-node=All&var-shard=All

Live screenshot: https://snapshots.raintank.io/dashboard/snapshot/bKN3tIUn27LtIbV65w81C4M57rmFdvfY

Screen Shot 2022-02-10 at 15 36 06

slivne commented 2 years ago

@roydahan do we have a monitor for 4.6.rc2?

slivne commented 2 years ago

quick look

4.6.rc5

Screenshot from 2022-02-10 15-50-22

we have queued sstable reads and read failures

on 4.5

Screenshot from 2022-02-10 15-50-34

slivne commented 2 years ago

On 4.6 we have queued sstable reads, but there are no cache misses?

Screenshot from 2022-02-10 15-53-46

slivne commented 2 years ago

It's on a single shard on one node.

Screenshot from 2022-02-10 16-06-42

slivne commented 2 years ago

@roydahan I tried to dig through the metrics and can't find what's causing this.

Does this reproduce? The queued sstable reads and failed reads indicate a problem.

I need a setup that we can run the load on and then be able to open some logs and try to debug this

avikivity commented 2 years ago

There is no reason for 4.6 rc5 to be slower than 4.6 rc2.

Maybe we picked up a kernel regression from a newer AMI?

roydahan commented 2 years ago

@slivne it kind of reproduced in 4 runs (there is variance, but the highest one wasn't even close to the lowest of the others):

2 runs without waiting for compactions - 86K & 90K op/s
2 runs with waiting for compactions - 85K & 91K op/s

I started another run with "keep" for the cluster.

The runs for the previous RCs have finished with the following results:

4.6.rc4 - 98K op/s
4.6.rc2 - 97K op/s (the original one was much better, 106K op/s, so there is some variance here as well).

Monitors: rc4 - http://3.88.70.194:3000/d/vQPBYBanz/scylla-per-server-metrics-nemesis-master?orgId=1&from=1644499796529&to=1644506132228

rc2 - http://3.92.61.214:3000/d/B2SU9Bank/scylla-per-server-metrics-nemesis-master?orgId=1&from=1644507219095&to=1644514524315

slivne commented 2 years ago

rc4 run has queued sstable reads mainly on one node and one shard

Screenshot from 2022-02-10 21-33-49

rc2 has a bit less

I don't understand what's reading from sstables and why? Everything is returned from cache.

@roydahan just to check - is this with a shard-aware driver or not? I think we decided not to change the stack so we can have consistent results and testing.

roydahan commented 2 years ago

I think it's without a shard-aware driver, but I don't remember; I'll check.

I just noticed that in all RC5 runs there are "reads failed", a few thousand of them, while in all others there are none.

roydahan commented 2 years ago

@slivne from the number of connections it looks to me like a shard-aware driver, but from the loader output I don't see the "using optimized driver" message (maybe it's an older version that doesn't have this print).

roydahan commented 2 years ago

Last run with rc5 is much better, 107K op/s. I have no idea why, but here we didn't have any failed reads.

Cluster:

perf-regression-perf-v8--db-node-8c9a810b-1 scylla-db i3.2xlarge 52.90.200.136 172.30.0.227  
perf-regression-perf-v8--db-node-8c9a810b-2 scylla-db i3.2xlarge 52.90.113.131 172.30.0.170  
perf-regression-perf-v8--db-node-8c9a810b-3 scylla-db i3.2xlarge 54.83.144.127 172.30.0.253  
perf-regression-perf-v8--loader-node-8c9a810b-1 loader c4.2xlarge 35.173.232.104 172.30.0.72  
perf-regression-perf-v8--loader-node-8c9a810b-2 loader c4.2xlarge 52.21.87.134 172.30.0.51  
perf-regression-perf-v8--loader-node-8c9a810b-3 loader c4.2xlarge 54.172.186.48 172.30.0.223  
perf-regression-perf-v8--loader-node-8c9a810b-4 loader c4.2xlarge 3.84.15.162 172.30.0.93  
perf-regression-perf-v8--monitor-node-8c9a810b-1 monitor t3.small 3.83.254.130 172.30.0.74

The stress runs from 4 loaders with the following command:

cassandra-stress read no-warmup cl=QUORUM duration=50m -schema keyspace=keyspace1 'replication(factor=3)' -port jmx=6868 -mode cql3 native -rate threads=100 -pop 'dist=gauss(1..30000000,15000000,1500000)'  -node 172.30.0.227
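For context, the -pop 'dist=gauss(1..30000000,15000000,1500000)' option makes cassandra-stress draw partition keys from a normal distribution centred on the middle of the populated range, so the reads hammer a relatively small, hot subset of the data. A rough, illustrative sketch of that key distribution (Python; cassandra-stress's exact sampling may differ slightly):

import random

LOW, HIGH, MEAN, SIGMA = 1, 30_000_000, 15_000_000, 1_500_000

def next_key():
    """Sample a partition key roughly the way dist=gauss(1..30000000,15000000,1500000) does."""
    while True:
        k = round(random.gauss(MEAN, SIGMA))
        if LOW <= k <= HIGH:
            return k

# ~99.7% of keys fall within MEAN +/- 3*SIGMA, i.e. about 9M of the 30M populated
# partitions -- a hot set small enough to be served almost entirely from cache,
# which matches the later observation that everything is returned from cache.
keys = [next_key() for _ in range(100_000)]
hot_fraction = sum(abs(k - MEAN) <= 3 * SIGMA for k in keys) / len(keys)
print(f"fraction of reads within +/-3 sigma: {hot_fraction:.3f}")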
slivne commented 2 years ago

so the interpretation is not correct

        sm::make_gauge("active_reads", [this] { return max_count_system_concurrent_reads - _system_read_concurrency_sem.available_resources().count; },
                       sm::description("Holds the number of currently active read operations from \"system\" keyspace tables. "),
                       {system_label_instance}),

        sm::make_gauge("queued_reads", [this] { return _streaming_concurrency_sem.waiters(); },
                       sm::description("Holds the number of currently queued read operations on behalf of streaming."),
                       {streaming_label_instance}),

Since we moved the semaphore before the cache, the metric no longer tells us whether there are sstable reads or not.

@denesb please confirm (and we need to open a bug for monitoring to change the title) - I vaguely recall you warned all of us about that - and I just fell into the trap.
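To illustrate the trap with a toy model (Python, not Scylla code): once the concurrency permit is taken before the cache lookup, any gauge derived from that semaphore counts cache-served reads as well, so a non-zero "active/queued reads" value no longer implies sstable I/O.

import asyncio

MAX_CONCURRENT_READS = 100
_sem = asyncio.Semaphore(MAX_CONCURRENT_READS)
_in_flight = 0   # what a gauge derived from the semaphore effectively reports

def active_reads():
    # analogous to "max_count - available_resources().count" in the gauge quoted above
    return _in_flight

async def read(key, cache):
    global _in_flight
    async with _sem:                    # permit is acquired BEFORE the cache lookup,
        _in_flight += 1                 # so cache hits are counted as "active reads" too:
        try:                            # a non-zero gauge no longer implies sstable I/O
            if key in cache:
                return cache[key]
            await asyncio.sleep(0.01)   # stand-in for an actual sstable read
            return f"value-{key}"
        finally:
            _in_flight -= 1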

slivne commented 2 years ago

@roydahan we need to get a bad run that has the numbers super high and try to understand what's going on - or maybe we can find the logs for the bad run and look at them.

denesb commented 2 years ago

@denesb please confirm (and we need to open a bug for monitoring to change the title) - I vaguely recall you warned all of us about that - and I just fell into the trap.

Yes, these metrics don't refer to sstable reads anymore. I talked about it several times but forgot to open an issue about it -- I did that now: https://github.com/scylladb/scylla/issues/10065 describes the problem in detail.

slivne commented 2 years ago

@roydahan I need a system that has the performance degradation so we will be able to check why.

Let's try and run 3 more runs in parallel to see if we can reproduce this.

avikivity commented 2 years ago

@denesb we'll also need to adjust the dashboards.

roydahan commented 2 years ago

@slivne below you can find a live system that reproduces the issue.

Live monitor - http://34.238.42.30:3000/d/T1MKveank/scylla-per-server-metrics-nemesis-master?orgId=1&from=now-1h&to=now&refresh=30s

Cluster details:

Name NodeType RunByUser keep version Instance Type IPv4 Public IP Private IP Address
perf-regression-perf-v8--db-node-35a28237-1 scylla-db roy alive Perf-Regression i3.2xlarge 34.201.108.111 172.30.0.160  
 perf-regression-perf-v8--db-node-35a28237-2 scylla-db roy alive Perf-Regression i3.2xlarge 3.87.62.39 172.30.0.76  
 perf-regression-perf-v8--db-node-35a28237-3 scylla-db roy alive Perf-Regression i3.2xlarge 44.201.219.150 172.30.0.31  
 perf-regression-perf-v8--loader-node-35a28237-1 loader roy alive Perf-Regression c4.2xlarge 3.91.214.160 172.30.0.178  
 perf-regression-perf-v8--loader-node-35a28237-2 loader roy alive Perf-Regression c4.2xlarge 54.82.178.195 172.30.0.226  
 perf-regression-perf-v8--loader-node-35a28237-3 loader roy alive Perf-Regression c4.2xlarge 52.90.33.43 172.30.0.225  
 perf-regression-perf-v8--loader-node-35a28237-4 loader roy alive Perf-Regression c4.2xlarge 54.210.231.232 172.30.0.138  
 perf-regression-perf-v8--monitor-node-35a28237-1 monitor roy alive Perf-Regression t3.small 34.238.42.30 172.30.0.42

The load that is running in parallel from 4 different loaders is: cassandra-stress read no-warmup cl=QUORUM duration=50m -schema keyspace=keyspace1 'replication(factor=3)' -port jmx=6868 -mode cql3 native -rate threads=100 -pop 'dist=gauss(1..30000000,15000000,1500000)' -node 172.30.0.160

slivne commented 2 years ago

comparing the nodes

node

172.30.0.31 is the slow one on shard 7

shard_7

Non-wild guess: it's related to networking interrupts somehow.

115:        620          0          0        808 1173482621          0          0          0  xen-pirq    -msi-x     eth0-Tx-Rx-0
116:          0         96  925795206          0          0          0       1061          0  xen-pirq    -msi-x     eth0-Tx-Rx-1
117:          0       1126        421         73          0  821241071          0          0  xen-pirq    -msi-x     eth0-Tx-Rx-2
118:          0          0          0        117          0          0 1296145608          0  xen-pirq    -msi-x     eth0-Tx-Rx-3
119:          0         86       1479  745827048        118          0          0          0  xen-pirq    -msi-x     eth0-Tx-Rx-4
120:  862650189          0          0          0          0         77          0        649  xen-pirq    -msi-x     eth0-Tx-Rx-5
121:          0 1168290948          0          0        228          0         88       1263  xen-pirq    -msi-x     eth0-Tx-Rx-6
122:          0          0          0          0          0        609          0 1668543774  xen-pirq    -msi-x     eth0-Tx-Rx-7

Trying to move them around

scyllaadm@perf-regression-perf-v8--db-node-35a28237-3:~$ cat /proc/irq/122/smp_affinity
0080
scyllaadm@perf-regression-perf-v8--db-node-35a28237-3:~$ cat /proc/irq/121/smp_affinity
0002
root@perf-regression-perf-v8--db-node-35a28237-3:/home/scyllaadm# echo "0082" >  /proc/irq/121/smp_affinity
root@perf-regression-perf-v8--db-node-35a28237-3:/home/scyllaadm# echo "0002" >  /proc/irq/122/smp_affinity
root@perf-regression-perf-v8--db-node-35a28237-3:/home/scyllaadm# echo "0080" >  /proc/irq/121/smp_affinity
scyllaadm@perf-regression-perf-v8--db-node-35a28237-3:~$ cat /proc/irq/122/smp_affinity
0002
scyllaadm@perf-regression-perf-v8--db-node-35a28237-3:~$ cat /proc/irq/121/smp_affinity
0080
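For reference, the values written to /proc/irq/<n>/smp_affinity above are hexadecimal CPU bitmasks; decoding them (small Python sketch) shows the swap simply moves each queue's interrupts between CPU 1 and CPU 7 (with 0082 briefly allowing both during the switch):

def cpus_in_mask(hex_mask):
    """Return the CPU numbers selected by a /proc/irq/<n>/smp_affinity hex bitmask."""
    mask = int(hex_mask, 16)
    return [cpu for cpu in range(mask.bit_length()) if mask & (1 << cpu)]

print(cpus_in_mask("0080"))  # [7]    -> handled on CPU 7
print(cpus_in_mask("0002"))  # [1]    -> handled on CPU 1
print(cpus_in_mask("0082"))  # [1, 7] -> allowed on both CPUs 1 and 7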

we get this solved

Screenshot from 2022-02-16 11-26-20

if we switch them back

scyllaadm@perf-regression-perf-v8--db-node-35a28237-3:~$ sudo su
root@perf-regression-perf-v8--db-node-35a28237-3:/home/scyllaadm# echo "0082" >  /proc/irq/121/smp_affinity
root@perf-regression-perf-v8--db-node-35a28237-3:/home/scyllaadm# echo "0080" >  /proc/irq/122/smp_affinity
root@perf-regression-perf-v8--db-node-35a28237-3:/home/scyllaadm# echo "0002" >  /proc/irq/121/smp_affinity
root@perf-regression-perf-v8--db-node-35a28237-3:/home/scyllaadm# exit
scyllaadm@perf-regression-perf-v8--db-node-35a28237-3:~$ cat /proc/irq/121/smp_affinity
0002
scyllaadm@perf-regression-perf-v8--db-node-35a28237-3:~$ cat /proc/irq/122/smp_affinity
0080

it goes back to being bad

Screenshot from 2022-02-16 11-29-20

@avikivity I need help here - it's clear that the IRQ mapping causes this, but it's unclear to me why one specific mapping makes things worse than another.

I also checked other nodes; the mapping of networking interrupts to cores is not the same.

3.87.62.39

115:        620 1197432899          0        630          0          0          0          0  xen-pirq    -msi-x     eth0-Tx-Rx-0
116:          0         71          0          0          0          0 1062825516        464  xen-pirq    -msi-x     eth0-Tx-Rx-1
117:          0        452        127 1459657133          0          0          0        895  xen-pirq    -msi-x     eth0-Tx-Rx-2
118:          0        142          0         82          0          0        299  647245083  xen-pirq    -msi-x     eth0-Tx-Rx-3
119:          0          0         52          0         86  928382895          0        406  xen-pirq    -msi-x     eth0-Tx-Rx-4
120:       1238          0          0        255 1033351298         95          0          0  xen-pirq    -msi-x     eth0-Tx-Rx-5
121:          0          0 1207553938          0        507          0         58          0  xen-pirq    -msi-x     eth0-Tx-Rx-6
122: 1152556022          0         23        320          0          0          0       1919  xen-pirq    -msi-x     eth0-Tx-Rx-7
avikivity commented 2 years ago

The interrupt mappings look symmetric to me. One cpu is getting more interrupts, but that's because more connections happened to hash to it.

slivne commented 2 years ago

All cores have 4 CQL connections

Screenshot from 2022-02-16 12-20-37

Yes the mapping looks symmetric to me as well - but when I change the interrupt mapping - we get better performance.

slivne commented 2 years ago

after the call I went and checked kernel versions

scyllaadm@perf-regression-perf-v8--db-node-35a28237-3:~$ scylla --version
4.6.rc5-0.20220203.5694ec189
scyllaadm@perf-regression-perf-v8--db-node-35a28237-3:~$ cat /proc/version
Linux version 5.11.0-1022-aws (buildd@lgw01-amd64-036) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #23~20.04.1-Ubuntu SMP Mon Nov 15 14:03:19 UTC 2021
scyllaadm@ip-10-99-17-17:~$ scylla --version
4.6.rc2-0.20220102.e8a1cfb6f
scyllaadm@ip-10-99-17-17:~$ cat /proc/version
Linux version 5.4.0-1035-aws (buildd@lgw01-amd64-057) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #37-Ubuntu SMP Wed Jan 6 21:01:57 UTC 2021

so there is a potential that the kernel is the issue - @avikivity do you want us to pin the base image for 4.6 (like 4.6.rc2) so we can spin 4.6.rc6 and test this out?

avikivity commented 2 years ago

All cores have 4 CQL connections

Screenshot from 2022-02-16 12-20-37

Yes the mapping looks symmetric to me as well - but when I change the interrupt mapping - we get better performance.

That's connections per Scylla shard. But connections are assigned to kernel CPUs (NIC queues) by hashing the n-tuple; this is a different type of load balancing.
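A toy illustration of that point (Python; the real NIC typically uses a Toeplitz-style RSS hash, and the ports and connection counts here are made up): connections land on RX queues by hashing their 4-tuple, so even with a perfectly symmetric queue-to-CPU mapping, some queues - and therefore some CPUs - can end up handling more connections and more interrupts than others.

import collections, random, zlib

NUM_QUEUES = 8   # one RX queue per CPU on an 8-vCPU instance, as in the tables above

def rx_queue(src_ip, src_port, dst_ip, dst_port):
    # Stand-in for the NIC's RSS hash over the connection 4-tuple.
    return zlib.crc32(f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()) % NUM_QUEUES

random.seed(1)
per_queue = collections.Counter(
    rx_queue("172.30.0.72", random.randint(10_000, 60_000), "172.30.0.31", 9042)
    for _ in range(32)   # e.g. 32 client connections from one loader
)
print(sorted(per_queue.items()))   # per-queue connection counts are typically uneven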

avikivity commented 2 years ago

after the call I went and checked kernel versions

scyllaadm@perf-regression-perf-v8--db-node-35a28237-3:~$ scylla --version
4.6.rc5-0.20220203.5694ec189
scyllaadm@perf-regression-perf-v8--db-node-35a28237-3:~$ cat /proc/version
Linux version 5.11.0-1022-aws (buildd@lgw01-amd64-036) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #23~20.04.1-Ubuntu SMP Mon Nov 15 14:03:19 UTC 2021
scyllaadm@ip-10-99-17-17:~$ scylla --version
4.6.rc2-0.20220102.e8a1cfb6f
scyllaadm@ip-10-99-17-17:~$ cat /proc/version
Linux version 5.4.0-1035-aws (buildd@lgw01-amd64-057) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #37-Ubuntu SMP Wed Jan 6 21:01:57 UTC 2021

so there is a potential that the kernel is the issue - @avikivity do you want us to pin the base image for 4.6 (like 4.6.rc2) so we can spin 4.6.rc6 and test this out?

We can test this without a new spin.

roydahan commented 2 years ago

I'm running a job with rc2, then I'll update the kernel to the latest version and re-run on the same cluster.

roydahan commented 2 years ago

@slivne please find below the summary of runs I did:

  1. "official run" with 4.6.rc2 (kernel version 5.4.0) ---> 96-97K op/s - monitor

(side note: This run had some "reads failed" on one node, no reads misses (all from cache) but none seems related to our issue).

  1. upgraded the kernel on the same cluster with 4.6.rc2 to 5.11.0 ---> 94-95K op/s (3 different runs after a run to warmup the cache due to the reboot after kernel upgrade) - monitor

  2. Upgrade the same cluster to 4.6.rc5 --> 89-90K op/s (2 different runs after a run to warmup the cache) - monitor

roydahan commented 2 years ago

This is the list of commits between 4.6.rc2 and 4.6.rc5:

* 5694ec189 - (tag: scylla-4.6.rc5) release: prepare for 4.6.rc5 (3 weeks ago) <Yaron Kaikov>
* 34d470967 - commitlog: Fix double clearing of _segment_allocating shared_future. (3 weeks ago) <Calle Wilund>
* 61db571a4 - (tag: scylla-4.6.rc4) commitlog: Ensure we never have more than one new_segment call at a time (3 weeks ago) <Calle Wilund>
* 5b5a300a9 - util: cached_file: Fix corruption after memory reclamation was triggered from population (3 weeks ago) <Tomasz Grabiec>
* 148a65d0d - Update seastar submodule (gratuitous exceptions on allocation failure) (3 weeks ago) <Avi Kivity>
* e3ad14d55 - Point seastar submodule at scylla-seastar.git (3 weeks ago) <Avi Kivity>
* 2b506c2d4 - commitlog: Ensure we don't run continuation (task switch) with queues modified (4 weeks ago) <Calle Wilund>
* 50aad1c66 - Merge 'scylla_raid_setup: use mdmonitor only when RAID level > 0' from Takuya ASADA (4 weeks ago) <Avi Kivity>
* 7bf3f37cd - release: prepare for 4.6.rc4 (4 weeks ago) <Yaron Kaikov>
* 0f7f8585f - reader_permit: release_base_resources(): also update _resources (5 weeks ago) <Botond Dénes>
* 2c65c4a56 - Merge 'db: range_tombstone_list: Deoverlap empty range tombstones' from Tomasz Grabiec (5 weeks ago) <Pavel Emelyanov>
* f85cd289b - Merge "repair: make sure there is one permit per repair with count res" from Botond (5 weeks ago) <Avi Kivity>
* 5e661af9a - release: prepare for 4.6.rc3 (5 weeks ago) <Beni Peled>
* 5629b67d2 - messaging_service: Make dc/rack encryption check for connection more strict (5 weeks ago) <Calle Wilund>
* ad632cf7f - dist: fix scylla-housekeeping uuid file chmod call (6 weeks ago) <Takuya ASADA>
* ca24bebcf - sstables/partition_index_cache: destroy entry ptr on error (7 weeks ago) <Botond Dénes>
* 7dc5abb6f - commitlog: Don't allow error_handler to swallow exception (7 weeks ago) <Calle Wilund>
* e8a1cfb6f - (tag: scylla-4.6.rc2) release: prepare for 4.6.rc2 (7 weeks ago) <Yaron Kaikov>

Running now with 4.6.rc3 and 4.6.rc4 to narrow the list of suspected commits.

slivne commented 2 years ago

OK - we agreed to attempt and do a bit more bisect on rc3/rc4 and see what we get

slivne commented 2 years ago

I am not sure it will be easy to run the bisect between 4.6.rc2 <--> 4.6.rc5 as we have coredumps related to commitlog

roydahan commented 2 years ago

@eliransin this is the job I used to reproduce the runs, you can use it as well: https://jenkins.scylladb.com/job/Perf-Regression/job/scylla-release-perf-regression-aws-test-throughput-wait-for-compactions/

eliransin commented 2 years ago

I have started the bisect with this - please notice I pushed the commitlog fixes down so we won't have any coredumps during the tests:

7cef06f71 (es/bisect-performance, refs/bisect/bad) release: prepare for 4.6.rc5
a2458ae12 util: cached_file: Fix corruption after memory reclamation was triggered from population
16d521498 Update seastar submodule (gratuitous exceptions on allocation failure)
0f9e67362 Point seastar submodule at scylla-seastar.git
1c921f852 commitlog: Ensure we don't run continuation (task switch) with queues modified
81c8b0742 Merge 'scylla_raid_setup: use mdmonitor only when RAID level > 0' from Takuya ASADA
0d4cdbcc4 release: prepare for 4.6.rc4
30bb5ea8b reader_permit: release_base_resources(): also update _resources
2aa2863d6 Merge 'db: range_tombstone_list: Deoverlap empty range tombstones' from Tomasz Grabiec
24bba1202 Merge "repair: make sure there is one permit per repair with count res" from Botond
b3eb2574a release: prepare for 4.6.rc3
2cdcbcd36 messaging_service: Make dc/rack encryption check for connection more strict
2fbdb2d93 dist: fix scylla-housekeeping uuid file chmod call
44e543ced sstables/partition_index_cache: destroy entry ptr on error
181d66720 commitlog: Don't allow error_handler to swallow exception
b01577e9c commitlog: Fix double clearing of _segment_allocating shared_future.
00e33206e commitlog: Ensure we never have more than one new_segment call at a time
e8a1cfb6f (tag: scylla-4.6.rc2, es/tls_fix_consolidation, refs/bisect/good-e8a1cfb6f8742e688049bad895be7cde6b8617e8) release: prepare for 4.6.rc2

One more piece of information: I had to run with my own version of the test: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/eliran/job/scylla-release-perf-regression-aws-test-throughput-wait-for-compactions/ The reason is that tests failed on not finding the node log (whatever that is), so I took a shot at ignoring this error (not throwing) and doubling the timeout, and it seems fine. Someone might want to look at it in the future.

eliransin commented 2 years ago

Bisection step: 24bba1202 (HEAD) Merge "repair: make sure there is one permit per repair with count res" from Botond
It looks good: observing 104K-107K op/s
test id: e2f7cb83-7428-4dcd-b058-7dcb90eb9fee
image
https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/eliran/job/scylla-release-perf-regression-aws-test-throughput-wait-for-compactions/1
ami: ami-0b804e78c7521ebcc

eliransin commented 2 years ago

Bisection step: 81c8b0742 (HEAD) Merge 'scylla_raid_setup: use mdmonitor only when RAID level > 0' from Takuya ASADA
It looks good: observing 100K-105K op/s
test id: f704c39c-9128-4031-894f-3f99de420e47
image
https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/eliran/job/scylla-release-perf-regression-aws-test-throughput-wait-for-compactions/2/
ami: ami-045599a4c8bb23abf

eliransin commented 2 years ago

Bisection step: 16d521498 Update seastar submodule (gratuitous exceptions on allocation failure)
Not good, but not very bad either: observing 98K-100K op/s
test id: 6c27a36b-2136-40fb-8a8a-c9e0fe8c3bad
image
https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/eliran/job/scylla-release-perf-regression-aws-test-throughput-wait-for-compactions/3/
ami: ami-048bd52e6f8510adb

eliransin commented 2 years ago

Bisection step: a2458ae12 (HEAD) util: cached_file: Fix corruption after memory reclamation was triggered from population
Good: ~102K op/s (:open_mouth: )
test id: ab12703f-debe-455a-a93c-cf358eb38dfb
image
https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/eliran/job/scylla-release-perf-regression-aws-test-throughput-wait-for-compactions/5/
ami: ami-010085e0318dbbb52

eliransin commented 2 years ago

It seems that the performance drop is not 100% reproducible? It's either that, or changing the order of the commits actually solved the performance problem somehow. I will do a code comparison to rule out the latter and will run 4.6.rc5 again to rule out the former.

[eliransin@eliransin-desktop scylla-scratch]$ git diff 7cef06f71 scylla-4.6.rc5 | wc -l
0

It's not the order of commits. Let's wait for the rerun of 4.6.rc5.

eliransin commented 2 years ago

Running with the official 4.6.rc5 shows degradation.
image
https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/eliran/job/scylla-release-perf-regression-aws-test-throughput-wait-for-compactions/6/

eliransin commented 2 years ago

The next step is to run with a newly created 4.6.rc5; if no degradation happens we will need to search for differences between the AMIs.

eliransin commented 2 years ago

Running with a newly created 4.6.rc5 shows good results:
test id: ba00620d-c77e-44af-b3bd-a8285b169bc7
image

ami: ami-0108ce1d0be64576e https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/eliran/job/scylla-release-perf-regression-aws-test-throughput-wait-for-compactions/9/

I would like to mention that I also saw one degraded run but I did not record it. I will do a couple more runs on the official 4.6.rc5

eliransin commented 2 years ago

Another run with official 4.6.rc5: image

The hypothesis now is that it is a function of the machines. The next step is to try to get a cluster that performs significantly slower (~10%), run iotune on it several times, and compare the numbers to get:

  1. the distribution of the results.
  2. the deviation from the precomputed numbers that come with our AMIs.

eliransin commented 2 years ago

I have tried to reproduce the bad performance results over 5 times now and I get consistent ~100K op/s throughput results. I still believe it is a result of the instances not performing uniformly. Some things that should be done in order to prove it:

  1. @roydahan please make sure not to destroy clusters which have failed performance tests (especially if the results are not stable). We will then be able to sample iotune results on them and see if the numbers are inconsistent.
  2. We might want to try and sample iotune several times regardless of this bug and see how the results are distributed (see the sketch at the end of this comment).
  3. We might also want to sample some machines that have been running scylla for a while (in the cloud) and see if there is a meaningful degradation in the results; if there is, we might need a screening procedure in the cloud so we don't start a cluster on worn-out hardware.

@slivne please let me know if you want me to measure some random instance with iotune several times, or whether we should wait for a cluster with performance degradation. @roydahan - if you still believe there is a problem, a stable reproducer is required, as the old reproducer no longer works at all, nor is it stable.
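A small sketch of what point 2 above could look like in practice (Python, with made-up numbers): run iotune several times on the same instance and compare the spread of the measured figures against the value the AMI ships with.

import statistics

# Hypothetical repeated iotune read-IOPS measurements from one node (made-up numbers),
# compared against the figure the AMI ships with (also made up).
samples = [183_000, 181_500, 178_900, 184_200, 141_000]
baseline = 180_000

mean = statistics.mean(samples)
cv = statistics.stdev(samples) / mean              # run-to-run spread
worst_drop = (baseline - min(samples)) / baseline  # worst run vs. shipped value

print(f"mean={mean:,.0f} IOPS, cv={cv:.1%}, worst run {worst_drop:.1%} below baseline")
# A large spread, or a worst run far below the shipped value, would support the
# "instances not performing uniformly / worn-out hardware" hypothesis.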

roydahan commented 2 years ago

@KnifeyMoloko, I discussed this issue with @eliransin and I think that the new HW benchmarks you added to SCT can help us here. What we want to do is re-run this test with the new tools that benchmark the HW (cassandra-fio and sysbench). If the degradation reproduces, we will want to look at the output and possibly also run iotune on them. (As a later task, we may want to add iotune to the set of things we run and compare it to the existing io.properties.)

The re-runs should all use 4.6.rc5.

eliransin commented 2 years ago

@KnifeyMoloko have you had a chance to rerun this with the new framework?

roydahan commented 2 years ago

@aleksbykov let's have a task to reproduce this issue with the new hardware benchmarks we now collect.

roydahan commented 2 years ago

@eliransin An update: the degradation still exists in 5.0.rc6; however, it looks like in 5.1.dev (master) the numbers are good again.

Here are more details:

5.0.rc6.20220523.338edcc02 - test_read(wait_for_compaction) - 90K op/s (same as 4.6).

5.1.dev.20220504.b26a3da584cc - test_read(without explicitly waiting for compaction) - 93K op/s (considered as the same)

5.1.dev.20220513.8e8dc2c930f5 - test_read(without explicitly waiting for compaction) - 156K op/s (considered as the same)

We don't have another AMI between these two dates; here is the list of patches that went in:

slivne commented 2 years ago

@avikivity can you please add your thoughts on what can cause this

avikivity commented 2 years ago

/cc @michoecho

avikivity commented 2 years ago

Maybe https://github.com/scylladb/scylla/commit/8e8dc2c930f5541531c3326127f012e7bc6292ef ?

eliransin commented 2 years ago

Let's start with HL commits: git log --oneline --first-parent b26a3da584cc..8e8dc2c930f5

8e8dc2c930 Revert "table: disable_auto_compaction: stop ongoing compactions"
333f6c5ec9 mutation_partition_view: do_accept_gently: keep clustering_row key on stack
eb6f4cc839 Merge 'dependencies: add rust' from Wojciech Mitros
00ed4ac74c batchlog_manager: warn when a batch fails to replay
4a5842787e memtable_list: clear_and_add: let caller clear the old memtables
eeca9f24e8 Merge 'Docs: improve debugging.md' from Benny Halevy
b2450886d7 Merge 'Debugging.md relocatable package updates' from Botond Dénes
a9dfe5a8f4 scylla_sysconfig_setup: handle >=32CPUs correctly
043b1c7f89 Update seastar submodule. Unfortunately, also requires two changes to Scylla itself to make it still compile - see below
209c2f5d99 sstables: define generation_type for sstables
3abe4a2696 Adjust scripts/pull_github_pr.sh to check tests status
7501a075bd sstables/index_reader: push down eof() check to advance_to(index_bound&, dht::ring_position_view)
77b1db475c locator: Do not enforce public ip address for broadcast_rpc_address
f703e8ded5 Merge 'New failure detector for Raft' from Kamil Braun
2c39c4c284 Merge 'Handle errors during snapshot' from Benny Halevy
1c6163d51f Merge 'cql3: expr: allow bind markers in collection literals' from Michał Sala
fd27fbfe64 Merge "Add user types carrier helper" from Pavel Emelyanov
ca700bf417 scripts/pull_github_pr.sh: clean up after failed cherry-pick
598ce8111d repair: Handle discarded stopping future
3b4af86ad9 proxy (and suddenly redis): Don't check latency_counter.is_start()
48e3117ebc compaction: move propagate_replacement() into private namespace
3e0f81180e docs: disable link checker
81af9342f1 Merge "Simplify gossiper state map API" from Pavel E
94f677b790 Merge 'sstables/index_reader: short-circuit fast-forward-to when at EOF' from Botond Dénes
603dd72f9e CQL: Replace assert by exception on invalid auth opcode
fb1a9e97c9 cql3: restrictions: statement_restrictions: pass arguments to std::bind_front by reference
1ecb87b7a8 Merge 'Harden table truncate' from Benny Halevy
287c01ab4d Merge ' sstables: consumer: reuse the fragmented_temporary_buffer in read_bytes()' from Michał Chojnowski
8e99d3912e compaction: LCS: don't write to disengaged optional on compaction completion
5682393693 compaction: Fix use-after-move when retrying maintenance compaction
9623589c77 Merge 'Futurize data_read_resolver::resolve and to_data_query_result' from Benny Halevy
eeec502aee Merge 'gms: feature_service: reduce boilerplate to add a cluster feature' from Avi Kivity
78350a7e1b cdc: Ensure columns removed from log table are registered as dropped
29e09a3292 db/config: command line arguments logger_stdout_timestamps and logger_ostream_type are no longer ignored
063d26bc9e system_keyspace/config: Swallow string->value cast exception

I will start building now - 5 AMIs to split the list into segments of around ~7 commits each.
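A sketch of how that split can be produced (Python; the commit list is the git log output above), picking roughly every 7th commit as an AMI build point:

def bisect_points(commits, num_segments=5):
    """Split an ordered commit list (newest first) into num_segments chunks and
    return the newest commit of each chunk as a candidate AMI build point."""
    step = -(-len(commits) // num_segments)   # ceiling division
    return [commits[i] for i in range(0, len(commits), step)]

# 'commits' would be the ~35 entries printed by
#   git log --oneline --first-parent b26a3da584cc..8e8dc2c930f5
# (listed above); with 35 commits and 5 segments this picks indices 0, 7, 14, 21, 28,
# which line up with the 5 commits/AMIs listed in the next comment.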

eliransin commented 2 years ago

I am aware of the fact that some of those commits are meaningless, but I think it is faster to proceed blindly:

8e8dc2c Revert "table: disable_auto_compaction: stop ongoing compactions" - ami-0f794b188bda487bb
a9dfe5a scylla_sysconfig_setup: handle >=32CPUs correctly - ami-0a7c08dcb707f44e3
2c39c4c Merge 'Handle errors during snapshot' from Benny Halevy - ami-00270c68c013066e2
3e0f811 docs: disable link checker - ami-055cb2148ed955c3f
8e99d39 compaction: LCS: don't write to disengaged optional on compaction completion - ami-07092a9d7acf93e73

I will edit this as the AMIs are being created.

eliransin commented 2 years ago

@roydahan would you run those AMIs and share the results?