Significant performance drop after one node got ENOSPC

scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra

http://scylladb.com

GNU Affero General Public License v3.0


Significant performance drop after one node got ENOSPC #7583

Open enaydanov opened 3 years ago

enaydanov commented 3 years ago

Installation details

Scylla version (or git commit hash): 4.3.rc0-0.20201028.bbef05ae3c with build-id adad9e1dae9eb9dbebab38bd5e7258d5c7360350
Cluster size: 6 nodes
Instance type: i3.4xlarge
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-01cc969208fae7a3a (eu-west-1)
SCT test id: 24e0cdec-f777-4e29-a7e9-e0a5b8087db7
Job link: https://jenkins.scylladb.com/job/scylla-4.3/job/longevity/job/longevity-50gb-4days-test/3/
Test name: longevity-50gb-4days-test
Test config: https://github.com/scylladb/scylla-cluster-tests/blob/branch-4.3/test-cases/longevity/longevity-50GB-4days-authorization-and-tls-ssl.yaml
DB logs: https://cloudius-jenkins-test.s3.amazonaws.com/24e0cdec-f777-4e29-a7e9-e0a5b8087db7/20201109_111236/db-cluster-24e0cdec.zip
Target node for ENOSPC nemesis: 10.0.0.180

Similar issue (maybe the same): #7575

$ hydra investigate show-monitor 24e0cdec-f777-4e29-a7e9-e0a5b8087db7

bhalevy commented 3 years ago

This scenario is more similar to https://github.com/scylladb/scylla/issues/7580 since the node that runs out of space will isolate itself so it will appear as DOWN to the others.

slivne commented 3 years ago

We had ENOSPC runs in 4.2. Let's check whether we saw the same drop there; if not, it's a potential regression and we can bisect (or narrow it down).

roydahan commented 3 years ago

I don't see high latency that is specifically caused by the ENOSPC nemesis in 4.2. However, I do see some high latencies that start during nodetool cleanup and continue for a long period into other nemeses. It may be the same case here, because the screenshots show high latency spikes even before that nemesis.

An example of the same test in 4.2: hydra investigate show-monitor 5ee0a19f-c935-45e5-b681-d0cd35e4b50f

roydahan commented 3 years ago

Also, in this case the performance issue doesn't start with the ENOSPC; it starts shortly after nodetool cleanup and during repair (same as in 4.2). You can use show-monitor or this temporary link to see it: http://13.49.246.135:33631/d/Q5m2bkTGz/scylla-per-server-metrics-nemesis-master?orgId=1&from=1604631680725&to=1604680184427

slivne commented 3 years ago

Do we run a select * from system.large_cells?

roydahan commented 3 years ago

No.

slivne commented 3 years ago

It doesn't seem to be a regression.
It happens in nodetool cleanup.
- All of them are STCS. There is an older bug on compression algorithms where one of them is "worse"; need to look it up and continue.