enaydanov opened 3 years ago
This scenario is more similar to https://github.com/scylladb/scylla/issues/7580, since the node that runs out of space will isolate itself and so will appear as DOWN to the others.
We had ENOSPC runs in 4.2 - let's check if we saw the same drop there. If not, it's a potential regression and we can bisect (or at least narrow it down).
I don't see high latency that is specifically caused by the ENOSPC nemesis in 4.2. However, I see some high latencies that start during nodetool cleanup and continue for a long period into other nemeses. It may be the same case here as well, because from the screenshots I can see that there are high latency spikes even before that nemesis.
An example of the same test in 4.2: `hydra investigate show-monitor 5ee0a19f-c935-45e5-b681-d0cd35e4b50f`
Also in this case the performance issue doesn't start with the ENOSPC; it starts shortly after nodetool cleanup and during repair (same as in 4.2). You can use show-monitor or this temporary link to see it: http://13.49.246.135:33631/d/Q5m2bkTGz/scylla-per-server-metrics-nemesis-master?orgId=1&from=1604631680725&to=1604680184427
Do we do a `select * from system.large_cells`?
No.
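For reference, such a check would just read Scylla's large-data system table directly; a minimal sketch (assuming cqlsh access to one of the cluster nodes - the table itself is per-node and populated by Scylla's large-data detector):

```cql
-- Sketch: list cells that Scylla's large-data detector has flagged.
-- Run via cqlsh on the target node; the table is node-local.
SELECT * FROM system.large_cells;
```

An empty result here would suggest oversized cells are not the cause of the latency spikes.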
Installation details
Scylla version (or git commit hash): 4.3.rc0-0.20201028.bbef05ae3c with build-id adad9e1dae9eb9dbebab38bd5e7258d5c7360350
Cluster size: 6 nodes
Instance type: i3.4xlarge
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-01cc969208fae7a3a (eu-west-1)
SCT test id: 24e0cdec-f777-4e29-a7e9-e0a5b8087db7
Job link: https://jenkins.scylladb.com/job/scylla-4.3/job/longevity/job/longevity-50gb-4days-test/3/
Test name: longevity-50gb-4days-test
Test config: https://github.com/scylladb/scylla-cluster-tests/blob/branch-4.3/test-cases/longevity/longevity-50GB-4days-authorization-and-tls-ssl.yaml
DB logs: https://cloudius-jenkins-test.s3.amazonaws.com/24e0cdec-f777-4e29-a7e9-e0a5b8087db7/20201109_111236/db-cluster-24e0cdec.zip
Target node for ENOSPC nemesis: 10.0.0.180
Similar issue (maybe the same): #7575