juliayakovlev opened this issue 3 years ago
@juliayakovlev what's the Scylla version? What's the compaction strategy?
Please send the log of 10.142.0.100.
The image below shows that the rebuild caused that node to use 2x more space. We need to figure out whether that's a regression in rebuild.
Looks like this test never passed. Is it a regression?
@juliayakovlev what's the Scylla version? What's the compaction strategy?
Compaction strategy:
scylla[5989]: [shard 0] cql_server - exception while processing connection: std::system_error (error GnuTLS:-10, The specified session has been invalidated for some reason.)
Scylla version (or git commit hash): 4.5.rc2-0.20210602.d92a26636
Please send the log of 10.142.0.100.
Unfortunately we have no logs, and the node was terminated. I am trying to reproduce the test now.
@raphaelsc
The issue is reproduced. The test is running and the cluster is alive.
The test runs the same configuration, with the CorruptThenRebuild nemesis every 10 minutes on the same node.
scylla-per-server-metrics-nemesis
The node where rebuild runs is 35.231.182.209.
@juliayakovlev Nice that you managed to reproduce. Could you share the logs of 35.231.182.209 before the job finishes? Did we run such a test against an older version, e.g. Scylla 4.4?
@raphaelsc There has been no change in the rebuild-related code for a while; I checked the commits from 4.4.0 to 4.5-rc2.
A wild guess is that rebuild failed in the middle and it performed a retry of the rebuild automatically. Without RBNO, rebuild will restream all the data again. Then the node runs out of disk.
Logs: 35.231.182.209.zip
This test was run on 4.4 and never passed, but those were different problems: the nemeses were not started at all during those tests.
I just noticed that these instances in GCE are using only 6 local SSD drives (each disk is 375 GB, 2.25 TB total). I don't think that's enough for the 1 TB dataset, even though we use LCS.
@juliayakovlev can you please retest with this change: https://github.com/scylladb/scylla-cluster-tests/pull/3702
Running with gce_n_local_ssd_disk_db: 16: https://jenkins.scylladb.com/job/scylla-4.5/job/Reproducers/job/longevity-1tb-5days-gce-test-8908/4/console
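For reference, the raw local-SSD capacity per node implied by these settings (using the 375 GB per-disk size mentioned above; usable space after RAID/formatting will be somewhat lower):

6 x 375 GB = 2.25 TB (previous configuration)
16 x 375 GB = 6 TB (new configuration, consistent with the ~5 TB of free space reported below)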
@raphaelsc @asias
Test with gce_n_local_ssd_disk_db: 16 finished.
Installation details
Kernel version: 5.12.9-1.el8.elrepo.x86_64
Scylla version (or git commit hash): 4.5.rc2-0.20210602.d92a26636
Cluster size: 4 nodes (n1-highmem-16)
Scylla running with shards number (live nodes):
longevity-tls-1tb-7d-reproduc-db-node-3330ad3e-0-1 (35.243.202.204 | 10.142.0.36): 16 shards
longevity-tls-1tb-7d-reproduc-db-node-3330ad3e-0-2 (34.73.244.18 | 10.142.0.63): 16 shards
longevity-tls-1tb-7d-reproduc-db-node-3330ad3e-0-3 (35.237.51.105 | 10.142.0.157): 16 shards
longevity-tls-1tb-7d-reproduc-db-node-3330ad3e-0-4 (35.196.47.81 | 10.142.0.166): 16 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): -
(gce: eu-west-1)
Test: longevity-1tb-5days-gce-test-8908
Test name: longevity_test.LongevityTest.test_custom_time
Test config file(s) includes cassandra-stress commands:
Rebuild was run 6 times on the same node (longevity-tls-1tb-7d-reproduc-db-node-3330ad3e-0-3 (35.237.51.105 | 10.142.0.157)). Scylla per server metrics nemesis dashboard
Free space on this node before the first rebuild was 5 TB (same as on all other nodes). OS Metrics dashboard
With each rebuild the free space kept shrinking. OS Metrics dashboard
After the 6th rebuild only 301 GB remained. OS Metrics dashboard
Scylla is active on node 3, but its status in the cluster is DN:
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns  Host ID                               Rack
UN  10.142.0.63   1000.43 GB  256     ?     aa263713-b3ab-45c4-948d-fae378628326  d
DN  10.142.0.157  3.54 TB     256     ?     584b1bd2-e75e-4420-8b49-692cf50f95eb  d
UN  10.142.0.166  809.97 GB   256     ?     fe47b385-e6d5-40ab-a7cb-553fae2d1639  d
UN  10.142.0.36   958.52 GB   256     ?     e67cc7bc-c461-4269-8652-244cfa95bfa6  d
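A rough back-of-the-envelope check of the numbers reported above is consistent with the full-restream-per-retry explanation:

space consumed on the target node: ~5 TB free before the first rebuild minus ~301 GB free after the 6th, i.e. ~4.7 TB across 6 rebuilds, or roughly 0.8 TB per rebuild
load on the target node after the run: 3.54 TB, vs. roughly 0.8-1 TB on the other nodes

In other words, each rebuild appears to leave behind roughly one extra copy of the node's ~1 TB of data that compaction never manages to deduplicate in time.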
Rebuild failed on the mview keyspace (default compaction strategy):
Jun 24 04:31:02 longevity-tls-1tb-7d-reproduc-db-node-3330ad3e-0-3 scylla[245890]: [shard 0] range_streamer - Rebuild with 10.142.0.166 for keyspace=mview, streaming [200, 225) out of 258 ranges
Jun 24 04:31:24 longevity-tls-1tb-7d-reproduc-db-node-3330ad3e-0-3 scylla[245890]: [shard 0] range_streamer - Rebuild with 10.142.0.36 for keyspace=mview failed, took 464.576 seconds: streaming::stream_exception (Stream failed)
Jun 24 04:31:30 longevity-tls-1tb-7d-reproduc-db-node-3330ad3e-0-3 scylla[245890]: [shard 0] range_streamer - Rebuild with 10.142.0.63 for keyspace=mview failed, took 470.355 seconds: streaming::stream_exception (Stream failed)
Jun 24 04:31:30 longevity-tls-1tb-7d-reproduc-db-node-3330ad3e-0-3 scylla[245890]: [shard 0] range_streamer - Rebuild with 10.142.0.166 for keyspace=mview failed, took 470.366 seconds: streaming::stream_exception (Stream failed)
Jun 24 04:31:30 longevity-tls-1tb-7d-reproduc-db-node-3330ad3e-0-3 scylla[245890]: [shard 0] range_streamer - Rebuild failed, took 5852 seconds, nr_ranges_remaining=172
Jun 24 04:31:30 longevity-tls-1tb-7d-reproduc-db-node-3330ad3e-0-3 scylla[245890]: [shard 0] range_streamer - Rebuild failed to stream. Will retry in 60 seconds ...
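The "Will retry in 60 seconds" message above means the retries happen automatically, so a single nemesis run can trigger several full restreams. A quick way to count them on the node (a sketch, assuming Scylla logs to journald under the scylla-server unit, as on the default images):

$ journalctl -u scylla-server | grep -c 'range_streamer - Rebuild failed to stream'
$ journalctl -u scylla-server | grep 'range_streamer - Rebuild failed, took'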
Cluster nodes and monitor are alive
Restore Monitor Stack command: $ hydra investigate show-monitor 3330ad3e-4ac8-4b30-b537-f49a3bb76f94
Show all stored logs command: $ hydra investigate show-logs 3330ad3e-4ac8-4b30-b537-f49a3bb76f94
Test id: 3330ad3e-4ac8-4b30-b537-f49a3bb76f94
Logs: db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/3330ad3e-4ac8-4b30-b537-f49a3bb76f94/20210624_050242/db-cluster-3330ad3e.zip
From looking at the nemesis code, we don't run "nodetool cleanup" after rebuild. I don't know whether we should; I couldn't find any reference to it in our documentation.
@slivne do we need to run cleanup after every rebuild?
We do not need cleanup for rebuild at all. The problem is that without RBNO, rebuild brings all the data again for each rebuild operation. If compaction cannot keep up and reduce the duplication, the node runs out of disk space.
I tried to download the metrics and can't, so I can't really know whether in this case compaction is not able to keep up with the traffic, or
whether we have a case in which a view using STCS is getting more and more sstables with the exact same data, reaching the space amplification that STCS can hit, which is x7-x12...
So I restarted another run to try to understand what's going on: https://jenkins.scylladb.com/job/scylla-4.5/job/Reproducers/job/longevity-1tb-5days-gce-test-8908/5/
Shlomi, in this case it’s LCS. Why weren’t you able to download the metrics? Did you try the hydra show monitor?
Not sure why the metrics did not download...
In any case the next run has metrics that can be used.
This is the backlog; it is increasing.
If compaction was keeping up, the backlog should be decreasing, and it's not.
What is happening here is that for some reason LCS is not able to reduce the space while the rebuild is being done.
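For anyone reproducing this without the dashboards, the same backlog can be sampled straight from the node's Prometheus endpoint (a sketch; it assumes the default metrics port 9180 and simply greps for any gauge with "backlog" in its name rather than guessing the exact metric):

$ curl -s http://10.142.0.157:9180/metrics | grep -i backlog

If compaction were keeping up, the values sampled over time would trend down rather than up.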
And here is another proof: we have huge sstables.
6.9G -rw-r--r--. 1 scylla-manager scylla 6.9G Jul 5 12:24 md-45894-big-Data.db
6.9G -rw-r--r--. 1 scylla-manager scylla 6.9G Jul 5 12:24 md-45672-big-Data.db
2.8G -rw-r--r--. 1 scylla-manager scylla 2.8G Jul 5 13:02 md-50219-big-Data.db
5.5G -rw-r--r--. 1 scylla-manager scylla 5.5G Jul 5 13:02 md-50221-big-Data.db
52G -rw-r--r--. 1 scylla-manager scylla 52G Jul 5 13:16 md-50235-big-Data.db
52G -rw-r--r--. 1 scylla-manager scylla 52G Jul 5 13:17 md-50232-big-Data.db
53G -rw-r--r--. 1 scylla-manager scylla 53G Jul 5 13:18 md-50226-big-Data.db
52G -rw-r--r--. 1 scylla-manager scylla 52G Jul 5 13:19 md-50234-big-Data.db
48G -rw-r--r--. 1 scylla-manager scylla 48G Jul 5 13:20 md-50566-big-Data.db
45G -rw-r--r--. 1 scylla-manager scylla 45G Jul 5 13:20 md-50575-big-Data.db
3.1G -rw-r--r--. 1 scylla-manager scylla 3.1G Jul 5 13:20 md-51447-big-Data.db
2.7G -rw-r--r--. 1 scylla-manager scylla 2.7G Jul 5 13:24 md-51713-big-Data.db
39G -rw-r--r--. 1 scylla-manager scylla 39G Jul 5 13:47 md-52637-big-Data.db
47G -rw-r--r--. 1 scylla-manager scylla 47G Jul 5 13:54 md-53063-big-Data.db
49G -rw-r--r--. 1 scylla-manager scylla 49G Jul 5 13:58 md-53233-big-Data.db
3.6G -rw-r--r--. 1 scylla-manager scylla 3.6G Jul 5 14:14 md-55957-big-Data.db
43G -rw-r--r--. 1 scylla-manager scylla 43G Jul 5 14:29 md-54918-big-Data.db
44G -rw-r--r--. 1 scylla-manager scylla 44G Jul 5 14:30 md-54907-big-Data.db
47G -rw-r--r--. 1 scylla-manager scylla 47G Jul 5 14:31 md-56277-big-Data.db
44G -rw-r--r--. 1 scylla-manager scylla 44G Jul 5 14:31 md-54666-big-Data.db
46G -rw-r--r--. 1 scylla-manager scylla 46G Jul 5 14:31 md-54952-big-Data.db
36G -rw-r--r--. 1 scylla-manager scylla 36G Jul 5 14:54 md-56599-big-Data.db
2.8G -rw-r--r--. 1 scylla-manager scylla 2.8G Jul 5 15:05 md-59362-big-Data.db
4.2G -rw-r--r--. 1 scylla-manager scylla 4.2G Jul 5 15:05 md-59375-big-Data.db
38G -rw-r--r--. 1 scylla-manager scylla 38G Jul 5 15:14 md-59373-big-Data.db
42G -rw-r--r--. 1 scylla-manager scylla 42G Jul 5 15:15 md-59360-big-Data.db
42G -rw-r--r--. 1 scylla-manager scylla 42G Jul 5 15:16 md-59364-big-Data.db
35G -rw-r--r--. 1 scylla-manager scylla 35G Jul 5 15:23 md-59793-big-Data.db
45G -rw-r--r--. 1 scylla-manager scylla 45G Jul 5 15:26 md-59698-big-Data.db
47G -rw-r--r--. 1 scylla-manager scylla 47G Jul 5 15:27 md-59807-big-Data.db
48G -rw-r--r--. 1 scylla-manager scylla 48G Jul 5 15:47 md-61102-big-Data.db
3.1G -rw-r--r--. 1 scylla-manager scylla 3.1G Jul 5 15:51 md-62587-big-Data.db
2.9G -rw-r--r--. 1 scylla-manager scylla 2.9G Jul 5 15:55 md-62773-big-Data.db
3.4G -rw-r--r--. 1 scylla-manager scylla 3.4G Jul 5 16:01 md-62420-big-Data.db
2.8G -rw-r--r--. 1 scylla-manager scylla 2.8G Jul 5 16:01 md-62685-big-Data.db
3.3G -rw-r--r--. 1 scylla-manager scylla 3.3G Jul 5 16:01 md-62624-big-Data.db
2.8G -rw-r--r--. 1 scylla-manager scylla 2.8G Jul 5 16:01 md-62513-big-Data.db
39G -rw-r--r--. 1 scylla-manager scylla 39G Jul 5 16:12 md-63174-big-Data.db
39G -rw-r--r--. 1 scylla-manager scylla 39G Jul 5 16:13 md-63098-big-Data.db
40G -rw-r--r--. 1 scylla-manager scylla 40G Jul 5 16:14 md-63275-big-Data.db
41G -rw-r--r--. 1 scylla-manager scylla 41G Jul 5 16:16 md-63333-big-Data.db
36G -rw-r--r--. 1 scylla-manager scylla 36G Jul 5 16:26 md-63517-big-Data.db
35G -rw-r--r--. 1 scylla-manager scylla 35G Jul 5 16:28 md-63425-big-Data.db
43G -rw-r--r--. 1 scylla-manager scylla 43G Jul 5 16:36 md-63700-big-Data.db
43G -rw-r--r--. 1 scylla-manager scylla 43G Jul 5 16:37 md-64096-big-Data.db
42G -rw-r--r--. 1 scylla-manager scylla 42G Jul 5 16:38 md-63650-big-Data.db
38G -rw-r--r--. 1 scylla-manager scylla 38G Jul 5 16:38 md-63823-big-Data.db
2.8G -rw-r--r--. 1 scylla-manager scylla 2.8G Jul 5 16:47 md-65475-big-Data.db
42G -rw-r--r--. 1 scylla-manager scylla 42G Jul 5 16:53 md-64622-big-Data.db
52G -rw-r--r--. 1 scylla-manager scylla 52G Jul 5 16:57 md-65468-big-Data.db
52G -rw-r--r--. 1 scylla-manager scylla 52G Jul 5 16:59 md-65491-big-Data.db
49G -rw-r--r--. 1 scylla-manager scylla 49G Jul 5 17:29 md-67383-big-Data.db
.
.
.
If compaction was keeping up, this should not happen.
Maybe those are sstables generated from streaming that need to be split up later; still, they seem too large.
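For reference, a listing like the one above can be produced on a node with something along these lines (a sketch; the glob over the default Scylla data directory is an assumption, narrow it to the affected keyspace/table directory as needed):

$ sudo du -h /var/lib/scylla/data/*/*/md-*-Data.db 2>/dev/null | sort -h | tail -n 30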
@roydahan is this a regression? I expect 4.4 would behave the same.
I don't think it's a regression. However, it may happen only (or more commonly) in GCE, where it was revealed as part of 4.5 tests that weren't part of 4.4 testing.
@raphaelsc is this supposed to be fixed now with the new computation of the backlog controller?
pushing this out to 5.1
@slivne
Per the comment at https://github.com/scylladb/scylla/issues/8908#issuecomment-867401135, rebuild ran 6x on the same node.
Asias wrote above: "A wild guess is that rebuild failed in the middle and it performed a retry of the rebuild automatically. Without RBNO, rebuild will restream all the data again. Then the node runs out of disk."
LCS is used. It cannot keep up with the write rate due to its high write amplification. Under heavy write load, LCS will mostly be compacting L0, because L0 receives new files very fast, which makes it hard to deduplicate data in the higher levels. The fact that you saw large files when listing the directory backs up this claim.
There's still a pending change for the LCS controller [1]. The changes to the controller done so far will only help with LCS's L0, which uses the tiered backlog calculation. [1]: It's actually my next task.
Until recently, LCS assumed everything should be pushed to the last level, akin to a major compaction, which again is unrealistic. So this patch of mine can help a lot: https://github.com/scylladb/scylla/commit/9de7abdc80721c14663fc698c7132a0dce878c18 (present in the 5.0 branch).
Also this one: https://github.com/scylladb/scylla/commit/ab0217e30e7df2da6d81f7a82b89a36bcc007133 (available in 4.6.0). It guarantees higher efficiency under heavy write load, which is what we have with those inefficient rebuilds restreaming all the data without RBNO.
And what will really improve this is off-strategy compaction + RBNO. Off-strategy compaction is enabled for streaming-based bootstrap and replace. I think it can potentially be enabled for streaming-based rebuild too, which is good for increasing compaction efficiency and, as a result, faster deduplication.
Thoughts?
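Since both LCS and STCS have come up in this thread, it may be worth double-checking which strategy each affected table is actually using before attributing the space amplification; one way to do that from any node (a sketch; keyspace1 is assumed to be the cassandra-stress keyspace used by this test, and cqlsh options such as TLS flags are omitted):

$ cqlsh 10.142.0.157 -e "DESCRIBE KEYSPACE mview" | grep compaction
$ cqlsh 10.142.0.157 -e "DESCRIBE KEYSPACE keyspace1" | grep compaction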
Installation details
Scylla version (or git commit hash): 4.5.rc2-0.20210602.d92a26636
Cluster size: 4 nodes
OS (RHEL/CentOS/Ubuntu/AWS AMI): https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/1192329434759493726
Test: longevity-1tb-5days-gce-test
Test job
Test yaml
Problem: during rebuild in the CorruptThenRebuild nemesis, the target node ran out of space.
Nemesis performed:
Then the target node ran out of space.
And rebuild failed:
Target node IP is 10.142.0.100.
Test data on all nodes before the CorruptThenRebuild nemesis is 1 TB.
Overview dashboard for this time period. Scylla per server metrics nemesis dashboard for this period.
During the CorruptThenRebuild nemesis there was no free space on the target node. Overview dashboard for this time period. Scylla per server metrics nemesis dashboard for this period.
Unfortunately the logs were not saved and the node with the problem was terminated. We have only the monitor. I am starting the reproducer now.