scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

During rebuild the node ran out of space #8908

Open juliayakovlev opened 3 years ago

juliayakovlev commented 3 years ago

Installation details
Scylla version (or git commit hash): 4.5.rc2-0.20210602.d92a26636
Cluster size: 4 nodes
OS (RHEL/CentOS/Ubuntu/AWS AMI): https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/1192329434759493726

Test: longevity-1tb-5days-gce-test
Test job
Test yaml

Problem: During rebuild in the CorruptThenRebuild nemesis, the target node ran out of space.

[2021-06-16T12:34:27.861Z] < t:2021-06-16 12:34:26,567 f:file_logger.py  l:80   c:sdcm.sct_events.file_logger p:INFO  > 2021-06-16 12:34:26.556: (DisruptionEvent Severity.NORMAL): type=CorruptThenRebuild subtype=start target_node=Node longevity-tls-1tb-7d-4-5-db-node-d14cf845-0-4 [35.185.46.234 | 10.142.0.100] (seed: False) duration=None

Nemesis performed (a rough code sketch of these steps follows the list):

  1. Stop Scylla server
  2. Destroy a few data files
  3. Start Scylla server
    [2021-06-16T12:36:37.698Z] < t:2021-06-16 12:36:37,181 f:file_logger.py  l:80   c:sdcm.sct_events.file_logger p:INFO  > 2021-06-16 12:36:31.000: (DatabaseLogEvent Severity.NORMAL): type=BOOT regex=Starting Scylla Server line_number=584060 node=Node longevity-tls-1tb-7d-4-5-db-node-d14cf845-0-4 [35.185.46.234 | 10.142.0.100] (seed: False)
    [2021-06-16T12:36:37.698Z] < t:2021-06-16 12:36:37,181 f:file_logger.py  l:80   c:sdcm.sct_events.file_logger p:INFO  > 2021-06-16T12:36:31+00:00  longevity-tls-1tb-7d-4-5-db-node-d14cf845-0-4 !INFO    | systemd[1]: Starting Scylla Server...
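For reference, the flow of those steps plus the rebuild that follows is roughly the sketch below (a minimal illustration; node and node.remoter are hypothetical stand-ins, not the actual sdcm nemesis code):

# Minimal sketch of the CorruptThenRebuild flow; `node` and `node.remoter`
# are hypothetical stand-ins, not the actual sdcm nemesis implementation.
import random

def corrupt_then_rebuild(node, files_to_destroy=5):
    node.stop_scylla()                                   # 1. stop the Scylla server
    sstables = node.remoter.run(
        "find /var/lib/scylla/data -name 'md-*-big-Data.db'").stdout.splitlines()
    for path in random.sample(sstables, min(files_to_destroy, len(sstables))):
        node.remoter.run(f"sudo rm -f {path}")           # 2. destroy a few data files
    node.start_scylla()                                  # 3. start the Scylla server
    # the destroyed data is then streamed back from the other replicas
    node.remoter.run("nodetool -u cassandra -pw cassandra rebuild")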

Then the target node ran out of space.

[2021-06-16T13:22:35.454Z] < t:2021-06-16 13:22:33,834 f:file_logger.py  l:80   c:sdcm.sct_events.file_logger p:INFO  > 2021-06-16 13:22:23.000: (DatabaseLogEvent Severity.ERROR): type=NO_SPACE_ERROR regex=No space left on device line_number=590316 node=Node longevity-tls-1tb-7d-4-5-db-node-d14cf845-0-4 [35.185.46.234 | 10.142.0.100] (seed: False)
[2021-06-16T13:22:35.454Z] < t:2021-06-16 13:22:33,834 f:file_logger.py  l:80   c:sdcm.sct_events.file_logger p:INFO  > 2021-06-16T13:22:23+00:00  longevity-tls-1tb-7d-4-5-db-node-d14cf845-0-4 !WARNING | scylla[257344]: [shard 0] storage_service - Disk error: std::system_error (error system:28, No space left on device)

And rebuild failed:

[2021-06-16T13:35:44.603Z] < t:2021-06-16 13:35:44,172 f:cluster.py      l:2773 c:sdcm.cluster_gce     p:CRITICAL > Node longevity-tls-1tb-7d-4-5-db-node-d14cf845-0-4 [35.185.46.234 | 10.142.0.100] (seed: False): Command '/usr/bin/nodetool -u cassandra -pw cassandra  rebuild ' error: Encountered a bad command exit code!
[2021-06-16T13:35:44.603Z] < t:2021-06-16 13:35:44,172 f:cluster.py      l:2773 c:sdcm.cluster_gce     p:CRITICAL > 
[2021-06-16T13:35:44.603Z] < t:2021-06-16 13:35:44,172 f:cluster.py      l:2773 c:sdcm.cluster_gce     p:CRITICAL > Command: '/usr/bin/nodetool -u cassandra -pw cassandra  rebuild '
[2021-06-16T13:35:44.603Z] < t:2021-06-16 13:35:44,172 f:cluster.py      l:2773 c:sdcm.cluster_gce     p:CRITICAL > 
[2021-06-16T13:35:44.603Z] < t:2021-06-16 13:35:44,172 f:cluster.py      l:2773 c:sdcm.cluster_gce     p:CRITICAL > Exit code: 1
[2021-06-16T13:35:44.603Z] < t:2021-06-16 13:35:44,172 f:cluster.py      l:2773 c:sdcm.cluster_gce     p:CRITICAL > 
[2021-06-16T13:35:44.603Z] < t:2021-06-16 13:35:44,172 f:cluster.py      l:2773 c:sdcm.cluster_gce     p:CRITICAL > Stdout:
[2021-06-16T13:35:44.603Z] < t:2021-06-16 13:35:44,172 f:cluster.py      l:2773 c:sdcm.cluster_gce     p:CRITICAL > 
[2021-06-16T13:35:44.603Z] < t:2021-06-16 13:35:44,172 f:cluster.py      l:2773 c:sdcm.cluster_gce     p:CRITICAL > nodetool: Scylla API server HTTP POST to URL '/storage_service/rebuild' failed: streaming::stream_exception (Stream failed)
[2021-06-16T13:35:44.603Z] < t:2021-06-16 13:35:44,172 f:cluster.py      l:2773 c:sdcm.cluster_gce     p:CRITICAL > See 'nodetool help' or 'nodetool help <command>'.
[2021-06-16T13:35:44.603Z] < t:2021-06-16 13:35:44,172 f:cluster.py      l:2773 c:sdcm.cluster_gce     p:CRITICAL > 
[2021-06-16T13:35:44.603Z] < t:2021-06-16 13:35:44,172 f:cluster.py      l:2773 c:sdcm.cluster_gce     p:CRITICAL > Stderr:
[2021-06-16T13:35:44.603Z] < t:2021-06-16 13:35:44,172 f:cluster.py      l:2773 c:sdcm.cluster_gce     p:CRITICAL > 
[2021-06-16T13:35:44.603Z] < t:2021-06-16 13:35:44,172 f:cluster.py      l:2773 c:sdcm.cluster_gce     p:CRITICAL > 

Target node IP is 10.142.0.100. Test data on all nodes before the CorruptThenRebuild nemesis is 1 TB.
Overview dashboard for this time period
Scylla per server metrics nemesis dashboard for this period
Screenshot from 2021-06-22 17-57-20

During the CorruptThenRebuild nemesis there was no free space on the target node.
Overview dashboard for this time period
Scylla per server metrics nemesis dashboard for this period
Screenshot from 2021-06-22 18-17-36

Unfortunately the logs were not saved and the node with the problem was terminated. We have the monitor only. I am starting a reproducer now.

raphaelsc commented 3 years ago

@juliayakovlev what's the Scylla version? What's the compaction strategy?

raphaelsc commented 3 years ago

please send the log of 10.142.0.100

raphaelsc commented 3 years ago

The image below shows that rebuild caused that node to use 2x more space. We need to figure out whether that's a regression in rebuild.

jun_22_8908

Looks like this test never passed. Is it a regression?

juliayakovlev commented 3 years ago

@juliayakovlev what's the Scylla version? What's the compaction strategy?

Compaction strategy:

Scylla version (or git commit hash): 4.5.rc2-0.20210602.d92a26636

please send the log of 10.142.0.100

Unfortunately we have no logs, and the node was terminated. I am trying to reproduce the test now.

juliayakovlev commented 3 years ago

@raphaelsc

The issue has been reproduced. The test is running and the cluster is alive.

The test runs the same configuration, with the CorruptThenRebuild nemesis every 10 minutes on the same node.

Job

scylla-per-server-metrics-nemesis

Screenshot from 2021-06-23 08-14-10

os-metrics

Screenshot from 2021-06-23 08-14-24

The node where rebuild runs has IP 35.231.182.209.

asias commented 3 years ago

@juliayakovlev Nice, you managed to reproduce it. Could you share the logs of 35.231.182.209 before the job finishes? Did we run such a test against an older version, e.g., Scylla 4.4?

@raphaelsc There has been no change in the areas related to rebuild operations for a while. I checked commits from 4.4.0 to 4.5-rc2.

A wild guess is that rebuild failed in the middle and retried automatically. Without RBNO, rebuild will restream all the data again, and then the node runs out of disk.

juliayakovlev commented 3 years ago

Logs: 35.231.182.209.zip

This test was run on 4.4 and never passed, but those were different problems: nemeses were not started at all during those tests.

roydahan commented 3 years ago

I just noticed that these instances in GCE are using only 6 local SSD drives (each disk is 375 GB, 2.25 TB in total). I don't think that's enough for the 1 TB dataset, even though we use LCS.
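A back-of-the-envelope check of that headroom (the 2x peak factor for a rebuild without RBNO is an assumption for illustration, not a measured number):

# Back-of-the-envelope disk headroom during a rebuild; the peak factor is
# an assumed illustration, not a measured value.
disks, disk_gb = 6, 375
raw_capacity_gb = disks * disk_gb      # 2250 GB of local SSD per node
resident_data_gb = 1000                # ~1 TB of test data per node
rebuild_peak_factor = 2.0              # existing data + a full restream before compaction dedups
peak_usage_gb = resident_data_gb * rebuild_peak_factor   # ~2000 GB
headroom_gb = raw_capacity_gb - peak_usage_gb            # ~250 GB left for compaction
print(peak_usage_gb, headroom_gb)                        # overhead, commitlog, etc.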

roydahan commented 3 years ago

@juliayakovlev can you please retest with this change: https://github.com/scylladb/scylla-cluster-tests/pull/3702

juliayakovlev commented 3 years ago

@juliayakovlev can you please retest with this change: scylladb/scylla-cluster-tests#3702

Running with gce_n_local_ssd_disk_db: 16: https://jenkins.scylladb.com/job/scylla-4.5/job/Reproducers/job/longevity-1tb-5days-gce-test-8908/4/console

juliayakovlev commented 3 years ago

@raphaelsc @asias

Test with gce_n_local_ssd_disk_db: 16 finished.

Installation details
Kernel version: 5.12.9-1.el8.elrepo.x86_64
Scylla version (or git commit hash): 4.5.rc2-0.20210602.d92a26636
Cluster size: 4 nodes (n1-highmem-16)
Scylla running with shards number (live nodes):
longevity-tls-1tb-7d-reproduc-db-node-3330ad3e-0-1 (35.243.202.204 | 10.142.0.36): 16 shards
longevity-tls-1tb-7d-reproduc-db-node-3330ad3e-0-2 (34.73.244.18 | 10.142.0.63): 16 shards
longevity-tls-1tb-7d-reproduc-db-node-3330ad3e-0-3 (35.237.51.105 | 10.142.0.157): 16 shards
longevity-tls-1tb-7d-reproduc-db-node-3330ad3e-0-4 (35.196.47.81 | 10.142.0.166): 16 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): - (gce: eu-west-1)

Test: longevity-1tb-5days-gce-test-8908
Test name: longevity_test.LongevityTest.test_custom_time
Test config file(s): includes cassandra-stress commands

Rebuild was run 6 times on the same node (longevity-tls-1tb-7d-reproduc-db-node-3330ad3e-0-3 (35.237.51.105 | 10.142.0.157)).
Scylla per server metrics nemesis dashboard
Screenshot from 2021-06-24 09-59-19

Free space on this node before the first rebuild was 5 TB (the same as on all other nodes).
OS metrics dashboard
Screenshot from 2021-06-24 09-53-58

With each rebuild, free space kept decreasing.
OS metrics dashboard
Screenshot from 2021-06-24 09-56-24

After the 6th rebuild, 301 GB remained.
OS metrics dashboard
Screenshot from 2021-06-24 09-57-14

Scylla is active on node 3, but its status in the cluster is DN:

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens       Owns    Host ID                               Rack
UN  10.142.0.63   1000.43 GB  256          ?       aa263713-b3ab-45c4-948d-fae378628326  d
DN  10.142.0.157  3.54 TB    256          ?       584b1bd2-e75e-4420-8b49-692cf50f95eb  d
UN  10.142.0.166  809.97 GB  256          ?       fe47b385-e6d5-40ab-a7cb-553fae2d1639  d
UN  10.142.0.36   958.52 GB  256          ?       e67cc7bc-c461-4269-8652-244cfa95bfa6  d

Rebuild failed on mview keyspace (default compaction strategy):

Jun 24 04:31:02 longevity-tls-1tb-7d-reproduc-db-node-3330ad3e-0-3 scylla[245890]:  [shard 0] range_streamer - Rebuild with 10.142.0.166 for keyspace=mview, streaming [200, 225) out of 258 ranges
Jun 24 04:31:24 longevity-tls-1tb-7d-reproduc-db-node-3330ad3e-0-3 scylla[245890]:  [shard 0] range_streamer - Rebuild with 10.142.0.36 for keyspace=mview failed, took 464.576 seconds: streaming::stream_exception (Stream failed)
Jun 24 04:31:30 longevity-tls-1tb-7d-reproduc-db-node-3330ad3e-0-3 scylla[245890]:  [shard 0] range_streamer - Rebuild with 10.142.0.63 for keyspace=mview failed, took 470.355 seconds: streaming::stream_exception (Stream failed)
Jun 24 04:31:30 longevity-tls-1tb-7d-reproduc-db-node-3330ad3e-0-3 scylla[245890]:  [shard 0] range_streamer - Rebuild with 10.142.0.166 for keyspace=mview failed, took 470.366 seconds: streaming::stream_exception (Stream failed)
Jun 24 04:31:30 longevity-tls-1tb-7d-reproduc-db-node-3330ad3e-0-3 scylla[245890]:  [shard 0] range_streamer - Rebuild failed, took 5852 seconds, nr_ranges_remaining=172
Jun 24 04:31:30 longevity-tls-1tb-7d-reproduc-db-node-3330ad3e-0-3 scylla[245890]:  [shard 0] range_streamer - Rebuild failed to stream. Will retry in 60 seconds ...

Cluster nodes and monitor are alive

Restore Monitor Stack command: $ hydra investigate show-monitor 3330ad3e-4ac8-4b30-b537-f49a3bb76f94
Show all stored logs command: $ hydra investigate show-logs 3330ad3e-4ac8-4b30-b537-f49a3bb76f94

Test id: 3330ad3e-4ac8-4b30-b537-f49a3bb76f94

Logs: db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/3330ad3e-4ac8-4b30-b537-f49a3bb76f94/20210624_050242/db-cluster-3330ad3e.zip

sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/3330ad3e-4ac8-4b30-b537-f49a3bb76f94/20210624_050242/sct-runner-3330ad3e.zip

Jenkins job URL

roydahan commented 3 years ago
Screen Shot 2021-06-24 at 12 13 16

From looking at the nemesis code, we don't run "nodetool cleanup" after rebuild. I don't know whether we should; I couldn't find any reference to it in our documentation.

@slivne do we need to run cleanup after every rebuild?

asias commented 3 years ago

We do not need cleanup for rebuild at all. The problem is that without RBNO, rebuild brings all the data in again for each rebuild operation. If compaction is not able to keep up and reduce the duplication, the node runs out of disk space.
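A toy model of that failure mode, assuming each retry restreams the full ~1 TB and compaction removes only part of the accumulated duplicates between attempts (both numbers are illustrative):

# Toy model: each failed rebuild restreams the full data set again; compaction
# removes only a fraction of the accumulated duplicates between attempts.
data_gb = 1000          # data restreamed per rebuild attempt
capacity_gb = 2250      # 6 x 375 GB local SSD per node
dedup_fraction = 0.3    # share of duplicates compaction manages to remove (illustrative)

on_disk = data_gb       # the original copy
for attempt in range(1, 7):
    on_disk += data_gb                               # restream everything again
    on_disk -= (on_disk - data_gb) * dedup_fraction  # partial deduplication
    state = "OUT OF SPACE" if on_disk > capacity_gb else "ok"
    print(f"after attempt {attempt}: ~{on_disk:.0f} GB on disk ({state})")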

slivne commented 3 years ago

I tried to download the metrics and can't, so I can't really tell whether in this case compaction is not able to keep up with the traffic, or whether we have a case in which a view uses STCS and gets more and more sstables with the exact same data, reaching the space amplification that STCS can hit, which is x7-x12 ...

So I restarted another run to try and see if I can understand what's going on: https://jenkins.scylladb.com/job/scylla-4.5/job/Reproducers/job/longevity-1tb-5days-gce-test-8908/5/

roydahan commented 3 years ago

Shlomi, in this case it’s LCS. Why weren’t you able to download the metrics? Did you try the hydra show monitor?


slivne commented 3 years ago

Not sure why the metrics did not download ...

In any case, the next run has metrics that can be used.

backlog

This is the backlog; it is increasing.

If compaction were keeping up, the backlog would be decreasing, and it's not.

What is happening here is that, for some reason, LCS is not able to reduce the space while the rebuild is being done.
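The backlog observation reduces to a simple balance: the backlog only shrinks when compaction bandwidth exceeds the streaming ingest rate. A rough model (the rates are illustrative, and this is not Scylla's actual backlog formula):

# Simple balance model for the compaction backlog (illustrative rates,
# not Scylla's actual backlog formula): the backlog grows whenever
# streaming ingest outpaces compaction bandwidth.
ingest_mb_s = 300        # data arriving from rebuild streaming
compaction_mb_s = 200    # effective compaction bandwidth on the node

backlog_gb = 0.0
for hour in range(1, 7):
    backlog_gb += (ingest_mb_s - compaction_mb_s) * 3600 / 1024
    print(f"hour {hour}: backlog ~{backlog_gb:.0f} GB and growing")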

slivne commented 3 years ago

And here is another proof that we have huge sstables:

6.9G -rw-r--r--. 1 scylla-manager scylla 6.9G Jul  5 12:24 md-45894-big-Data.db
6.9G -rw-r--r--. 1 scylla-manager scylla 6.9G Jul  5 12:24 md-45672-big-Data.db
2.8G -rw-r--r--. 1 scylla-manager scylla 2.8G Jul  5 13:02 md-50219-big-Data.db
5.5G -rw-r--r--. 1 scylla-manager scylla 5.5G Jul  5 13:02 md-50221-big-Data.db
 52G -rw-r--r--. 1 scylla-manager scylla  52G Jul  5 13:16 md-50235-big-Data.db
 52G -rw-r--r--. 1 scylla-manager scylla  52G Jul  5 13:17 md-50232-big-Data.db
 53G -rw-r--r--. 1 scylla-manager scylla  53G Jul  5 13:18 md-50226-big-Data.db
 52G -rw-r--r--. 1 scylla-manager scylla  52G Jul  5 13:19 md-50234-big-Data.db
 48G -rw-r--r--. 1 scylla-manager scylla  48G Jul  5 13:20 md-50566-big-Data.db
 45G -rw-r--r--. 1 scylla-manager scylla  45G Jul  5 13:20 md-50575-big-Data.db
3.1G -rw-r--r--. 1 scylla-manager scylla 3.1G Jul  5 13:20 md-51447-big-Data.db
2.7G -rw-r--r--. 1 scylla-manager scylla 2.7G Jul  5 13:24 md-51713-big-Data.db
 39G -rw-r--r--. 1 scylla-manager scylla  39G Jul  5 13:47 md-52637-big-Data.db
 47G -rw-r--r--. 1 scylla-manager scylla  47G Jul  5 13:54 md-53063-big-Data.db
 49G -rw-r--r--. 1 scylla-manager scylla  49G Jul  5 13:58 md-53233-big-Data.db
3.6G -rw-r--r--. 1 scylla-manager scylla 3.6G Jul  5 14:14 md-55957-big-Data.db
 43G -rw-r--r--. 1 scylla-manager scylla  43G Jul  5 14:29 md-54918-big-Data.db
 44G -rw-r--r--. 1 scylla-manager scylla  44G Jul  5 14:30 md-54907-big-Data.db
 47G -rw-r--r--. 1 scylla-manager scylla  47G Jul  5 14:31 md-56277-big-Data.db
 44G -rw-r--r--. 1 scylla-manager scylla  44G Jul  5 14:31 md-54666-big-Data.db
 46G -rw-r--r--. 1 scylla-manager scylla  46G Jul  5 14:31 md-54952-big-Data.db
 36G -rw-r--r--. 1 scylla-manager scylla  36G Jul  5 14:54 md-56599-big-Data.db
2.8G -rw-r--r--. 1 scylla-manager scylla 2.8G Jul  5 15:05 md-59362-big-Data.db
4.2G -rw-r--r--. 1 scylla-manager scylla 4.2G Jul  5 15:05 md-59375-big-Data.db
 38G -rw-r--r--. 1 scylla-manager scylla  38G Jul  5 15:14 md-59373-big-Data.db
 42G -rw-r--r--. 1 scylla-manager scylla  42G Jul  5 15:15 md-59360-big-Data.db
 42G -rw-r--r--. 1 scylla-manager scylla  42G Jul  5 15:16 md-59364-big-Data.db
 35G -rw-r--r--. 1 scylla-manager scylla  35G Jul  5 15:23 md-59793-big-Data.db
 45G -rw-r--r--. 1 scylla-manager scylla  45G Jul  5 15:26 md-59698-big-Data.db
 47G -rw-r--r--. 1 scylla-manager scylla  47G Jul  5 15:27 md-59807-big-Data.db
 48G -rw-r--r--. 1 scylla-manager scylla  48G Jul  5 15:47 md-61102-big-Data.db
3.1G -rw-r--r--. 1 scylla-manager scylla 3.1G Jul  5 15:51 md-62587-big-Data.db
2.9G -rw-r--r--. 1 scylla-manager scylla 2.9G Jul  5 15:55 md-62773-big-Data.db
3.4G -rw-r--r--. 1 scylla-manager scylla 3.4G Jul  5 16:01 md-62420-big-Data.db
2.8G -rw-r--r--. 1 scylla-manager scylla 2.8G Jul  5 16:01 md-62685-big-Data.db
3.3G -rw-r--r--. 1 scylla-manager scylla 3.3G Jul  5 16:01 md-62624-big-Data.db
2.8G -rw-r--r--. 1 scylla-manager scylla 2.8G Jul  5 16:01 md-62513-big-Data.db
 39G -rw-r--r--. 1 scylla-manager scylla  39G Jul  5 16:12 md-63174-big-Data.db
 39G -rw-r--r--. 1 scylla-manager scylla  39G Jul  5 16:13 md-63098-big-Data.db
 40G -rw-r--r--. 1 scylla-manager scylla  40G Jul  5 16:14 md-63275-big-Data.db
 41G -rw-r--r--. 1 scylla-manager scylla  41G Jul  5 16:16 md-63333-big-Data.db
 36G -rw-r--r--. 1 scylla-manager scylla  36G Jul  5 16:26 md-63517-big-Data.db
 35G -rw-r--r--. 1 scylla-manager scylla  35G Jul  5 16:28 md-63425-big-Data.db
 43G -rw-r--r--. 1 scylla-manager scylla  43G Jul  5 16:36 md-63700-big-Data.db
 43G -rw-r--r--. 1 scylla-manager scylla  43G Jul  5 16:37 md-64096-big-Data.db
 42G -rw-r--r--. 1 scylla-manager scylla  42G Jul  5 16:38 md-63650-big-Data.db
 38G -rw-r--r--. 1 scylla-manager scylla  38G Jul  5 16:38 md-63823-big-Data.db
2.8G -rw-r--r--. 1 scylla-manager scylla 2.8G Jul  5 16:47 md-65475-big-Data.db
 42G -rw-r--r--. 1 scylla-manager scylla  42G Jul  5 16:53 md-64622-big-Data.db
 52G -rw-r--r--. 1 scylla-manager scylla  52G Jul  5 16:57 md-65468-big-Data.db
 52G -rw-r--r--. 1 scylla-manager scylla  52G Jul  5 16:59 md-65491-big-Data.db
 49G -rw-r--r--. 1 scylla-manager scylla  49G Jul  5 17:29 md-67383-big-Data.db

.
.
.

If compaction were keeping up, this should not happen.

Maybe those are sstables generated from streaming that need to be broken up later, but they still seem too large.

@roydahan is this a regression? I expect 4.4 would behave the same.

roydahan commented 3 years ago

I don't think it's a regression. However, it may happen only (or more commonly) in GCE, where it was revealed as part of the 4.5 tests that weren't part of the 4.4 testing.

slivne commented 2 years ago

@raphaelsc is this supposed to be fixed now with the new computation of the backlog controller ?

pushing this out to 5.1

raphaelsc commented 2 years ago

@raphaelsc is this supposed to be fixed now with the new computation of the backlog controller ?

pushing this out to 5.1

@slivne

Per the comment at https://github.com/scylladb/scylla/issues/8908#issuecomment-867401135, rebuild ran 6x on the same node.

Asias wrote above: "A wild guess is that rebuild failed in the middle and retried automatically. Without RBNO, rebuild will restream all the data again, and then the node runs out of disk."

LCS is used. It cannot keep up with the write rate due to its high write amplification. Under heavy write load, LCS will mostly be compacting L0, because L0 receives new files very fast, which makes it hard to deduplicate data in the higher levels. The fact that you saw large files when listing files backs up this claim.
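As a rough illustration of the write-amplification argument (the fan-out and sstable size below are the usual LCS defaults; the formula is a rule-of-thumb estimate, not a Scylla measurement):

# Rough LCS write-amplification estimate (rule of thumb, not a Scylla measurement):
# pushing a byte down one level rewrites roughly `fanout` bytes of the destination level.
fanout = 10                  # LCS default fan-out between levels
sstable_gb = 160 / 1024      # LCS default sstable size (160 MB)
data_gb = 1000               # ~1 TB resident on the node

levels = 1
while sstable_gb * fanout ** levels < data_gb:
    levels += 1              # levels needed to hold ~1 TB -> 4

write_amp = fanout * levels  # every GB streamed in costs ~40 GB of compaction writes
print(levels, write_amp)

With rebuild restreaming the full data set repeatedly, this is roughly the extra compaction work per streamed GB, which helps explain why compaction falls behind and duplicates accumulate.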

There's still a pending change for the LCS controller[1]. The changes to the controller done so far only help with LCS's L0, which uses the tiered backlog calculation. [1]: It's actually my next task.

Until recently, LCS assumed everything should be pushed to the last level, akin to a major compaction, but again, that's unrealistic. So this patch of mine can help a lot: https://github.com/scylladb/scylla/commit/9de7abdc80721c14663fc698c7132a0dce878c18 (present in the 5.0 branch).

Also, this one: https://github.com/scylladb/scylla/commit/ab0217e30e7df2da6d81f7a82b89a36bcc007133 (available in 4.6.0). It guarantees higher efficiency under heavy write load, which is what we have with those inefficient rebuilds restreaming all the data without RBNO.

And what will really improve this is off-strategy compaction + RBNO. Off-strategy compaction is enabled for streaming-based bootstrap and replace. I think it can potentially be enabled for streaming-based rebuild too, which is good for increasing compaction efficiency and, as a result, faster deduplication.

Thoughts?