scylladb / scylla-manager

Scylla operator / manager backup restore is very slow #3976

Open · lukaszsurfer opened this issue 3 months ago

lukaszsurfer commented 3 months ago

Hello!

On the following setup, a latest GKE Standard cluster, we have 5 nodes, each with 4 NVMe Local SSDs:

sctool status
Cluster: scylla/scylla-cluster (45d3e040-c349-4e86-8893-e2aba7a037c6)
Datacenter: europe-west4
+----+----------+----------+--------------+-----------+------+---------+--------+-------+--------------------------------------+
|    | CQL      | REST     | Address      | Uptime    | CPUs | Memory  | Scylla | Agent | Host ID                              |
+----+----------+----------+--------------+-----------+------+---------+--------+-------+--------------------------------------+
| UN | UP (3ms) | UP (2ms) | 10.229.0.177 | 48h18m0s  | 4    | 31.336G | 6.0.2  | 3.3.0 | 436983f8-e365-4eb6-95ff-4a2c64dba3f7 |
| UN | UP (1ms) | UP (3ms) | 10.229.1.81  | 48h17m29s | 4    | 31.336G | 6.0.2  | 3.3.0 | 78228069-d4e4-4752-88cd-e357152d2371 |
| UN | UP (2ms) | UP (1ms) | 10.229.2.210 | 48h17m41s | 4    | 31.336G | 6.0.2  | 3.3.0 | 27e53089-193f-45b8-b20d-bbe789468a2b |
| UN | UP (2ms) | UP (0ms) | 10.229.3.39  | 48h18m1s  | 4    | 31.336G | 6.0.2  | 3.3.0 | a26b6771-6673-4686-a428-af4d5613f440 |
| UN | UP (4ms) | UP (1ms) | 10.229.3.79  | 48h17m56s | 4    | 31.336G | 6.0.2  | 3.3.0 | b1ca3b78-0007-48bc-b952-adff7d1920f8 |
+----+----------+----------+--------------+-----------+------+---------+--------+-------+--------------------------------------+

A backup of ~120 GB of data was taken to a Google Cloud Storage bucket in the same region, with rate-limit set to 0:

sctool info -c scylla/scylla-cluster backup/3647ccc3-d024-4158-a8b5-c3db974ec001
Name:   backup/3647ccc3-d024-4158-a8b5-c3db974ec001
Cron:   {"spec":"","start_date":"0001-01-01T00:00:00Z"} (no activations scheduled)
Tz:     UTC
Retry:  3 (initial backoff 10m)

Properties:
- location: 'gcs:scylla-backup-staging'
- rate-limit: '0'

+--------------------------------------+------------------------+----------+---------+
| ID                                   | Start time             | Duration | Status  |
+--------------------------------------+------------------------+----------+---------+
| 9e30d8d2-5bc1-11ef-8d27-a284ce8fc832 | 16 Aug 24 11:21:01 UTC | 17s      | RUNNING |
+--------------------------------------+------------------------+----------+---------+

and the backup completed in under 7 minutes:

sctool progress -c scylla/scylla-cluster backup/3647ccc3-d024-4158-a8b5-c3db974ec001
Run:            9e30d8d2-5bc1-11ef-8d27-a284ce8fc832
Status:         DONE
Start time:     16 Aug 24 11:21:01 UTC
End time:       16 Aug 24 11:27:57 UTC
Duration:       6m56s
Progress:       100%
Snapshot Tag:   sm_20240816112101UTC
Datacenters:
  - europe-west4

+--------------+----------+---------+---------+--------------+--------+
| Host         | Progress |    Size | Success | Deduplicated | Failed |
+--------------+----------+---------+---------+--------------+--------+
| 10.229.0.177 |     100% | 24.295G | 24.295G |            0 |      0 |
| 10.229.1.81  |     100% | 25.892G | 25.892G |            0 |      0 |
| 10.229.2.210 |     100% | 22.922G | 22.922G |            0 |      0 |
| 10.229.3.39  |     100% | 21.661G | 21.661G |            0 |      0 |
| 10.229.3.79  |     100% | 25.201G | 25.201G |            0 |      0 |
+--------------+----------+---------+---------+--------------+--------+
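
For context, a one-off backup task with the properties above can be scheduled with a single sctool call. A minimal sketch, mirroring the location and rate-limit from the task info (cluster and bucket names as above):

# Sketch: schedule the backup task shown above
sctool backup -c scylla/scylla-cluster \
  --location gcs:scylla-backup-staging \
  --rate-limit 0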

Restoring that backup, however, is very slow:

sctool info restore/df887540-43c0-4112-8791-7dbd0ffbd4a4 -c scylla/scylla-cluster
Name:   restore/df887540-43c0-4112-8791-7dbd0ffbd4a4
Cron:   {"spec":"","start_date":"0001-01-01T00:00:00Z"} (no activations scheduled)
Tz:     UTC
Retry:  3 (initial backoff 10m)

Properties:
- batch-size: 1000
- location: 'gcs:scylla-backup-staging'
- parallel: 1000
- restore-tables: true
- snapshot-tag: sm_20240816112101UTC

+--------------------------------------+------------------------+----------+---------+
| ID                                   | Start time             | Duration | Status  |
+--------------------------------------+------------------------+----------+---------+
| a1ad1c7f-5bc4-11ef-8d2a-a284ce8fc832 | 16 Aug 24 11:42:35 UTC | 1h0m40s  | RUNNING |
+--------------------------------------+------------------------+----------+---------+

As we can see ☝️, batch-size and parallel are both set to 1000, and it looks like it takes around 1 minute to restore 1 GB of data:

sctool progress restore/df887540-43c0-4112-8791-7dbd0ffbd4a4 -c scylla/scylla-cluster
Restore progress
Run:            a1ad1c7f-5bc4-11ef-8d2a-a284ce8fc832
Status:         RUNNING (restoring backed-up data)
Start time:     16 Aug 24 11:42:35 UTC
Duration:       1h16m1s
Progress:       60% | 77%
Snapshot Tag:   sm_20240816112101UTC

+--------------------+-----------+----------+---------+------------+--------+
| Keyspace           |  Progress |     Size | Success | Downloaded | Failed |
+--------------------+-----------+----------+---------+------------+--------+
| dataforseo         | 60% | 77% | 119.969G | 73.108G |    92.401G |      0 |
| system_traces      |      100% |        0 |       0 |          0 |      0 |
| system_distributed |      100% |        0 |       0 |          0 |      0 |
+--------------------+-----------+----------+---------+------------+--------+

At the same time, the average cluster load reported by the Prometheus/Grafana dashboards is only about 25%. Confirmed with a simple kubectl check:

kubectl top pods -A | grep scylla
scylla                        scylla-cluster-europe-west4-europe-west4-a-0              426m         24630Mi         
scylla                        scylla-cluster-europe-west4-europe-west4-b-0              514m         24582Mi         
scylla                        scylla-cluster-europe-west4-europe-west4-b-1              310m         24631Mi         
scylla                        scylla-cluster-europe-west4-europe-west4-c-0              687m         24629Mi         
scylla                        scylla-cluster-europe-west4-europe-west4-c-1              2327m        24643Mi         
scylla-manager                scylla-manager-7fbf77594-4n4nq                            13m          64Mi            
scylla-manager                scylla-manager-cluster-manager-dc-manager-rack-0          12m          206Mi           
scylla-manager                scylla-manager-controller-67d55b97bb-rpv92                1m           13Mi            
scylla-manager                scylla-manager-controller-67d55b97bb-tr9lt                1m           15Mi            
scylla-operator               scylla-operator-5f55589444-2sv5b                          1m           12Mi            
scylla-operator               scylla-operator-5f55589444-5w2kg                          4m           37Mi            
scylla-operator               webhook-server-6dc45f4dfd-gls2z                           2m           15Mi            
scylla-operator               webhook-server-6dc45f4dfd-wcmqm                           1m           15Mi            
scylla-operator-node-tuning   cluster-node-setup-2ngpd                                  1m           17Mi            
scylla-operator-node-tuning   cluster-node-setup-5g6fd                                  1m           18Mi            
scylla-operator-node-tuning   cluster-node-setup-q56k2                                  1m           17Mi            
scylla-operator-node-tuning   cluster-node-setup-vhfxs                                  1m           18Mi            
scylla-operator-node-tuning   cluster-node-setup-z9l68                                  1m           17Mi     

Changing the parallel and batch-size values improves neither the restore speed nor the average cluster load.
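
For reference, one way to re-run the restore with different values is to schedule another task. A minimal sketch (example batch-size/parallel values only; snapshot tag and bucket as above):

# Sketch: schedule a restore run with different tuning values
sctool restore -c scylla/scylla-cluster \
  --location gcs:scylla-backup-staging \
  --snapshot-tag sm_20240816112101UTC \
  --restore-tables \
  --batch-size 100 \
  --parallel 5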

As we aim to store multiple terabytes of data on GKE running Scylla Operator with Local SSDs, the current restore performance is a blocker: at the observed rate of roughly 1 GB per minute, each terabyte would take around 17 hours, so restoring those terabytes would take days 🙅

Any ideas on how to improve the restore speed would be appreciated, especially given that creating the backup is so fast.

mykaul commented 3 months ago

Restoring is obviously different from backing up, and the two cannot be directly compared. We are currently working on improving restore speed, with various changes to the restore strategy, parallelism, and so forth. It would be great if you could share logs, so we can check whether what we already suspect is indeed the issue: some shards get 'large' SSTables to work on while other shards get smaller ones, finish quickly, and then sit idle (as batches are per node, not per shard). From that perspective, a large batch size is not helpful at all. It would also be good to see which table is involved and how it is organized, to confirm whether this imbalance really exists.

Other areas we are looking into are disabling compaction during restore, increasing streaming performance (a small improvement; see https://github.com/scylladb/scylladb/pull/20187), and parallelizing download and restore (but I suspect that is also a small improvement).
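
One rough way to eyeball this imbalance is to compare SSTable counts and sizes for the keyspace across nodes. A sketch, assuming the standard Scylla Operator pod layout (container name "scylla") and the default data directory:

# Per-table SSTable stats on one node (repeat for each Scylla pod)
kubectl -n scylla exec scylla-cluster-europe-west4-europe-west4-a-0 -c scylla -- \
  nodetool tablestats dataforseo

# Raw on-disk size of the keyspace's table directories (default Scylla data path assumed)
kubectl -n scylla exec scylla-cluster-europe-west4-europe-west4-a-0 -c scylla -- \
  sh -c 'du -sh /var/lib/scylla/data/dataforseo/*'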

lukaszsurfer commented 3 months ago

Sure @mykaul, I will share the logs of a subsequent restore execution. Quick question: what batch-size and parallel values would you recommend for running the restore on the setup described above (4 CPUs / 31 GB RAM / 5 nodes)?
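
A minimal sketch of how the logs can be pulled from this deployment (pod names as in the kubectl top output above; deployment and agent container names are assumptions based on the standard Scylla Operator setup):

# Scylla Manager server logs (deployment name inferred from the pod name above)
kubectl -n scylla-manager logs deployment/scylla-manager --since=24h > manager.log

# Scylla Manager Agent logs from one Scylla node (repeat per pod)
kubectl -n scylla logs scylla-cluster-europe-west4-europe-west4-a-0 \
  -c scylla-manager-agent --since=24h > node1.log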

mykaul commented 3 months ago

@Michal-Leszczynski - I believe a batch-size of 100 or so is OK; as for parallel, what is optimal here?

lukaszsurfer commented 3 months ago

Here are the logs of another restore execution: 1h 11m for 49 GB of data:

scylla@scylla-manager-7fbf77594-4xtmk:/$ sctool backup list -c scylla/scylla-cluster --all-clusters --location gcs:scylla-backup-staging
Cluster: 45d3e040-c349-4e86-8893-e2aba7a037c6

backup/9ce3c718-5596-4cda-b02b-415c6dfb728b
Snapshots:
  - sm_20240819100817UTC (49.271G, 5 nodes)
Keyspaces:
  - dataforseo (1 table)
  - system_schema (15 tables)
  - system_traces (5 tables)
  - system_distributed (4 tables)
  - system_distributed_everywhere (1 table)

scylla@scylla-manager-7fbf77594-4xtmk:/$ sctool info restore/1ad04728-8c98-4474-bda0-9ca6f3887d04 -c scylla/scylla-cluster
Name:   restore/1ad04728-8c98-4474-bda0-9ca6f3887d04
Cron:   {"spec":"","start_date":"0001-01-01T00:00:00Z"} (no activations scheduled)
Tz:     UTC
Retry:  3 (initial backoff 10m)

Properties:
- batch-size: 100
- location: 'gcs:scylla-backup-staging'
- parallel: 5
- restore-tables: true
- snapshot-tag: sm_20240819100817UTC

+--------------------------------------+------------------------+----------+--------+
| ID                                   | Start time             | Duration | Status |
+--------------------------------------+------------------------+----------+--------+
| 7c0b4faa-5e41-11ef-a03d-6a134b736877 | 19 Aug 24 15:41:21 UTC | 1h11m43s | DONE   |
+--------------------------------------+------------------------+----------+--------+

The node5 log has much more content than the others:

node1.log node5.log node4.log node3.log node2.log

karol-kokoszka commented 2 months ago

Most likely the restore here was not utilizing all the nodes equally. Before https://github.com/scylladb/scylla-manager/issues/3981, only SSTables from a single node/table were divided into batches. This means that with a high batch-size there may not be enough batches for all nodes at a given moment, so some nodes end up idle during the restore (for example, if all of one source node's SSTables fit into a single large batch, only one restoring node works on them while the others wait). This will be improved in Scylla Manager 3.4, which will include the fix for the issue linked above. The 3.4 release (tracked as 3.3.3 on GitHub) is the current focus for Manager.
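
To see how a snapshot's SSTables are spread across source nodes and tables (which is what gets divided into batches), sctool can list the backed-up files. A sketch using the second snapshot's tag (exact flags may differ slightly between Manager versions):

# List backed-up files per node/table for a given snapshot
sctool backup files -c scylla/scylla-cluster \
  --location gcs:scylla-backup-staging \
  --snapshot-tag sm_20240819100817UTC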