scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/

prepare for the testing with scylla's backup/restore API #3973

Open tchaikov opened 3 months ago

tchaikov commented 3 months ago

We are enhancing ScyllaDB with a native RESTful API to efficiently back up and restore SSTables to and from object storage services. The existing backup process is documented at https://github.com/scylladb/scylla-manager/blob/master/docs/source/backup/index.rst#process, and it is quite similar to the process using the native backup API. The only difference is that the existing implementation uses an agent running on each Scylla instance. This agent uses rclone for upload/download and supports multiple backends, while Scylla's backup/restore APIs only support S3 at the time of writing. Our initial focus is on supporting Amazon S3. Unlike third-party solutions such as rclone, this native implementation offers several advantages:

Now that we've implemented these two APIs

Integration of the new API:

Enhancement of testing coverage:

It's important to note that these efforts extend beyond Scylla Manager. We should also:

This holistic approach will ensure that both scylla-manager and scylladb are fully aligned with the new native backup and restore functionality, providing a robust and efficient solution for users.

As the initial phase, I will

  1. conduct an inventory of existing tests related to backup and restore functionality.
  2. identify the ones that require reworking
  3. draft a preliminary plan

tchaikov commented 3 months ago

@Michal-Leszczynski hi Michal, what do you think? i am working with Pavel on this project and want to help on the testing front. it would be great if we could work together to move this forward.

cc @bhalevy @xemul @regevran

regevran commented 3 months ago

@tchaikov - please contact @pehala for support on collecting the data on existing tests.

tchaikov commented 2 months ago

in the backup/restore process of scylla-manager, each <keyspace, table, snapshot_tag> tuple is mapped to a path that looks like $keyspace/$table/snapshots/$snapshot_tag, located under the specified bucket. under this "directory", it preserves:

it queries an external CQL database to track the backup progress.
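To make the path mapping above concrete, here is a minimal sketch in Python. It only illustrates the $keyspace/$table/snapshots/$snapshot_tag layout described in this comment. The names `SnapshotKey`, `snapshot_prefix` and `object_key` are hypothetical and are not taken from scylla-manager or scylladb.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotKey:
    """Hypothetical holder for the <keyspace, table, snapshot_tag> tuple."""
    keyspace: str
    table: str
    snapshot_tag: str


def snapshot_prefix(key: SnapshotKey) -> str:
    """Build the bucket-relative "directory" for one snapshot of one table."""
    parts = (key.keyspace, key.table, "snapshots", key.snapshot_tag)
    for part in parts:
        # Reject empty components and embedded separators so that one tuple
        # always maps to exactly one prefix.
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return "/".join(parts)


def object_key(key: SnapshotKey, file_name: str) -> str:
    """Build the object name for a single file stored under the prefix."""
    if not file_name or "/" in file_name:
        raise ValueError(f"invalid file name: {file_name!r}")
    return f"{snapshot_prefix(key)}/{file_name}"
```

For example, `snapshot_prefix(SnapshotKey("ks", "tbl", "tag1"))` yields `ks/tbl/snapshots/tag1`. The validation is an assumption of this sketch, not something stated in the thread.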

some interesting findings:

but since scylla-manager will be using scylladb's backup/restore APIs, these differences won't be visible from scylla-manager, which should be using them instead of RcloneMoveDir() in pkg/service/backup/worker_upload.go.

unfortunately, we don't have a unit test for Worker.Upload() yet. there are some noticeable differences:

  1. versioned file: scylla-manager uses the idea of a "versioned file", which encodes the version in the object name as a suffix, and the snapshot tag is used as that suffix. the purpose is to avoid conflicts between different sstables with the same name from the same node. but we don't use this technique anymore when working with newer-version sstables.

  2. currently we use an agent running on each scylladb node as a web server, which handles the requests from scylla-manager. this agent provides services like copy / move to / from object storage. i think we could

    • add new methods to scyllaclient/client_scylla.go, so it supports the backup and restore APIs
    • optionally use them in uploadSnapshotDir() in service/backup/worker_upload.go instead
  3. backup testing: we don't have tests for the sync/copydir API yet. probably we could add them for rclone/rcserver? when it comes to scylla's backup/restore integration, i think we should perform end-to-end tests, as we should not expose the directory arrangement of the backup to its callers, unless we believe it is part of its public interface.

  4. restore implementation: we upload the sstables to scylla nodes in pkg/service/restore/tablesdir_worker.go, which

    1. downloads sstables in batches to the scylla node using the RcloneCopyFile() RPC call. this is completed in the StageData stage.
    2. calls scylla's RESTful API /storage_service/sstables/{keyspace}
  5. restore schema testing: we have two test suites: pkg/service/restore/restore_integration_test.go and pkg/service/restore/service_restore_integration_test.go. both of them use the same methodology as below, and this end-to-end test still applies to the new scylla API:

    1. calls the "Backup" service
    2. lists the newly created backup
    3. calls the "Restore" service
    4. validates the schema and data
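The four-step methodology in item 5 can be sketched as a round-trip check. This is a minimal illustration in Python with in-memory fakes, not the actual Go integration tests. `FakeCluster`, `FakeObjectStore` and all function names are hypothetical stand-ins for the real "Backup" and "Restore" services.

```python
import copy


class FakeObjectStore:
    """Hypothetical stand-in for a bucket: snapshot tag -> saved tables."""
    def __init__(self):
        self.snapshots = {}


class FakeCluster:
    """Hypothetical stand-in for a cluster: (keyspace, table) -> rows."""
    def __init__(self, data=None):
        self.data = dict(data or {})


def backup(cluster, store, tag):
    # Step 1: call the "Backup" service (here: copy the data into the store).
    store.snapshots[tag] = copy.deepcopy(cluster.data)


def list_backups(store):
    # Step 2: list the backups known to the store.
    return sorted(store.snapshots)


def restore(store, tag, target):
    # Step 3: call the "Restore" service (here: load a snapshot into target).
    target.data = copy.deepcopy(store.snapshots[tag])


def roundtrip_ok(source, store, tag):
    """Run backup, list, restore and validate; return True if data matches."""
    before = set(list_backups(store))
    backup(source, store, tag)
    created = set(list_backups(store)) - before
    if created != {tag}:
        return False  # the new backup was not listed as expected
    target = FakeCluster()
    restore(store, tag, target)
    # Step 4: validate that the restored tables and rows equal the source.
    return target.data == source.data
```

Because the check only looks at what goes in and what comes out, the same flow should apply whether the upload is done by the rclone-based agent or by scylla's native API, which is the point made in item 5.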
regevran commented 2 months ago

What is the expected behavior from the user's point of view? What level of specification should one supply in order to back up? i.e. I guess $provider and $location_path are a must. What about $keyspace? $snapshot_version?

regevran commented 2 months ago

I am not sure I follow the existing API vs. the new API and how one should map between the two.

regevran commented 2 months ago

I guess that overall we'll need two sets of tests that run in parallel for the transition period - until all customers upgrade to the version that supports S3 backup from within Scylla. This is because we'll probably change/fix tests for the existing flow as well as for the new flow.

pehala commented 2 months ago

@regevran @tchaikov Correct me if I am wrong, but I believe the flow for the customers should stay the same.

tchaikov commented 2 months ago

@pehala yeah, you are correct.

Michal-Leszczynski commented 2 months ago

@tchaikov sorry for such a late response, I've been busy with patching the current SM restore implementation, as it was the hottest priority from the SM POV. I will take a look at those two PRs tomorrow:

In case you have any questions about how SM restore/backup operate (or why they work like that), please ask me. We can even schedule a call if needed.

karol-kokoszka commented 2 months ago

> This agent uses rclone for upload/download and supports multiple backends, while Scylla's backup/restore APIs only support S3 at the time of writing. Our initial focus is on supporting Amazon S3. Unlike third-party solutions such as rclone, this native implementation offers several advantages:

I think it's a really good choice to rely on the AWS SDK instead of delegating to 3rd-party tools/libs like RClone. You will definitely have much better control over the whole copy/move/delete process. We were thinking about removing RClone and replacing it with pure SDK usage.

There are many tools that are compatible with the S3 API, like Minio (https://min.io/) so it may not be AWS only. There are some customers (or prospects) that use Minio already.

BTW, it's worth including it in the integration tests. We do that in Scylla Manager already. https://github.com/scylladb/scylla-manager/tree/master/testing

> We are enhancing ScyllaDB with a native RESTful API to efficiently back up and restore SSTables to and from object storage services

@tchaikov Do you already have a swagger definition of the backup and restore API that we could take a look at?

tchaikov commented 2 months ago

> > This agent uses rclone for upload/download and supports multiple backends, while Scylla's backup/restore APIs only support S3 at the time of writing. Our initial focus is on supporting Amazon S3. Unlike third-party solutions such as rclone, this native implementation offers several advantages:

> I think it's a really good choice to rely on the AWS SDK instead of delegating to 3rd-party tools/libs like RClone. You will definitely have much better control over the whole copy/move/delete process. We were thinking about removing RClone and replacing it with pure SDK usage.

yeah, i agree. probably you are talking about scylla's S3 implementation? if that's the case, the reason is that we need to have an implementation which uses the seastar framework, so we have to reinvent the wheel.

> There are many tools that are compatible with the S3 API, like Minio (https://min.io/) so it may not be AWS only. There are some customers (or prospects) that use Minio already.

yeah, i know. by AWS S3, i meant the S3 API, not limited to AWS.

> BTW, it's worth including it in the integration tests. We do that in Scylla Manager already. https://github.com/scylladb/scylla-manager/tree/master/testing

what do you mean by "it"?

> > We are enhancing ScyllaDB with a native RESTful API to efficiently back up and restore SSTables to and from object storage services

> @tchaikov Do you already have a swagger definition of the backup and restore API that we could take a look at?

sure.

karol-kokoszka commented 2 months ago

> > BTW, it's worth including it in the integration tests. We do that in Scylla Manager already. https://github.com/scylladb/scylla-manager/tree/master/testing

> what do you mean by "it"?

I mean Minio.

tchaikov commented 2 months ago

> @tchaikov sorry for such a late response, I've been busy with patching the current SM restore implementation, as it was the hottest priority from the SM POV. I will take a look at those two PRs tomorrow:

> In case you have any questions about how SM restore/backup operate (or why they work like that), please ask me. We can even schedule a call if needed.

hi @Michal-Leszczynski thanks for your reply. i just sent a meeting invite to you. hopefully the time works for you so you can hop in and we can sync up with each other.

tchaikov commented 2 months ago

> > > BTW, it's worth including it in the integration tests. We do that in Scylla Manager already. https://github.com/scylladb/scylla-manager/tree/master/testing

> > what do you mean by "it"?

> I mean Minio.

yeah, we are already using minio for testing. see https://github.com/scylladb/scylladb/blob/e4b213f041b131f38a9d782c67152e1203bd3a7e/test/pylib/minio_server.py#L28

regevran commented 2 months ago

> hi @Michal-Leszczynski thanks for your reply. i just sent a meeting invite to you. hopefully the time works for you so you can hop in and we can sync up with each other.

May I join too?

tchaikov commented 2 months ago

> > hi @Michal-Leszczynski thanks for your reply. i just sent a meeting invite to you. hopefully the time works for you so you can hop in and we can sync up with each other.

> May I join too?

it's the weekly backup and restore meeting. so you are already invited.

regevran commented 1 month ago

@tchaikov - I suggest we close this issue, as we have learnt a lot since it was opened. Some of the information here is outdated and even confusing.