prepare for the testing with scylla's backup/restore API

tchaikov commented 3 months ago

We are enhancing ScyllaDB with a native RESTful API to efficiently back up and restore SSTables to and from object storage services. The existing backup process is documented at https://github.com/scylladb/scylla-manager/blob/master/docs/source/backup/index.rst#process , and it is quite similar to the process using the native backup API. The only difference is that the existing implementation uses an agent running on the scylla instance. This agent uses rclone for upload/download and supports multiple backends, while scylla's backup/restore APIs only support S3 at the time of writing. Our initial focus is on supporting Amazon S3. Unlike third-party solutions such as rclone, this native implementation offers several advantages:

- Improved control: We can fine-tune resource consumption during backup and restore operations.
- Enhanced efficiency: ScyllaDB's inherent understanding of data distribution allows for more optimized backup and restore processes.
- Seamless integration: The native API will provide a more cohesive experience within the ScyllaDB ecosystem.
- Scalability: The solution is designed to handle large-scale deployments more effectively.

Now that we've implemented these two APIs:

- [x] backup API: https://github.com/scylladb/scylladb/pull/19890
- [x] restore API: https://github.com/scylladb/scylladb/pull/20305

now is an opportune time to conduct a comprehensive review of the existing backup and restore implementation, along with its associated testing in Scylla Manager. This review should focus on two key areas:

Integration of the new API:

- Analyze how we can seamlessly incorporate the new RESTful API into the existing workflow.
- Identify potential bottlenecks or conflicts that may arise during integration.

Enhancement of testing coverage:

- Evaluate the current test suite and identify gaps in coverage, particularly for the updated workflows.
- Design new test cases to thoroughly validate the functionality of the native S3 backup and restore features.

It's important to note that these efforts extend beyond Scylla Manager. We should also:

- Review and update relevant components within scylladb itself.
- Ensure consistency between scylla-manager and scylladb in terms of API usage and behavior.
- Update documentation across both projects to reflect the new capabilities and any changes in operational procedures.

This holistic approach will ensure that both scylla-manager and scylladb are fully aligned with the new native backup and restore functionality, providing a robust and efficient solution for users.

As the initial phase, I will:

- conduct an inventory of existing tests related to backup and restore functionality
- identify the ones that require reworking
- draft a preliminary plan

tchaikov commented 3 months ago

@Michal-Leszczynski hi Michal, what do you think? i am working with Pavel on this project, and wanted to help on the testing front. it would be great if we could work together to move this forward.

cc @bhalevy @xemul @regevran

regevran commented 3 months ago

@tchaikov - please contact @pehala for support on collecting the data on existing tests.

tchaikov commented 2 months ago

in the backup/restore process of scylla-manager, each <keyspace, table, snapshot_tag> tuple is mapped to a path that looks like $keyspace/$table/snapshots/$snapshot_tag, located under the specified bucket. under this "directory", it preserves:

- manifest: manifest.json
- schema: schema.cql
- data files

it queries an external CQL database to track the backup progress.

some interesting findings:

- scylladb backup uses a path hierarchy like: $bucket/$table_name/$snapshot_name/$sstable_component
- scylla-manager puts the sstables under the path: $dc:$provider:$location_path/backup/sst/cluster/$cluster_id/dc/$dc_id/node/$node_id/keyspace/$keyspace/table/$table_name/$snapshot_version, in which the value of $provider can be one of "s3", "gcs" and "azure". when it comes to "s3", $location_path should be the bucket name. a location describes a certain object storage endpoint, so it should be mapped to the "endpoint" definition of scylladb when setting up a scylladb instance which supports backup/restore to/from object storage.

but since scylla-manager will call scylladb's backup/restore APIs instead of RcloneMoveDir() in pkg/service/backup/worker_upload.go, these differences won't be visible from scylla-manager.
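To make the two layouts above easier to compare, here is a minimal Python sketch that builds both paths from the templates quoted in this comment. The class and function names, and all sample values, are hypothetical and exist only for illustration; the sketch also reuses one snapshot value for both $snapshot_name and $snapshot_version, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotRef:
    """Coordinates of one table snapshot; every value used below is made up."""
    cluster_id: str
    dc: str
    node_id: str
    keyspace: str
    table: str
    snapshot: str


def native_object_key(ref: SnapshotRef, component: str) -> str:
    """scylladb layout inside the bucket: $table_name/$snapshot_name/$sstable_component."""
    return f"{ref.table}/{ref.snapshot}/{component}"


def manager_location(dc: str, provider: str, location_path: str) -> str:
    """scylla-manager location string: $dc:$provider:$location_path."""
    if provider not in ("s3", "gcs", "azure"):
        raise ValueError(f"unsupported provider: {provider}")
    return f"{dc}:{provider}:{location_path}"


def manager_sstable_dir(ref: SnapshotRef) -> str:
    """scylla-manager directory for a table's sstables, relative to the location."""
    return (
        f"backup/sst/cluster/{ref.cluster_id}/dc/{ref.dc}/node/{ref.node_id}"
        f"/keyspace/{ref.keyspace}/table/{ref.table}/{ref.snapshot}"
    )


ref = SnapshotRef("c1", "dc1", "n1", "ks", "users", "snap1")
print(native_object_key(ref, "me-1-big-Data.db"))
print(manager_location("dc1", "s3", "my-bucket"))
print(manager_sstable_dir(ref))
```

The native layout carries no cluster, dc or node component, so anything that needs to tell nodes apart would have to get that from somewhere else, for example from the bucket or endpoint choice.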

unfortunately, we don't have a unit test for Worker.Upload() yet. there are some noticeable differences:

- versioned file: scylla-manager uses the idea of a "versioned file", which encodes the version in the object name as its suffix. the purpose is to avoid conflicts between different sstables with the same name from the same node, and the snapshot tag is used as the suffix. but we don't use this technique anymore when working with newer-version sstables.
- currently we use an agent running on the scylladb node as a web server, which handles the requests from scylla-manager. this agent provides services like copy / move to / from object storage. i think we could
  - add new methods to scyllaclient/client_scylla.go, so it supports the backup and restore APIs
  - optionally use them in uploadSnapshotDir() in service/backup/worker_upload.go instead
- backup testing: we don't have tests for the sync/copydir API yet, probably we could add them for rclone/rcserver? when it comes to scylla's backup/restore integration, i think we should perform end-to-end tests, as we should not expose the directory arrangement of the backup to its callers, unless we believe it is part of its public interface.
- restore implementation: we upload the sstables to scylla nodes in pkg/service/restore/tablesdir_worker.go, which
  1. downloads sstables in batches to the scylla node using the RcloneCopyFile() RPC call. this is completed in the StageData.
  2. calls scylla's RESTful API /storage_service/sstables/{keyspace}
- restore schema testing: we have two test suites: pkg/service/restore/restore_integration_test.go and pkg/service/restore/service_restore_integration_test.go. both of them use the methodology below, and this end-to-end test still applies to the new scylla API:
  1. call the "Backup" service
  2. list the newly created backup
  3. call the "Restore" service
  4. validate the schema and data
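As a rough illustration of why the four-step methodology carries over to the new API, here is a minimal Python sketch that runs the same scenario against two in-memory stand-ins. `FakeFlow`, `run_end_to_end` and the object-key layouts are all hypothetical; they are not the scylla-manager test code and do not call any real service.

```python
from typing import Callable, Dict, List, Tuple


class FakeFlow:
    """Hypothetical stand-in for one backup/restore flow (agent+rclone or native API)."""

    def __init__(self, name: str, key_for: Callable[[str], str]) -> None:
        self.name = name
        self._key_for = key_for  # maps a snapshot tag to an object key
        self._bucket: Dict[str, Tuple[str, List[Tuple]]] = {}
        self._tags: List[str] = []

    def backup(self, tag: str, schema: str, rows: List[Tuple]) -> None:
        self._bucket[self._key_for(tag)] = (schema, list(rows))
        self._tags.append(tag)

    def list_backups(self) -> List[str]:
        return list(self._tags)

    def restore(self, tag: str) -> Tuple[str, List[Tuple]]:
        schema, rows = self._bucket[self._key_for(tag)]
        return schema, list(rows)


def run_end_to_end(flow: FakeFlow) -> bool:
    schema = "CREATE TABLE ks.users (id int PRIMARY KEY, name text)"
    rows = [(1, "ada"), (2, "bob")]
    flow.backup("snap1", schema, rows)                 # 1. call the "Backup" service
    assert "snap1" in flow.list_backups()              # 2. list the newly created backup
    got_schema, got_rows = flow.restore("snap1")       # 3. call the "Restore" service
    return got_schema == schema and got_rows == rows   # 4. validate the schema and data


# The two key layouts are invented; the scenario never inspects them.
flows = [
    FakeFlow("agent+rclone", lambda tag: f"backup/sst/table/users/{tag}"),
    FakeFlow("native-api", lambda tag: f"users/{tag}"),
]
results = {flow.name: run_end_to_end(flow) for flow in flows}
print(results)
```

The scenario only goes through backup, list and restore calls and never looks at object keys, so it stays valid whichever directory arrangement the underlying flow uses.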

regevran commented 2 months ago

What is the expected behavior from the user's point of view? What level of specification should one supply in order to back up? i.e. I guess $provider and $location_path are a must. What about $keyspace? $snapshot_version?

regevran commented 2 months ago

I am not sure I follow the existing API vs. the new API and how one should map between the two.

regevran commented 2 months ago

I guess that overall we'll need two sets of tests that run in parallel for the transition period - until all customers upgrade to the version that supports S3 backup from within Scylla. This is because we'll probably change/fix tests for the existing flow as well as for the new flow.

pehala commented 2 months ago

@regevran @tchaikov Correct me if I am wrong, but I believe the flow for the customers should stay the same.

tchaikov commented 2 months ago

@pehala yeah, you are correct.

Michal-Leszczynski commented 2 months ago

@tchaikov sorry for such a late response, I've been busy with patching the current SM restore implementation, as it was the hottest priority from the SM POV. I will take a look at those two PRs tomorrow:

- Integrated backup scylladb#19890
- Integrated restore scylladb#20305

In case you have any questions about how SM restore/backup operates (or why it works like that), please ask me. We can even schedule a call if needed.

karol-kokoszka commented 2 months ago

> This agent uses rclone for upload/download and supports multiple backends, while scylla's backup/restore APIs only support S3 at the time of writing. Our initial focus is on supporting Amazon S3. Unlike third-party solutions such as rclone, this native implementation offers several advantages:

I think it's a really good choice to rely on the AWS SDK instead of delegating to 3rd-party tools/libs like RClone. You will definitely have much better control over the whole copy/move/delete process. We were thinking about removing RClone and replacing it with pure SDK usage.

There are many tools that are compatible with the S3 API, like Minio (https://min.io/), so it may not be AWS only. There are some customers (or prospects) that use Minio already.

BTW, it's worth including it in the integration tests. We do that in Scylla Manager already. https://github.com/scylladb/scylla-manager/tree/master/testing

> We are enhancing ScyllaDB with a native RESTful API to efficiently back up and restore SSTables to and from object storage services

@tchaikov Do you already have a Swagger definition of the API for backup and restore that we could take a look at?

tchaikov commented 2 months ago

> > This agent uses rclone for upload/download and supports multiple backends, while scylla's backup/restore APIs only support S3 at the time of writing. Our initial focus is on supporting Amazon S3. Unlike third-party solutions such as rclone, this native implementation offers several advantages:

> I think it's a really good choice to rely on the AWS SDK instead of delegating to 3rd-party tools/libs like RClone. You will definitely have much better control over the whole copy/move/delete process. We were thinking about removing RClone and replacing it with pure SDK usage.

yeah, i agree. probably you are talking about scylla's S3 implementation? if that's the case, the reason is that we need to have an implementation which uses the seastar framework, so we have to reinvent the wheel.

> There are many tools that are compatible with the S3 API, like Minio (https://min.io/), so it may not be AWS only. There are some customers (or prospects) that use Minio already.

yeah, i know. by AWS S3, i meant the S3 API, not limited to AWS.

> BTW, it's worth including it in the integration tests. We do that in Scylla Manager already. https://github.com/scylladb/scylla-manager/tree/master/testing

what do you mean by "it"?

> > We are enhancing ScyllaDB with a native RESTful API to efficiently back up and restore SSTables to and from object storage services

> @tchaikov Do you already have a Swagger definition of the API for backup and restore that we could take a look at?

sure.

for backup, see https://github.com/scylladb/scylladb/blob/e4b213f041b131f38a9d782c67152e1203bd3a7e/api/api-doc/storage_service.json#L750
for restore, see https://github.com/scylladb/scylladb/pull/20305/files#diff-7771677331ece83ba219b8ed9f7625afdae1f9fc4738d44c532c623c13f5eb13R798 (still pending on review)

karol-kokoszka commented 2 months ago

> > BTW, it's worth including it in the integration tests. We do that in Scylla Manager already. https://github.com/scylladb/scylla-manager/tree/master/testing

> what do you mean by "it"?

I mean Minio.

tchaikov commented 2 months ago

> @tchaikov sorry for such a late response, I've been busy with patching the current SM restore implementation, as it was the hottest priority from the SM POV. I will take a look at those two PRs tomorrow:

> - Integrated backup scylladb#19890
> - Integrated restore scylladb#20305

> In case you have any questions about how SM restore/backup operates (or why it works like that), please ask me. We can even schedule a call if needed.

hi @Michal-Leszczynski thanks for your reply. i just sent a meeting invite to you. hopefully the time works for you so you can hop in and we can sync up with each other.

tchaikov commented 2 months ago

> > > BTW, it's worth including it in the integration tests. We do that in Scylla Manager already. https://github.com/scylladb/scylla-manager/tree/master/testing

> > what do you mean by "it"?

> I mean Minio.

yeah, we are already using minio for testing. see https://github.com/scylladb/scylladb/blob/e4b213f041b131f38a9d782c67152e1203bd3a7e/test/pylib/minio_server.py#L28

regevran commented 2 months ago

> hi @Michal-Leszczynski thanks for your reply. i just sent a meeting invite to you. hopefully the time works for you so you can hop in and we can sync up with each other.

May I join too?

tchaikov commented 2 months ago

> > hi @Michal-Leszczynski thanks for your reply. i just sent a meeting invite to you. hopefully the time works for you so you can hop in and we can sync up with each other.

> May I join too?

it's the weekly backup and restore meeting. so you are already invited.

regevran commented 1 month ago

@tchaikov - I suggest we close this issue, as we have learnt a lot since it was opened. Some of the information here is outdated and even confusing.

scylladb / scylla-manager

prepare for the testing with scylla's backup/restore API #3973
