scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0
13.23k stars 1.25k forks source link

accept list of object keys in "restore" API #20461

Open tchaikov opened 1 week ago

tchaikov commented 1 week ago

Before scylla is able to take care of the dedup by itself (see https://github.com/scylladb/scylladb/issues/20459), scylla manager (SM for short) composes a list of objects from different prefixes representing different backups (snapshorts) when performing restore. So SM needs to selectively pass a set of keys of objects to scylla, instead of restoring from all objects with a given prefix.

Restore

SM will send a list of the names of objects / files to be downloaded from remote in the HTTP request.

{
          "path":"/storage_service/restore",
          "operations":[
              {
                  "method":"POST",
                  "summary":"Starts copying SSTables from a designated bucket in object storage to a specified keyspace",
                  "type":"string",
                  "nickname":"start_restore",
                  "produces":[
                      "application/json"
                  ],
                  "parameters":[
                      {
                          "name":"endpoint",
                          "description":"ID of the configured object storage endpoint to copy SSTables from",
                          "required":true,
                          "allowMultiple":false,
                          "type":"string",
                          "paramType":"query"
                      },
                      {
                          "name":"bucket",
                          "description":"Name of the bucket to read SSTables from",
                          "required":true,
                          "allowMultiple":false,
                          "type":"string",
                          "paramType":"query"
                      },
                      {
                          "name":"prefix",
                          "description":"The prefix of the object keys for the backuped sstables",
                          "required":true,
                          "allowMultiple":false,
                          "type":"string",
                          "paramType":"query"
                      },
                      {
                          "in": "body",
                          "name": "sstables",
                          "description": "the list of the object keys of the TOC component of the SSTables to be restored",
                          "required":true,
                          "schema" :{
                              "type": "array",
                              "items": {
                                  "type": "string"
                              }
                          }
                      },
                      {
                          "name":"keyspace",
                          "description":"Name of a keyspace to copy SSTables to",
                          "required":true,
                          "allowMultiple":false,
                          "type":"string",
                          "paramType":"query"
                      },
                      {
                          "name":"table",
                          "description":"Name of a table to copy SSTables to",
                          "required":false,
                          "allowMultiple":false,
                          "type":"string",
                          "paramType":"query"
                      },
                  ]
              }
          ]
      }

the restore API

  1. download the sstables with given keys storing their TOC
  2. load and stream these downloaded sstables

this API performs the task in background: when it starts, it returns a string which is a task id right away. with this task id, one can interact with task represented by it using the task-manager API and nodetool commands:

GET /task_manager/task_status/{task_id}
GET /task_manager/wait_task/{task_id}
POST /task_manager/abort_task/{task_id}

see also #20335 for the previous change which adds the "prefix" parameter to "backup" and "restore" APIs.

the key to an object holding an sstable component can be composed with two parts:

the "prefix" query parameter will be used to pass the shared prefix, as it could be lengthy. sometimes, when we dump the payload for debugging, a shorter payload could help. and the body of HTTP request will contain a JSON-encoded array. (as an alternative, we could put a JSON-encoded object with a single array property for better extensibility in future).

notes on implementation

the existing seastar-json2code.py does not generate any code helping us to unmarshal the JSON in payload, but with the helpers in utils/rjson.hh, we should be able to parse a JSON object from a given string. please note, we don't have a wrapper for parsing a stream, but RapidJson does support parsing a stream.

we should filter out the manifest and schema file when listing a snapshot directory created by SM, if we are interested in the TOC.

tchaikov commented 1 week ago

@xemul @kreuzerkrieg @Michal-Leszczynski does this make sense to you?

kreuzerkrieg commented 1 week ago

@xemul @kreuzerkrieg @Michal-Leszczynski does this make sense to you?

setting list of files on the query string sounds awkward. why we cant do it exactly the same way as in backup?

bhalevy commented 1 week ago

I think we can also keep the manifest in a system table, plus store manifest files on backup, so they can be used for restore, if the cluster has been lost.

Michal-Leszczynski commented 1 week ago

setting list of files on the query string sounds awkward. why we cant do it exactly the same way as in backup?

In backup we always want to back up all of the files from snapshot dir (SM takes care of the deduplication by removing deduplicated files from snapshot dir between snapshot and backup API calls).

SM stores files from many different snapshot in the same dir (for deduplication purposes), so in restore we don't want to restore all of the files from the remote dir, only a subset of them defined in the manifest created by SM during the backup.

Michal-Leszczynski commented 1 week ago

SM will send a list of the names of objects / files to be downloaded from remote in the HTTP request.

This also allows for batching on SM side. If SM was to send manifest path, Scylla would probably restore all of the table's data described in this manifest. This does not provide great granularity, but also could result in an uneven workload distribution for scenarios when backed up cluster consisted of different amount of nodes than the restore destination cluster (SM creates a single manifest for a single backed up node).

tchaikov commented 1 week ago

@xemul @kreuzerkrieg @Michal-Leszczynski does this make sense to you?

setting list of files on the query string sounds awkward. why we cant do it exactly the same way as in backup?

@kreuzerkrieg Hi Ernest, IIUC, you are not against the idea of sending the list of files in the request, but just don't want to pass it in query string, right?

tchaikov commented 1 week ago

I think we can also keep the manifest in a system table, plus store manifest files on backup, so they can be used for restore, if the cluster has been lost.

may i learn what's the advantage of doing so? also, if the manifest is persisted both in a system table and in object storage, which copy are we going to use for restore? two options i can think of:

xemul commented 1 week ago

In backup we always want to back up all of the files from snapshot dir (SM takes care of the deduplication by removing deduplicated files from snapshot dir between snapshot and backup API calls).

This obsoletes the "manifest" requirement for backup call, @tchaikov

xemul commented 1 week ago

... but it’s still desirable to pass the names in the payload for better performance. as presumably, the stack is not optimised for large query string.

ACK -- list of files to restore should come in request body, http request query_paremters in seastar is map<sstring, sstring>, so the long list of files will be represented as single long sstring, which is not great

tchaikov commented 1 week ago

In backup we always want to back up all of the files from snapshot dir (SM takes care of the deduplication by removing deduplicated files from snapshot dir between snapshot and backup API calls).

This obsoletes the "manifest" requirement for backup call, @tchaikov

ack. so the workflow is like:

sequenceDiagram
Title: backup
SM->>+agent: take a snapshot (snap.foo) for ks.cf
agent->>+scylla: take a snapshot (snap.foo) for ks.cf
scylla-->>-agent: 
agent-->>-SM: 
SM->>+agent: index snapshot directories in the cloud
agent -->> S3: read all snapshot directories
agent-->>-SM: 
agent-->agent: remove the duplicated files from the snapshot dir
SM->>+agent: create and upload the manifest to the cloud
agent-->> S3: upload the manifest
agent -->> SM: 
SM ->>+ agent: get and upload table schema to the cloud
agent-->> S3: upload the schema
agent -->>- SM: 
SM ->> SM: dedup for the sstables by filtering out the already-uploaded ones
SM ->>+ agent: backup the snap.foo for ks.cf
agent ->>+ scylla: backup the snap.foo for ks.cf
scylla --> S3: upload the sstables in the snapshot dir
scylla -->>- agent: 
agent -->>- SM: 
tchaikov commented 1 week ago

@Michal-Leszczynski please correct me if i am wrong. so we do NOT need the manifest parameter in the backup API, right?

regevran commented 1 week ago

It seems that the backup process is too complicated (and confusing), as it repeats the same data more than once. For me, it is either that SM asks Scylla to backup everything in a given snapshot, or that it transfer a list of files to backup, but not both. The confusion is demonstrated in the last two steps in the drawing above: backup snap.foo for ks.cf is followed by upload ... in the snapshot dir.

As it seems that the the deduplication phase should be done in SM, it is best to pass from SM to Scylla the list of tables to backup. This requires Scylla to verify that there is a snapshot for the requested files before uploading.

kreuzerkrieg commented 1 week ago

@xemul @kreuzerkrieg @Michal-Leszczynski does this make sense to you?

setting list of files on the query string sounds awkward. why we cant do it exactly the same way as in backup?

@kreuzerkrieg Hi Ernest, IIUC, you are not against the idea of sending the list of files in the request, but just don't want to pass it in query string, right?

right. why not to add it to the POST body?

kreuzerkrieg commented 1 week ago

Ok, I guess I'm missing some understanding here how exactly its going to work. The backup will work the same way it is implemented here (as I understand), meaning, just take everything from the snapshot directory and upload it to the given prefix. So far, so good. Then we try to restore and here I'm lost. The "the list of the object keys from which SSTables are restored" part, how it is supposed to look? Is it the full "path" to the SSTable on S3? just a list of SSTables to copy? Then how do I figure out the full path to it?

tchaikov commented 1 week ago

Ok, I guess I'm missing some understanding here how exactly its going to work. The backup will work the same way it is implemented here (as I understand), meaning, just take everything from the snapshot directory and upload it to the given prefix. So far, so good.

yes. this is also my understanding.

Then we try to restore and here I'm lost. The "the list of the object keys from which SSTables are restored" part, how it is supposed to look? Is it the full "path" to the SSTable on S3? just a list of SSTables to copy? Then how do I figure out the full path to it?

it could be the full "path" or a partial "path". as the sstables could come from different snapshots, they do not necessarily share the same full "prefix", so SM would have to provide

so that scylla can compose the full key name of sstable components: backup/sst/cluster/$cluster_id/dc/$data_center_name/node/$node_id/keyspace/$keyspace/table/$table_name/$table_schema_version/$sstable_component

i am not sure which way we prefer. @Michal-Leszczynski @xemul @kreuzerkrieg what do you think? after an offline discussion, we will pass the shared prefix with the query parameter, and the unique part with the payload.

tchaikov commented 6 days ago

v2:

Michal-Leszczynski commented 6 days ago

I think that deduplicating files by removing them from snapshot dir by SM before calling Scylla API to upload the rest of the files to backup location is better because:

Michal-Leszczynski commented 6 days ago

drop the proposal on changing the "backup" API: https://github.com/scylladb/scylladb/pull/20413 is good enough, as we don't need to pass "manifest" to scylla anymore.

That's correct, although take a look at https://github.com/scylladb/scylladb/pull/20413#issuecomment-2339314489.

the unique part: $table_schema_version/$sstable_component and add "prefix" parameter to the "restore" API.

And we are still going to provide a list of unique parts alongside the prefix, right?

tchaikov commented 6 days ago

drop the proposal on changing the "backup" API: #20413 is good enough, as we don't need to pass "manifest" to scylla anymore.

That's correct, although take a look at #20413 (comment).

indeed. we still don't support the "table" parameter at this moment. @xemul @kreuzerkrieg we do plan to support it, right? if yes, i will create another issue to track this.

the unique part: $table_schema_version/$sstable_component and add "prefix" parameter to the "restore" API.

And we are still going to provide a list of unique parts alongside the prefix, right?

yes, see

{
                          "in": "body",
                          "name": "sstables",
                          "description": "the list of the object keys of the TOC component of the SSTables to be restored",
                          "required":true,
                          "schema" :{
                              "type": "array",
                              "items": {
                                  "type": "string"
                              }
                          }
                      },

it's encoded in the HTTP payload.

regevran commented 5 days ago

A clear back/restore API document is supposed to be released soon (prepared by both @Michal-Leszczynski and @regevran). When it is ready we'll be able to discuss it in the document itself and when the dust settles, we'll implement accordingly.

Michal-Leszczynski commented 3 days ago
"path":"/storage_service/restore",
 "operations":[
{
"method":"POST",
 "summary":"Starts copying SSTables from a designated bucket in object storage to a specified keyspace",

@tchaikov a claryfing question - does this API just downloads given batch of sstables, or does it also restores it? Should SM call load&stream afterwards on its own, or is it handled by the Scylla in the scope of this API call? I'm asking because I'm not sure about the "copy sstables to keyspace" part, as it sounds like it is just about downloading the files to the upload (or maybe even data) directory.

"in": "body",
"name": "sstables",
"description": "the list of the object keys of the TOC component of the SSTables to be restored",

Does it mean that SM would be specifying only the files with "TOC.txt" suffix, not all of the component files? That's good in terms of safety (not possible to restore only half of the component files), but wouldn't it mean that Scylla needs to read all of those remote files? Is there a chance that it impacts performance?

"name":"table",
"description":"Name of a table to copy SSTables to",
"required":false,

How can it be not required? What would happen, if it is not specified? I guess that Scylla needs to know to which keyspace.table does the downloaded sstables belong to.

Also, what does this Scylla API call return? Some Scylla task ID that can be used by SM to check progress or cancel this API call?

@tchaikov but in general, proposed API would allow SM to use it as a drop-in replacement for:

xemul commented 3 days ago

@tchaikov a claryfing question - does this API just downloads given batch of sstables, or does it also restores it? Should SM call load&stream afterwards on its own, or is it handled by the Scylla in the scope of this API call?

Load-and-stream will be done by Scylla

Does it mean that SM would be specifying only the files with "TOC.txt" suffix, not all of the component files?

Only TOCs would work.

That's good in terms of safety (not possible to restore only half of the component files), but wouldn't it mean that Scylla > needs to read all of those remote files? Is there a chance that it impacts performance?

It will need to read TOCs anyway. Load-and-stream assumes that sstable is loaded into Scylla, and loading means reading TOC file

Also, what does this Scylla API call return? Some Scylla task ID that can be used by SM to check progress or cancel this API call?

It returns a task ID that can be later accessed via task-manager API:

GET /task_manager/task_status/{task_id}
GET /task_manager/wait_task/{task_id}
POST /task_manager/abort_task/{task_id}
tchaikov commented 2 days ago

@Michal-Leszczynski as Pavel has addressed your queries. i updated description of this issue to clarify the confusions. with TOC, i think

the list of the object keys of the TOC component of the SSTables to be restored

should already explains that only TOC components are accepted.