tchaikov opened 1 week ago
@xemul @kreuzerkrieg @Michal-Leszczynski does this make sense to you?
setting the list of files in the query string sounds awkward. why can't we do it exactly the same way as in backup?
I think we can also keep the manifest in a system table, plus store manifest files on backup, so they can be used for restore, if the cluster has been lost.
setting the list of files in the query string sounds awkward. why can't we do it exactly the same way as in backup?
In backup we always want to back up all of the files from snapshot dir (SM takes care of the deduplication by removing deduplicated files from snapshot dir between snapshot and backup API calls).
SM stores files from many different snapshot in the same dir (for deduplication purposes), so in restore we don't want to restore all of the files from the remote dir, only a subset of them defined in the manifest created by SM during the backup.
SM will send a list of the names of objects / files to be downloaded from remote in the HTTP request.
This also allows for batching on the SM side. If SM were to send the manifest path, Scylla would probably restore all of the table's data described in this manifest. This does not provide great granularity, and could also result in an uneven workload distribution for scenarios where the backed-up cluster consisted of a different number of nodes than the restore destination cluster (SM creates a single manifest for a single backed-up node).
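To make the batching idea above concrete, here is a hedged sketch (all names are illustrative, not SM's actual code): since SM holds the full file list from its manifests, it can split that list into fixed-size batches and dispatch each batch to whichever destination node is free, regardless of how many nodes the backed-up cluster had.

```python
def make_batches(toc_keys, batch_size):
    """Split a manifest's TOC key list into fixed-size batches, so the
    restore work can be spread evenly across the destination nodes."""
    return [toc_keys[i:i + batch_size]
            for i in range(0, len(toc_keys), batch_size)]

# e.g. 10 sstables backed up from one node, restored in batches of 4
batches = make_batches([f"me-{i}-big-TOC.txt" for i in range(10)], 4)
```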
@xemul @kreuzerkrieg @Michal-Leszczynski does this make sense to you?
setting the list of files in the query string sounds awkward. why can't we do it exactly the same way as in backup?
@kreuzerkrieg Hi Ernest, IIUC, you are not against the idea of sending the list of files in the request, but just don't want to pass it in query string, right?
I think we can also keep the manifest in a system table, plus store manifest files on backup, so they can be used for restore, if the cluster has been lost.
may i learn what the advantage of doing so is? also, if the manifest is persisted both in a system table and in object storage, which copy are we going to use for restore? two options i can think of:
In backup we always want to back up all of the files from snapshot dir (SM takes care of the deduplication by removing deduplicated files from snapshot dir between snapshot and backup API calls).
This obsoletes the "manifest" requirement for the backup call, @tchaikov
... but it's still desirable to pass the names in the payload for better performance, as presumably the stack is not optimised for large query strings.
ACK -- the list of files to restore should come in the request body. In Seastar, HTTP request query parameters are stored as a `map<sstring, sstring>`, so a long list of files would be represented as a single long `sstring`, which is not great.
In backup we always want to back up all of the files from snapshot dir (SM takes care of the deduplication by removing deduplicated files from snapshot dir between snapshot and backup API calls).
This obsoletes the "manifest" requirement for the backup call, @tchaikov
ack. so the workflow is like:
```mermaid
sequenceDiagram
    title backup
    SM->>+agent: take a snapshot (snap.foo) for ks.cf
    agent->>+scylla: take a snapshot (snap.foo) for ks.cf
    scylla-->>-agent: 
    agent-->>-SM: 
    SM->>+agent: index snapshot directories in the cloud
    agent-->>S3: read all snapshot directories
    agent-->>-SM: 
    agent-->>agent: remove the duplicated files from the snapshot dir
    SM->>+agent: create and upload the manifest to the cloud
    agent-->>S3: upload the manifest
    agent-->>-SM: 
    SM->>+agent: get and upload the table schema to the cloud
    agent-->>S3: upload the schema
    agent-->>-SM: 
    SM->>SM: dedup the sstables by filtering out the already-uploaded ones
    SM->>+agent: backup snap.foo for ks.cf
    agent->>+scylla: backup snap.foo for ks.cf
    scylla-->>S3: upload the sstables in the snapshot dir
    scylla-->>-agent: 
    agent-->>-SM: 
```
@Michal-Leszczynski please correct me if i am wrong. so we do NOT need the manifest parameter in the backup API, right?
It seems that the backup process is too complicated (and confusing), as it repeats the same data more than once.
For me, it is either that SM asks Scylla to back up everything in a given snapshot, or that it transfers a list of files to back up, but not both. The confusion is demonstrated in the last two steps of the drawing above: "backup snap.foo for ks.cf" is followed by "upload ... in the snapshot dir".
As it seems that the deduplication phase should be done in SM, it is best to pass the list of files to back up from SM to Scylla. This requires Scylla to verify that there is a snapshot for the requested files before uploading.
@xemul @kreuzerkrieg @Michal-Leszczynski does this make sense to you?
setting the list of files in the query string sounds awkward. why can't we do it exactly the same way as in backup?
@kreuzerkrieg Hi Ernest, IIUC, you are not against the idea of sending the list of files in the request, but just don't want to pass it in query string, right?
right. why not add it to the POST body?
Ok, I guess I'm missing some understanding here of how exactly it's going to work. The backup will work the same way it is implemented here (as I understand), meaning, just take everything from the snapshot directory and upload it to the given prefix. So far, so good. Then we try to restore and here I'm lost. The "the list of the object keys from which SSTables are restored" part, how is it supposed to look? Is it the full "path" to the SSTable on S3? Just a list of SSTables to copy? Then how do I figure out the full path to it?
Ok, I guess I'm missing some understanding here of how exactly it's going to work. The backup will work the same way it is implemented here (as I understand), meaning, just take everything from the snapshot directory and upload it to the given prefix. So far, so good.
yes. this is also my understanding.
Then we try to restore and here I'm lost. The "the list of the object keys from which SSTables are restored" part, how is it supposed to look? Is it the full "path" to the SSTable on S3? Just a list of SSTables to copy? Then how do I figure out the full path to it?
it could be the full "path" or a partial "path". as the sstables could come from different snapshots, they do not necessarily share the same full "prefix", so SM would have to provide

`$table_schema_version/$sstable_component`

and

`backup/sst/cluster/$cluster_id/dc/$data_center_name/node/$node_id/keyspace/$keyspace/table/$table_name`

so that scylla can compose the full key name of the sstable components: `backup/sst/cluster/$cluster_id/dc/$data_center_name/node/$node_id/keyspace/$keyspace/table/$table_name/$table_schema_version/$sstable_component`
i am not sure which way we prefer. @Michal-Leszczynski @xemul @kreuzerkrieg what do you think?

update: after an offline discussion, we decided to pass the shared prefix in a query parameter, and the unique parts in the payload.
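A tiny sketch of the key composition discussed above, with hypothetical values: Scylla joins the shared prefix from the query parameter with each unique part (`$table_schema_version/$sstable_component`) from the payload.

```python
def compose_keys(prefix, unique_parts):
    """Join the shared prefix with each $table_schema_version/$sstable_component
    part to form the full object keys."""
    return [f"{prefix.rstrip('/')}/{part}" for part in unique_parts]

# illustrative values, not taken from a real cluster
prefix = "backup/sst/cluster/cl1/dc/dc1/node/n1/keyspace/ks/table/cf"
unique = ["5f3a8c10-0000-11ee-0000-000000000000/me-1-big-TOC.txt"]
keys = compose_keys(prefix, unique)
```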
v2:
I think that deduplicating files by having SM remove them from the snapshot dir before calling the Scylla API to upload the rest of the files to the backup location is better because:
drop the proposal on changing the "backup" API: https://github.com/scylladb/scylladb/pull/20413 is good enough, as we don't need to pass "manifest" to scylla anymore.
pass the unique part (`$table_schema_version/$sstable_component`) in the payload, and add a "prefix" parameter to the "restore" API.
drop the proposal on changing the "backup" API: #20413 is good enough, as we don't need to pass "manifest" to scylla anymore.
That's correct, although take a look at #20413 (comment).
indeed. we still don't support the "table" parameter at this moment. @xemul @kreuzerkrieg we do plan to support it, right? if yes, i will create another issue to track this.
pass the unique part (`$table_schema_version/$sstable_component`) in the payload, and add a "prefix" parameter to the "restore" API.
And we are still going to provide a list of unique parts alongside the prefix, right?
yes, see
```json
{
    "in": "body",
    "name": "sstables",
    "description": "the list of the object keys of the TOC component of the SSTables to be restored",
    "required": true,
    "schema": {
        "type": "array",
        "items": {
            "type": "string"
        }
    }
},
```
it's encoded in the HTTP payload.
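To make the prefix/payload split concrete, here is a hypothetical sketch of how SM might build such a request: the shared prefix goes in the query string, and the TOC object keys go as a JSON-encoded array in the POST body. The endpoint and parameter names follow the snippets quoted in this thread; everything else is illustrative.

```python
import json
from urllib.parse import urlencode

def build_restore_request(prefix, toc_keys):
    """Return the request path and JSON body for the proposed restore API."""
    path = "/storage_service/restore?" + urlencode({"prefix": prefix})
    body = json.dumps(toc_keys)  # a plain JSON-encoded array, per the proposal
    return path, body

path, body = build_restore_request(
    "backup/sst/cluster/cl1/dc/dc1/node/n1/keyspace/ks/table/cf",
    ["5f3a8c10/me-1-big-TOC.txt"])
```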
A clear backup/restore API document is supposed to be released soon (prepared by both @Michal-Leszczynski and @regevran). When it is ready we'll be able to discuss it in the document itself, and when the dust settles, we'll implement accordingly.
```json
"path": "/storage_service/restore",
"operations": [
    {
        "method": "POST",
        "summary": "Starts copying SSTables from a designated bucket in object storage to a specified keyspace",
```
@tchaikov a clarifying question - does this API just download a given batch of sstables, or does it also restore them? Should SM call load&stream afterwards on its own, or is it handled by Scylla in the scope of this API call? I'm asking because I'm not sure about the "copy sstables to keyspace" part, as it sounds like it is just about downloading the files to the upload (or maybe even data) directory.
```json
"in": "body",
"name": "sstables",
"description": "the list of the object keys of the TOC component of the SSTables to be restored",
```
Does it mean that SM would be specifying only the files with "TOC.txt" suffix, not all of the component files? That's good in terms of safety (not possible to restore only half of the component files), but wouldn't it mean that Scylla needs to read all of those remote files? Is there a chance that it impacts performance?
```json
"name": "table",
"description": "Name of a table to copy SSTables to",
"required": false,
```
How can it be not required? What would happen if it is not specified? I guess that Scylla needs to know which keyspace.table the downloaded sstables belong to.
Also, what does this Scylla API call return? Some Scylla task ID that can be used by SM to check progress or cancel this API call?
@tchaikov but in general, proposed API would allow SM to use it as a drop-in replacement for:
@tchaikov a clarifying question - does this API just download a given batch of sstables, or does it also restore them? Should SM call load&stream afterwards on its own, or is it handled by Scylla in the scope of this API call?
Load-and-stream will be done by Scylla
Does it mean that SM would be specifying only the files with "TOC.txt" suffix, not all of the component files?
Only TOCs would work.
That's good in terms of safety (not possible to restore only half of the component files), but wouldn't it mean that Scylla needs to read all of those remote files? Is there a chance that it impacts performance?
It will need to read TOCs anyway. Load-and-stream assumes that the sstable is loaded into Scylla, and loading means reading the TOC file.
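As an illustration of why TOC keys alone suffice (file names and layout here are just examples): the TOC file lists the names of the other components, so the full set of objects to fetch can be derived after a single read of the TOC.

```python
def components_from_toc(toc_key, toc_content):
    """Expand a TOC object key into keys for every component listed in it."""
    base = toc_key.rsplit("TOC.txt", 1)[0]  # e.g. ".../me-1-big-"
    return [base + line.strip()
            for line in toc_content.splitlines() if line.strip()]

# a TOC listing its sibling components (illustrative content)
toc = "Data.db\nIndex.db\nStatistics.db\nTOC.txt\n"
keys = components_from_toc("prefix/me-1-big-TOC.txt", toc)
```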
Also, what does this Scylla API call return? Some Scylla task ID that can be used by SM to check progress or cancel this API call?
It returns a task ID that can later be accessed via the task-manager API:

```
GET  /task_manager/task_status/{task_id}
GET  /task_manager/wait_task/{task_id}
POST /task_manager/abort_task/{task_id}
```
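Put together, the SM-side flow could look like the sketch below. The fake client stands in for real HTTP calls; only the endpoint paths come from the comment above, everything else is hypothetical.

```python
import json

class FakeScyllaAPI:
    """Stand-in for a node's REST API, for illustration only."""
    def __init__(self):
        self._polls = 0

    def post(self, path, body=None):
        # the restore call returns a task ID right away
        if path.startswith("/storage_service/restore"):
            return "task-123"
        raise ValueError(path)

    def get(self, path):
        # simulate a task that finishes after a few polls
        assert path == "/task_manager/task_status/task-123"
        self._polls += 1
        return {"id": "task-123",
                "state": "done" if self._polls >= 3 else "running"}

def restore_and_wait(api, prefix, toc_keys):
    """Start a restore, then poll the task-manager API until it finishes."""
    task_id = api.post(f"/storage_service/restore?prefix={prefix}",
                       body=json.dumps(toc_keys))
    while True:
        status = api.get(f"/task_manager/task_status/{task_id}")
        if status["state"] in ("done", "failed"):
            return status

status = restore_and_wait(FakeScyllaAPI(),
                          "backup/sst/cluster/cl1", ["me-1-big-TOC.txt"])
```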
@Michal-Leszczynski as Pavel has addressed your queries, i updated the description of this issue to clarify the confusion. with TOC, i think

the list of the object keys of the TOC component of the SSTables to be restored

should already explain that only TOC components are accepted.
Before scylla is able to take care of the dedup by itself (see https://github.com/scylladb/scylladb/issues/20459), scylla manager (SM for short) composes a list of objects from different prefixes representing different backups (snapshots) when performing restore. So SM needs to selectively pass a set of object keys to scylla, instead of restoring from all objects with a given prefix.
Restore
SM will send a list of the names of objects / files to be downloaded from remote in the HTTP request.
`[a-z0-9-\.]`. There is no need to escape them.

the restore API
this API performs the task in the background: when it starts, it returns a task id as a string right away. with this task id, one can interact with the task using the task-manager API and nodetool commands:
see also #20335 for the previous change which adds the "prefix" parameter to "backup" and "restore" APIs.
the key to an object holding an sstable component can be composed of two parts:

`backup/sst/cluster/$cluster_id/dc/$data_center_name/node/$node_id/keyspace/$keyspace/table/$table_name`

and

`$table_schema_version/$sstable_component`
the "prefix" query parameter will be used to pass the shared prefix, as it could be lengthy; sometimes, when we dump the payload for debugging, a shorter payload could help. the body of the HTTP request will contain a JSON-encoded array. (as an alternative, we could put a JSON-encoded object with a single array property, for better extensibility in the future.)
notes on implementation
the existing `seastar-json2code.py` does not generate any code helping us to unmarshal the JSON in the payload, but with the helpers in `utils/rjson.hh`, we should be able to parse a JSON object from a given string. please note, we don't have a wrapper for parsing a stream, but RapidJSON does support parsing a stream.

we should filter out the manifest and schema file when listing a snapshot directory created by SM, if we are interested in the TOC.
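The filtering mentioned in the last note could look like the sketch below. The exact names of the files SM writes into the snapshot dir are not spelled out in this thread, so they are passed in as a parameter rather than hard-coded.

```python
def toc_keys_from_listing(listing, sm_metadata_files):
    """Drop SM-written metadata files from a snapshot-dir listing and keep
    only the sstable TOC components."""
    skip = set(sm_metadata_files)
    return [name for name in listing
            if name not in skip and name.endswith("-TOC.txt")]

# illustrative listing: sstable components plus SM's manifest/schema files
files = ["me-1-big-TOC.txt", "me-1-big-Data.db", "manifest.json", "schema.cql"]
tocs = toc_keys_from_listing(files, ["manifest.json", "schema.cql"])
```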