peak / s5cmd

Parallel S3 and local filesystem execution tool.

s5cmd sync -- any way for efficient --incremental ? #746

Open yarikoptic opened 3 months ago

yarikoptic commented 3 months ago

We have a big bucket (TBs of data, but also billions of keys) to download/sync locally. FWIW, versioning is turned on, so keys have versionIds assigned.

Is there some way to make sync efficient, i.e. to utilize some S3 API (or an extra AWS service) to avoid listing/going through all the keys it saw on the previous run, and instead operate on a "log" of the diff since the prior state? (just get the bunch of keys which were added or modified, plus delete markers for deleted keys)

kabilar commented 3 weeks ago

Hi @yarikoptic, it looks like this may have been previously addressed (issues: #441, #447; fix: https://github.com/peak/s5cmd/pull/483) and tested with use cases of millions of files.

cc @puja-trivedi @aaronkanzer

yarikoptic commented 3 weeks ago
kucukaslan commented 3 weeks ago

s5cmd doesn't support it at the moment.

A straightforward workaround would be to manually group keys (potentially by directories, if I may say) and run sync for each group (subset) of the bucket.
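
A minimal sketch of that idea, assuming hypothetical top-level prefixes, bucket name, and local paths (they would need to match the actual bucket layout):

```shell
# Hypothetical sketch: one sync per top-level prefix instead of one sync over
# the whole bucket; the prefix and bucket names below are placeholders.
for prefix in prefix-a prefix-b prefix-c; do
    s5cmd sync "s3://my-bucket/${prefix}/*" "local-copy/${prefix}/"
done
```

Each invocation then only lists and diffs the keys under its own prefix.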

yarikoptic commented 3 weeks ago

> A straightforward workaround would be to manually group keys (potentially by directories, if I may say) and run sync for each group (subset) of the bucket.

That would in no way change the total number of keys being listed (e.g. 1 billion), even when none have changed since the last run.

kucukaslan commented 3 weeks ago

Right, it wouldn't change the number of keys being listed. But it would help avoid sorting such a long list, which is a performance bottleneck.

Moreover, if this is going to be done once, calling s5cmd for each subset of directories seems an acceptable compromise to me. If s5cmd fails for any reason, you'd only need to re-run it for the corresponding subset of directories instead of the full bucket, which would hopefully reduce the total number of list requests. The incrementality would be provided manually.

It is also necessary to make list requests, at least once, to find out what is in the source and destination buckets. IIRC each list request returns up to 1000 keys, and 1000 LIST requests cost $0.005, so for a billion objects the listing cost will be about $5.
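
For reference, the arithmetic behind that estimate (using the 1000-keys-per-response and $0.005-per-1000-requests figures above):

```shell
# 1e9 keys at 1000 keys per LIST request -> 1e6 requests;
# at $0.005 per 1000 requests that comes to about $5.
echo "1000000000 / 1000 / 1000 * 0.005" | bc -l
# -> 5.000...
```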

> Is there some way to make sync efficient, i.e. to utilize some S3 API (or an extra AWS service) to avoid listing/going through all the keys it saw on the previous run, and instead operate on a "log" of the diff since the prior state?

s5cmd doesn't store a restorable state/log, so that is not possible for now.

> just get the bunch of keys which were added or modified, plus delete markers for deleted keys

This might have been easier if the AWS S3 API had a way to pass a "modified since" option, but there isn't one AFAICS: https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/service/s3#ListObjectsV2Input
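
Just to illustrate the point (not an s5cmd feature): with the plain AWS CLI the closest one can get is a client-side filter on `LastModified`, which still pages through every key; the bucket name and date below are placeholders:

```shell
# Client-side filtering sketch: list-objects-v2 still lists every key under
# the bucket/prefix; the date filter is applied to the responses via JMESPath,
# not by S3 itself, so no LIST requests are saved.
aws s3api list-objects-v2 \
    --bucket my-bucket \
    --query "Contents[?LastModified>='2024-10-28'].Key" \
    --output text
```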

yarikoptic commented 3 weeks ago

Yeap -- "modified since" would have been perfect! It's a real shame they didn't provide it.

> Moreover, if this is going to be done once

well, the idea is to do it once a day or so ;)

FWIW, my initial attempt on our bucket, without any fancy manual "splitting" -- I interrupted the `--dry-run sync` after about 8 hours, with the process reaching 76GB of virtual memory utilization:

```shell
dandi@drogon:~/proj/s5cmd-dandi$ duct ../s5cmd/s5cmd --dry-run sync s3://dandiarchive/* dandiarchive/
2024-10-28T11:08:51-0400 [INFO ] con-duct: duct is executing '../s5cmd/s5cmd --dry-run sync s3://dandiarchive/* dandiarchive/'...
2024-10-28T11:08:51-0400 [INFO ] con-duct: Log files will be written to .duct/logs/2024.10.28T11.08.51-2733714_
2024-10-28T18:33:10-0400 [INFO ] con-duct: Summary:
Exit Code: 137
Command: ../s5cmd/s5cmd --dry-run sync s3://dandiarchive/* dandiarchive/
Log files location: .duct/logs/2024.10.28T11.08.51-2733714_
Wall Clock Time: 26657.861 sec
Memory Peak Usage (RSS): 64.5 GB
Memory Average Usage (RSS): 27.8 GB
Virtual Memory Peak Usage (VSZ): 76.4 GB
Virtual Memory Average Usage (VSZ): 30.3 GB
Memory Peak Percentage: 95.7%
Memory Average Percentage: 41.20029702206471%
CPU Peak Usage: 100.0%
Average CPU Usage: 52.88295787687072%
```

PS edit: for the fun of it, I will now run it on a box with 1TB of RAM to see if it ever completes and how long it would take ;-)

yarikoptic commented 2 weeks ago

FTR: that dry run finished, listing about 375M keys for the "dry" cp in 225841.831 sec (about 62.7 hours, i.e. over 2.5 days):

```shell
2024-11-02T05:43:28-0400 [INFO ] con-duct: Summary:
Exit Code: 0
Command: ../../s5cmd/s5cmd --dry-run --log debug sync s3://dandiarchive/* dandiarchive/
Log files location: .duct/logs/2024.10.30T14.59.27-418623_
Wall Clock Time: 225841.831 sec
Memory Peak Usage (RSS): 2.4 GB
Memory Average Usage (RSS): 966.7 MB
Virtual Memory Peak Usage (VSZ): 10.6 GB
Virtual Memory Average Usage (VSZ): 5.8 GB
Memory Peak Percentage: 0.2%
Memory Average Percentage: 0.059080347653249383%
CPU Peak Usage: 667.0%
Average CPU Usage: 413.55872652902883%
[INFO ] == Command exit (modification check follows) =====
run(ok): /home/yoh/proj/dandi/s5cmd-dandi (dataset) [duct ../../s5cmd/s5cmd --dry-run --log d...]
add(ok): .duct/logs/2024.10.30T14.59.27-418623_info.json (file)
add(ok): .duct/logs/2024.10.30T14.59.27-418623_stderr (file)
add(ok): .duct/logs/2024.10.30T14.59.27-418623_stdout (file)
add(ok): .duct/logs/2024.10.30T14.59.27-418623_usage.json (file)
save(ok): . (dataset)
yoh@typhon:~/proj/dandi/s5cmd-dandi$ wc -l .duct/logs/2024.10.30T14.59.27-418623_stdout
375455768 .duct/logs/2024.10.30T14.59.27-418623_stdout
yoh@typhon:~/proj/dandi/s5cmd-dandi$ duct --version
duct 0.8.0
```