peak / s5cmd

Parallel S3 and local filesystem execution tool.
MIT License

s5cmd sync -- any way for efficient --incremental ? #746

Open yarikoptic opened 2 months ago

yarikoptic commented 2 months ago

We have a big bucket (TBs in size, but also billions of keys) to download/sync locally. FWIW, versioning is turned on, so keys have versionIds assigned.

Is there some way to make sync more efficient by using some S3 API (or an extra AWS service) so that, instead of listing/going through all the keys it saw on the previous run, it operates on a "log" of the diff since the prior state? (I.e., just get the bunch of keys which were added or modified, plus the delete markers for deleted keys.)
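
To make the request concrete, here is a minimal sketch of what operating on such a "log" could look like. It is only an illustration under assumptions: `Change` and `plan_incremental_sync` are hypothetical names, not part of s5cmd or any AWS SDK, and where the change log would actually come from is deliberately left open.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Change:
    """One entry of a hypothetical change log (not an s5cmd or S3 type)."""
    key: str
    version_id: str
    deleted: bool  # True when the entry stands for a delete marker


def plan_incremental_sync(
    local: Dict[str, str], changes: Iterable[Change]
) -> Tuple[List[str], List[str]]:
    """Return (keys_to_download, keys_to_delete).

    `local` maps each key to the version id currently on disk.
    `changes` is assumed to be ordered oldest first and to cover
    everything that happened since the previous run.
    """
    # Keep only the most recent change per key; later entries win.
    latest: Dict[str, Change] = {}
    for change in changes:
        latest[change.key] = change

    to_download: List[str] = []
    to_delete: List[str] = []
    for key, change in latest.items():
        if change.deleted:
            # Only delete what actually exists locally.
            if key in local:
                to_delete.append(key)
        elif local.get(key) != change.version_id:
            # New key, or a newer version than the one on disk.
            to_download.append(key)
    return to_download, to_delete


# Example: the cost depends on the number of changes, not on bucket size.
local_state = {"a.txt": "v1", "b.txt": "v1"}
change_log = [
    Change("a.txt", "v2", deleted=False),
    Change("b.txt", "v2", deleted=True),
    Change("c.txt", "v1", deleted=False),
    Change("c.txt", "v2", deleted=False),
]
downloads, deletes = plan_incremental_sync(local_state, change_log)
```

The point is that the work is proportional to the length of the change log rather than to the billions of keys in the bucket; nothing here lists the bucket.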

kabilar commented 10 hours ago

Hi @yarikoptic, it looks like this may have been previously addressed (issues: #441, #447; fix: https://github.com/peak/s5cmd/pull/483) and tested with use cases involving millions of files.

cc @puja-trivedi @aaronkanzer

yarikoptic commented 8 hours ago