Open yarikoptic opened 3 months ago
Hi @yarikoptic, it looks like this may have been previously addressed (issues: #441, #447; fix: https://github.com/peak/s5cmd/pull/483) and tested with the use cases of millions of files.
cc @puja-trivedi @aaronkanzer
s5cmd doesn't support it at the moment.
A straightforward workaround would be to group keys manually (for example, by directory) and run sync for each group (subset) of the bucket.
That would in no way change the total number of keys being listed (e.g. 1 billion), even when none have changed since the last run.
Right, it wouldn't change the number of keys being listed. But it would help avoid sorting such a long list, which is a performance bottleneck.
Moreover, if this is going to be done once, calling s5cmd for each subset of directories seems an acceptable compromise to me. If s5cmd fails for any reason, you'd only need to rerun it for the corresponding subset of directories instead of the full bucket, which will hopefully reduce the total number of list requests. The incrementality would be provided manually.
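The per-group workaround described above could be scripted roughly as follows. This is only a sketch: the bucket name, prefixes, and destination directory are hypothetical, the helper names `build_sync_commands` and `run_groups` are made up for illustration, and it assumes the usual `s5cmd sync 's3://bucket/prefix/*' dest/` invocation form. Nothing here is part of s5cmd itself.

```python
import subprocess
from typing import Callable, Iterable, List


def build_sync_commands(bucket: str, prefixes: Iterable[str], dest_root: str) -> List[List[str]]:
    """Build one `s5cmd sync` argument list per key prefix ("directory")."""
    commands = []
    for prefix in prefixes:
        prefix = prefix.strip("/")
        source = f"s3://{bucket}/{prefix}/*"          # wildcard over one subset only
        dest = f"{dest_root.rstrip('/')}/{prefix}/"   # mirror the prefix locally
        commands.append(["s5cmd", "sync", source, dest])
    return commands


def run_groups(commands: List[List[str]], runner: Callable = subprocess.run) -> List[List[str]]:
    """Run each command and return the ones that failed, so only those need a rerun."""
    failed = []
    for cmd in commands:
        result = runner(cmd)
        if result.returncode != 0:
            failed.append(cmd)
    return failed


if __name__ == "__main__":
    # Hypothetical bucket and prefixes; printed rather than executed.
    for cmd in build_sync_commands("my-bucket", ["sub-01", "sub-02"], "/data/mirror"):
        print(" ".join(cmd))
```

Each group still has to be listed in full, so this only shortens the lists that get sorted and limits what must be redone after a failure.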
It is also necessary to make a list request, at least once, to find out what is in the source & destination buckets. IIRC every list request returns up to 1000 keys, and 1000 list requests cost $0.005. So for a billion objects the listing cost will be about $5.
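The estimate can be checked with a few lines of arithmetic. The page size and price below are the figures quoted in the comment (recalled from memory there, not verified against current AWS pricing), and `listing_cost_usd` is a hypothetical helper name.

```python
import math


def listing_cost_usd(num_keys: int,
                     keys_per_request: int = 1000,
                     usd_per_1000_requests: float = 0.005):
    """Return (number of list requests, estimated cost in USD) for one full listing."""
    requests = math.ceil(num_keys / keys_per_request)
    cost = requests / 1000 * usd_per_1000_requests
    return requests, cost


if __name__ == "__main__":
    requests, cost = listing_cost_usd(1_000_000_000)
    print(f"{requests} list requests, about ${cost:.2f} per full listing")
```

A billion keys means a million list requests, which at that price is roughly $5 for each full pass over the bucket.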
Is there some way to make sync more efficient by using some S3 API (or an extra AWS service) to avoid listing/going through all the keys it saw on the previous run, and instead operate on a "log" of the diff since the prior state?
s5cmd doesn't store a restorable state/log, so not possible for now.
just get the keys which were added or modified, plus some DeleteKey markers for deleted keys
This might have been easier if the AWS S3 API had a "modified since" option, but there isn't one afaics. https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/service/s3#ListObjectsV2Input
yeap -- "modified since" would have been perfect! It's really a shame they didn't provide it.
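Without a server-side option, any "modified since" filter has to be applied on the client after the full listing has been fetched. A minimal sketch, assuming listing entries shaped like the `Key`/`LastModified` records returned by ListObjectsV2 (the `modified_since` helper and the sample data are hypothetical):

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, Iterator


def modified_since(entries: Iterable[Dict], cutoff: datetime) -> Iterator[str]:
    """Yield keys whose LastModified is later than `cutoff`.

    `entries` must still come from a complete listing, so this saves
    transfer and comparison work but not the list requests themselves.
    """
    for entry in entries:
        if entry["LastModified"] > cutoff:
            yield entry["Key"]


if __name__ == "__main__":
    # Hypothetical listing entries for illustration.
    listing = [
        {"Key": "a.dat", "LastModified": datetime(2024, 1, 1, tzinfo=timezone.utc)},
        {"Key": "b.dat", "LastModified": datetime(2024, 3, 1, tzinfo=timezone.utc)},
    ]
    last_run = datetime(2024, 2, 1, tzinfo=timezone.utc)
    print(list(modified_since(listing, last_run)))
```

Note that a filter like this cannot detect deleted keys, since they no longer appear in the listing at all.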
Moreover, if this is going to be done once
well, the idea is to do it once a day or so ;)
PS edit: for the fun of it, I will now run it on a box with 1TB of RAM to see whether it ever completes, and how long it takes ;-)
We have a big bucket (TBs of data, but also billions of keys) to download/sync locally. FWIW, versioning is turned on, so keys have versionIds assigned.
Is there some way to make sync more efficient by using some S3 API (or an extra AWS service) to avoid listing/going through all the keys it saw on the previous run, and instead operate on a "log" of the diff since the prior state? (just get the keys which were added or modified, plus some DeleteKey markers for deleted keys)