peak / s5cmd

Parallel S3 and local filesystem execution tool.
MIT License

MD5 option for use with overwrite choice / sync #152

Open Krobar opened 4 years ago

Krobar commented 4 years ago

Would be great if this option could be added. I know it requires a custom metadata addition but it would be really useful.

Use case: copying static site generator output to S3. s5cmd is way faster than the alternatives, but unfortunately it copies files that don't need updating, which makes it more expensive.

igungor commented 4 years ago

Since we use multipart upload, an object's ETag changes if the user changes the part size used to upload the file. Relevant package: https://github.com/peak/s3hash/
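To illustrate the point about part size, here is a minimal sketch of the commonly observed multipart ETag scheme: the MD5 of the concatenated per-part MD5 digests, followed by a dash and the part count. This layout is an assumption based on widely reported behaviour rather than a documented guarantee, and it may not hold for objects encrypted with SSE-KMS or SSE-C. The function name `multipart_etag` and the tiny part sizes are hypothetical and for illustration only (real multipart parts are much larger).

```python
import hashlib


def multipart_etag(data: bytes, part_size: int) -> str:
    """Return an ETag-style string for `data` split into `part_size` chunks.

    Assumption: the ETag is md5(md5(part1) + md5(part2) + ...) in hex,
    followed by "-<number of parts>".
    """
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    # Split the payload into fixed-size parts (the last one may be shorter).
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)] or [b""]
    # Concatenate the raw (binary) MD5 digest of every part.
    digests = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"


payload = bytes(range(256)) * 64          # 16 KiB of sample data
etag_a = multipart_etag(payload, 4096)    # 4 parts
etag_b = multipart_etag(payload, 8192)    # 2 parts
plain_md5 = hashlib.md5(payload).hexdigest()

print(etag_a)
print(etag_b)
print(plain_md5)
```

The same bytes produce different values for the two part sizes, and neither matches the plain MD5 of the content, which is why comparing a local MD5 against the ETag is not a dependable change check.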

It's not as safe as a hash check, but cp -n -s practically does the same job for use cases like this.

Duplicate of #43

Krobar commented 4 years ago

Thank you for the reply. I tried -s and it doesn't quite work for this use case. The reason is that if I make a minor change to the page output (e.g. capitalise a letter), the size does not change and the file is not uploaded. -n is not appropriate for this use case either, as the generated files always have a newer modified date than the previous files.

I don't think the ETag is reliable these days, as it no longer contains an MD5 hash of the upload. Some other (much slower) S3 utilities add a custom MD5 metadata tag and check for it; this is not perfect, but it would work well for this use case. It would be good if this could be considered.
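The custom-metadata idea described above could look roughly like the sketch below. It is not s5cmd's implementation and not any particular tool's API: the metadata key `md5` and the helpers `file_md5` and `needs_upload` are hypothetical, and the remote object's user metadata is modelled as a plain dictionary (for example, what a HEAD request on the object would return) so the example runs without network access.

```python
import hashlib
import tempfile
from pathlib import Path

MD5_KEY = "md5"  # hypothetical user-metadata key written at upload time


def file_md5(path, chunk_size=1 << 20):
    """Hex MD5 of a file, read in chunks so large files are not loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_upload(path, remote_metadata):
    """Decide whether a local file should be uploaded.

    `remote_metadata` is the object's user metadata as a dict, or None when
    the object does not exist. Objects without the MD5 key are re-uploaded.
    """
    if remote_metadata is None:
        return True
    return remote_metadata.get(MD5_KEY) != file_md5(path)


with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "index.html"
    page.write_bytes(b"<h1>hello</h1>")
    remote = {MD5_KEY: file_md5(page)}      # metadata stored with the upload
    unchanged = needs_upload(page, remote)  # same content, nothing to do

    page.write_bytes(b"<h1>Hello</h1>")     # same size, one letter capitalised
    changed = needs_upload(page, remote)    # content hash differs
    missing = needs_upload(page, None)      # object not on the remote yet

print(unchanged, changed, missing)
```

Unlike a size comparison, this check catches the same-size edit, and unlike a modified-date comparison it skips regenerated files whose content is identical. The cost is one extra metadata lookup per object and hashing every local file.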

Nowaker commented 2 years ago

ETag isn't reliable. aws s3 sync has reportedly been broken for years, as it doesn't guarantee an actual sync. See https://github.com/aws/aws-cli/issues/3273

kishaningithub commented 1 year ago

Can the approach taken by s4cmd not be used here? https://github.com/bloomreach/s4cmd#additional-technical-notes