peak / s5cmd

Parallel S3 and local filesystem execution tool.
MIT License

Feature Request: Sync based on hash mismatch #561

Open pr0ton11 opened 1 year ago

pr0ton11 commented 1 year ago

The sync feature between two buckets could be extended with a new strategy based on MD5 or SHA-256 hashes. This could also be used with local files, as these hash calculations should not be that expensive. In most cases, a simple HEAD request provides the checksum on the S3 side (maybe someone could confirm this for AWS S3).

In my case (Ceph RGW) this is provided as etag: "md5sum"

Thank you

salim-b commented 2 weeks ago

etag: "md5sum"

ETag header values of S3-compatible object stores only directly correspond to the file's MD5 hash if the file has not been created via multipart upload.
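To illustrate the single-part case, here is a minimal Python sketch (not s5cmd code) that compares local content against an ETag only when the ETag looks like a plain MD5 hex digest. The function names are hypothetical, and it assumes the store returns the MD5 as the ETag for non-multipart objects, which may not hold for every provider or for some server-side encryption modes.

```python
import hashlib
import re
from typing import Optional

# A plain MD5 ETag is 32 hex characters with no "-<parts>" suffix.
_PLAIN_MD5 = re.compile(r"[0-9a-f]{32}")


def normalize_etag(etag: str) -> str:
    """Strip whitespace and the double quotes that ETag header values carry."""
    return etag.strip().strip('"').lower()


def matches_single_part_etag(data: bytes, etag: str) -> Optional[bool]:
    """Compare data against a plain-MD5 ETag.

    Returns True or False when the ETag looks like a plain MD5 digest,
    and None when it does not (for example a multipart ETag such as
    "<hex>-3"), because no conclusion can be drawn from the MD5 alone.
    """
    normalized = normalize_etag(etag)
    if _PLAIN_MD5.fullmatch(normalized) is None:
        return None
    return hashlib.md5(data).hexdigest() == normalized
```

Returning None instead of False for unrecognized ETags keeps "cannot tell" separate from "content differs", so a sync strategy could fall back to size and modification time in that case.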

For multipart uploads, the following applies:

The ETag of each individual part is the MD5 hash of the contents of the part. The ETag of the completed multipart object is the hash of the MD5 sums of each of the constituent parts concatenated together followed by a hyphen and the number of parts uploaded.
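The quoted rule can be sketched in Python as follows. This assumes the commonly described AWS behaviour: the binary MD5 digests of the parts are concatenated, that concatenation is hashed with MD5 again, and the part count is appended after a hyphen. `multipart_etag` is a hypothetical helper name, and other S3-compatible stores may compute the value differently.

```python
import hashlib


def multipart_etag(data: bytes, part_size: int) -> str:
    """Compute the ETag a multipart upload of data would get (assumed AWS scheme).

    The data is split into parts of part_size bytes (the last part may be
    shorter). Note that even a single part yields a "-1" suffix here, so
    the result is not the plain MD5 of the data.
    """
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    # Concatenate the raw 16-byte digests, not their hex representations.
    digests = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"
```

For real files one would read the parts from disk in a streaming fashion rather than holding everything in memory; the in-memory version is only meant to show the calculation.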

As igungor pointed out here:

Since we use multipart upload, object ETag changes if user changes part-size of a file.
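A practical consequence for ETag-based sync is that a local file can only be checked against a multipart ETag if the part size used at upload time is known or can be guessed. The sketch below tries a list of candidate part sizes and uses the part count from the ETag suffix to skip sizes that cannot match. Both function names and the candidate list are hypothetical, and the ETag scheme is the assumed AWS one (MD5 over the concatenated binary part digests).

```python
import hashlib
from typing import Iterable, Optional


def multipart_etag(data: bytes, part_size: int) -> str:
    """ETag under the assumed AWS multipart scheme."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    digests = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"


def find_part_size(data: bytes, etag: str,
                   candidates: Iterable[int]) -> Optional[int]:
    """Return the first candidate part size that reproduces etag, else None.

    candidates must contain positive integers (sizes in bytes).
    """
    etag = etag.strip().strip('"').lower()
    _, sep, count = etag.rpartition("-")
    if not sep or not count.isdigit():
        return None  # not a multipart-style ETag
    expected_parts = int(count)
    for size in candidates:
        n_parts = -(-len(data) // size)  # ceiling division
        if n_parts != expected_parts:
            continue  # this size cannot produce the advertised part count
        if multipart_etag(data, size) == etag:
            return size
    return None
```

If no candidate matches, that means either the content differs or the upload used a part size outside the list, so the result cannot be read as proof of a mismatch.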


The above does not speak against the usefulness of ETag-based sync in general. It is just important to keep in mind, since not all object storage providers with an "S3-compatible API" implement this the way the original AWS S3 does, not least because AWS doesn't publicly document how they implement such important details (though that particular linked statement is clearly outdated, as the multipart ETag calculation is publicly known by now).