s3tools / s3cmd

Official s3cmd repo -- Command line tool for managing S3 compatible storage services (including Amazon S3 and CloudFront).
https://s3tools.org/s3cmd
GNU General Public License v2.0

Cloudflare R2: WARNING: MD5 Sums don't match! #1273

Open Lusitaniae opened 1 year ago

Lusitaniae commented 1 year ago

Happens when uploading large (>5 GB) files that require a multipart upload to Cloudflare R2.

s3cmd put -d -v my-file.tar.zst s3://my-bucket/ 
DEBUG: Canonical Request:
PUT
/my-file.tar.zst
partNumber=1&uploadId=[redacted]
content-length:15728640
host:[redacted].r2.cloudflarestorage.com
x-amz-content-sha256:22c5bf1bd95afe12f8cd6e13ae5db4299a9defcb6df2cfc69285488e2deb5c09
x-amz-date:20220818T052008Z

content-length;host;x-amz-content-sha256;x-amz-date
22c5bf1bd95afe12f8cd6e13ae5db4299a9defcb6df2cfc69285488e2deb5c09
----------------------
DEBUG: signature-v4 headers: {'content-length': '15728640', 'x-amz-date': '20220818T052008Z', 'Authorization': '[redacted]', 'x-amz-content-sha256': '22c5bf1bd95afe12f8cd6e13ae5db4299a9defcb6df2cfc69285488e2deb5c09'}
DEBUG: get_hostname([redacted]): [redacted].r2.cloudflarestorage.com
DEBUG: ConnMan.get(): re-using connection: https://[redacted].r2.cloudflarestorage.com#6
DEBUG: format_uri(): /my-file.tar.zst?partNumber=1&uploadId=[redacted]
    65536 of 15728640     0% in    0s     4.97 MB/sDEBUG: ConnMan.put(): connection put back to pool (https://[redacted].r2.cloudflarestorage.com#7)
DEBUG: Response:
{'data': b'',
 'headers': {'cf-ray': '73c832d5dd7e15cb-EWR',
             'connection': 'keep-alive',
             'content-length': '0',
             'date': 'Thu, 18 Aug 2022 05:20:13 GMT',
             'etag': '"ABwcsNIEIx/3TXD+37wkhu1YIh8AgUg/++I5bsBm9MiQotlGsTOkpQhTeRkj/p5IFx2PSa/ouG94ghv+Mniyltsnj6QDUb9omfJfRLd0hJVqTPReu9NfKcBp0Z9NTBHcwf83xI3u49eLDXsDH9rS/EDF9ALqJ6Y6HmUCfB4g6bwZSeAgly77Amaqib1kkH+uta/NcIfe1ot1he0iaLC5ZIwruHOrG+F5gsZkmJ1qZXpWrYLBVUhyFPZ6Yo1LlKjSJw=="',
             'expect-ct': 'max-age=604800, '
                          'report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"',
             'server': 'cloudflare',
             'vary': 'Accept-Encoding'},
 'reason': 'OK',
 'size': 15728640,
 'status': 200}
 15728640 of 15728640   100% in    4s     3.34 MB/s  done
DEBUG: MD5 sums: computed=e7df577f795e45df5535f558c9931973, received=ABwcsNIEIx/3TXD+37wkhu1YIh8AgUg/++I5bsBm9MiQotlGsTOkpQhTeRkj/p5IFx2PSa/ouG94ghv+Mniyltsnj6QDUb9omfJfRLd0hJVqTPReu9NfKcBp0Z9NTBHcwf83xI3u49eLDXsDH9rS/EDF9ALqJ6Y6HmUCfB4g6bwZSeAgly77Amaqib1kkH+uta/NcIfe1ot1he0iaLC5ZIwruHOrG+F5gsZkmJ1qZXpWrYLBVUhyFPZ6Yo1LlKjSJw==
WARNING: MD5 Sums don't match!
WARNING: Too many failures. Giving up on 'my-file.tar.zst'
s3cmd --version
s3cmd version 2.2.0

Maybe Cloudflare is missing some API compatibility? The docs look OK to me:

https://developers.cloudflare.com/r2/platform/s3-compatibility/api/#object-level-operations

Issue in Cloudflare Community forums: https://community.cloudflare.com/t/multi-part-uploads-from-s3cmd-broken/412143
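For context, the warning fires when the MD5 that s3cmd computes locally over the uploaded bytes does not match the ETag the server returns (visible in the `DEBUG: MD5 sums:` line above). A minimal sketch of that comparison, with hypothetical helper names rather than s3cmd's actual code:

```python
import hashlib

def compute_md5_hex(path, chunk_size=64 * 1024):
    """Stream a file through MD5 and return the hex digest --
    roughly what s3cmd computes locally for the comparison."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def md5_matches_etag(local_md5, etag):
    # The ETag header arrives quoted, e.g. '"e7df577f..."'
    return local_md5 == etag.strip('"')
```

With R2's long opaque part ETags, the comparison can never succeed, which is why the retry loop eventually gives up.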

fviard commented 1 year ago

Sadly, there is not much we can do at the moment if Cloudflare does not fix their API.

I think you can still use s3cmd with the following flag: "--no-check-md5". You will lose the MD5 check for "sync", but if you only use "put", that should not change much.

Also, if you are willing to try a hack to the s3cmd source code, you can do the following:
in S3/S3.py, replace all occurrences of:
'-' not in md5_from_s3
with:
'-' not in md5_from_s3 and len(md5_from_s3) < 50

For example here: https://github.com/s3tools/s3cmd/blob/b7520e5c25e1bf25c1a8bf5aa2eadb299be8f606/S3/S3.py#L1844

Currently, the code has some detection for ETags that are not a "hash", and to overcome that we use our own custom header. But we do that by detecting the character "-" inside the value, because on AWS, multipart ETags contain a minus sign followed by the number of parts.
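Condensed, the suggested tweak turns the plain "-" check into something like this (a sketch of the heuristic, not the exact s3cmd code):

```python
def etag_looks_like_plain_md5(md5_from_s3):
    """AWS-style multipart ETags end in '-<part count>', so s3cmd treats
    any ETag without '-' as a plain MD5. The extra length test also
    rejects long opaque values such as R2's base64-style part ETags."""
    return '-' not in md5_from_s3 and len(md5_from_s3) < 50
```

A 32-character MD5 hex digest passes, an AWS multipart ETag like "...-2" fails on the dash, and R2's long opaque ETag fails on length.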

If the modification works, you could try asking Cloudflare whether, by chance, they would be willing to change their ETag to one with a syntax that matches what is expected.
Some possibilities:

vlovich commented 1 year ago

Is this about the ETag for UploadPart or for the completed upload? For UploadPart we're not going to be returning the MD5, and that's an intentional deviation. If that's the case, so far I've only heard of s3cmd having an issue. For a completed multipart upload we return <hash>-<nparts> as the ETag, but "hash" is not computed the same way S3 computes it.

fviard commented 1 year ago

@vlovich Yes, here we are talking about the ETag of the UploadPart. For a completed multipart upload, it is expected that the ETag might be different and provider-specific.

But for the upload of a single part, there is no reason the ETag behavior should differ from that of a simple non-multipart file. As far as I know, Cloudflare is the only S3 (not-)compatible implementation that does not use the MD5 as the ETag of a part.

We might try to detect that it is not an MD5 based on its size, but it is a little sad to have to add a hack just because of Cloudflare's implementation...
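The size-based detection mentioned here could be made stricter by accepting only values that actually look like an MD5 hex digest (again just a sketch, not s3cmd code):

```python
import re

# A plain (non-multipart) S3 ETag is the object's MD5: 32 hex characters.
_MD5_HEX = re.compile(r"^[0-9a-fA-F]{32}$")

def is_md5_hex(etag):
    """True only for a value that is exactly a 32-character hex digest."""
    return bool(_MD5_HEX.match(etag.strip('"')))
```

This rejects both AWS-style "<hash>-<nparts>" multipart ETags and R2's long opaque part ETags without relying on an arbitrary length cutoff.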

anuaimi commented 1 year ago

Does anyone know if this issue has been fixed? I'm looking to upload large files to R2.