multiformats / multicodec

Compact self-describing codecs. Save space by using predefined multicodec tables.

Proposal: s3-sha2-256-x for Amazon's parallel checksums #342

Closed. drernie closed this issue 5 months ago.

drernie commented 5 months ago

We would like to use a multihash for the new Amazon S3 Checksums, specifically their SHA-256 variant. Is this the right place and way to request a new prefix for that?

These differ from traditional SHA-256 in that they are computed in parallel using a variable block size. Being Amazon, some of their SDKs use 8 MB and others use 5 MB (though it is user-configurable).

Specifically, I would like to propose (by analogy with dbl-sha2-256 and ssz-sha2-256-bmt):

s3-sha2-256-5,                   multihash,      0xb515,           draft,
s3-sha2-256-8,                   multihash,      0xb518,           draft,

This could be used both for the native S3 algorithm, as well as compatible implementations (such as ours).

I believe this satisfies the "two implementations" requirement for multiformats registration. Is that sufficient? Should I just create a PR, or is there a more formal process?

Thanks!

burdiyan commented 5 months ago

A block in IPFS is normally around 1-2 MB; bigger files are split into chunks around this size and then hashed. Even though reserving these multicodecs could be useful, they may be incompatible with how IPFS currently works.

These are just my assumptions, though; I'm not a maintainer.

rvagg commented 5 months ago

~The 1-2 MB limit is somewhat artificial and doesn't apply to most places that IPLD blocks are used; I think bitswap is the only place where block sizes are most constrained. Plus it's not really a big concern for multicodecs because we don't assume that you're going to be passing them through IPFS systems or even using them in CIDs.~ (see below, not my beef)

@drernie I think a pull request should be fine for these; what you say here sounds reasonable. One of us will probably have a look in a bit more detail just to sanity check that it's logical for them to have separate entries.

vmx commented 5 months ago

We would like to use a multihash for the new Amazon S3 Checksums, specifically their SHA-256 variant.

I've read the blog post. Do I understand correctly that the resulting hash of the Amazon S3 Checksum is equal to the one I'd calculate locally with a standard SHA-256 tool? If yes, what is your use case for having a different multicodec instead of using the existing SHA-256 multihash identifier?

The general idea behind those identifiers is that they identify how a hash was calculated algorithmically, independent of the implementation. To me it sounds like the Amazon S3 Checksum is just an implementation detail.

aschmahmann commented 5 months ago

As with @vmx's question above, I'm not sure, based just on your description and the data from that link, what this is. Some concerns (assuming that the end result isn't SHA256(the bytes), as @vmx is asking about):

Being amazon, some of their SDKs use 8mb and others use 5mb (though it is user-configurable).

Since it's user configurable you wouldn't want two codes because there could be arbitrarily many. Instead you'd probably want a format where the multihash is <multihash-codec><varint-for-chunk-size><digest>.

However, it's not clear to me whether Amazon requires fixed chunk sizes. In that case there's no actual way to concisely describe the mapping from bytes to digest, which means that you'd more likely want something like a CID with an IPLD codec to describe your graph format (i.e. if the root hash is just SHA256(....) then you'd want a codec for concatenated-sha2-256 with a CID like <0x01 (i.e. the CIDv1 code)><concatenated-sha2-256><sha2-256><0x20 (i.e. the digest length in bytes)><digest>).
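
For illustration, a rough sketch in Python of that <multihash-codec><varint-for-chunk-size><digest> idea (the names and the code point used here are hypothetical placeholders, not anything registered; a standard multihash would normally also carry a digest-length varint):

def encode_uvarint(n: int) -> bytes:
    # Unsigned LEB128 varint, as used throughout multiformats.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

HYPOTHETICAL_CHUNKED_SHA2_256 = 0xb500  # placeholder code point, for illustration only

def chunked_multihash(digest: bytes, chunk_size_bytes: int) -> bytes:
    # <multihash-codec><varint-for-chunk-size><digest>, per the suggestion above.
    return (encode_uvarint(HYPOTHETICAL_CHUNKED_SHA2_256)
            + encode_uvarint(chunk_size_bytes)
            + digest)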

aschmahmann commented 5 months ago

The 1-2 MB limit is somewhat artificial and doesn't apply to most places that IPLD blocks are used

@rvagg while I get what you're trying to clear up here (i.e. that none of multihash, CIDs, IPLD or even IPFS constrain the block sizes at the spec level), I think you're spreading the same misinformation I've seen elsewhere that this is just some wacky Bitswap protocol anomaly - it's not: https://discuss.ipfs.tech/t/supporting-large-ipld-blocks/15093#why-block-limits-5.

I understand this is off topic (and that you weren't the one to bring up the block sizes in the first place), and am happy to continue the conversation elsewhere (e.g. that forum post).

drernie commented 5 months ago

Thanks for the rapid responses!

From @vmx

Do I understand correctly that the resulting hash of the Amazon S3 Checksum is equal to the one I'd calculate locally with a standard SHA-256 tool?

No. I believe SHA-256 is inherently serial. It assumes all the data is hashed linearly. The AWS algorithm is intrinsically parallel: it hashes each block individually, then hashes the results all together. That is why block size matters.

From @aschmahmann

However, it's not clear to me whether Amazon requires fixed chunk sizes. In that case there's no actual way to concisely describe the mapping from bytes to digest, which means that you'd more likely want something like a CID with an IPLD codec to describe your graph format (i.e. if the root hash is just SHA256(....) then you'd want a codec for concatenated-sha2-256 with a CID like <0x01 (i.e. the CIDv1 code)><0x20 (i.e. the digest length in bytes)>).

You are absolutely correct. The algorithm as defined can be arbitrarily complex in terms of block sizes. However, in practice the only implementations in actual use are the 5 MB and 8 MB ones, which is what we want to formalize.

From @rvagg

@drernie I think a pull request should be fine for these, what you say here sounds reasonable, one of us will probably have a look in a bit more detail just to sanity check that it's logical for them to have separate entries.

Thanks!

vmx commented 5 months ago

Do I understand correctly that the resulting hash of the Amazon S3 Checksum is equal to the one I'd calculate locally with a standard SHA-256 tool?

No. I believe SHA-256 is inherently serial. It assumes all the data is hashed linearly. The AWS algorithm is intrinsically parallel: it hashes each block individually, then hashes the results all together. That is why block size matters.

@drernie that spawned my interest, so I tried it out.

There is an AWS tutorial on how to use SHA-256 checksums that describes at the end how to actually calculate the checksum of a file. My version of it is:

sha256sum mydata.dat | cut --bytes -64| xxd --revert --plain | base64

It's using the usual sha256sum tool and not any custom one. But let's verify that it actually works.

Allocate a file with a size that cannot be divided evenly by 5MB/MiB or 8MB/MiB, in order to make sure the calculation isn't just incidentally correct (as those are the chunk sizes mentioned in the original description). It's also bigger than 16MB to make sure it's a multipart upload, which potentially could also trigger different hashing.

$ fallocate --length 23456789 mydata.dat

Calculate the SHA-256 checksum of it:

$ sha256sum mydata.dat | cut --bytes -64| xxd --revert --plain | base64
34PAmeNpBEbY7kQBvFK+JY2mtu2BkoieWIKF0zz7vrY=

Put it onto S3:

$ aws s3api put-object --bucket <your-bucket> --key mydata.dat --body mydata.dat --checksum-sha256 '34PAmeNpBEbY7kQBvFK+JY2mtu2BkoieWIKF0zz7vrY='
{
    "ETag": "\"47b18dae3b998a0e8ea25df8556e90e2\"",
    "ChecksumSHA256": "34PAmeNpBEbY7kQBvFK+JY2mtu2BkoieWIKF0zz7vrY=",
    "ServerSideEncryption": "AES256"
}

Double check that it wasn't a coincidence and try a valid, but incorrect SHA256 checksum:

$ aws s3api put-object --bucket <your-bucket> --key mydata.dat --body mydata.dat --checksum-sha256 'FSlsvXVl1rNYP8/h2SJGy+J72bfFhieEx48RdsO2IrE='
An error occurred (BadDigest) when calling the PutObject operation: The SHA256 you specified did not match the calculated checksum.

So my conclusion is: the Amazon S3 SHA-256 checksum is equal to the usual SHA256 algorithm.
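
(For reference, the same single-part check in Python rather than shell; a sketch using only the standard library:)

import base64, hashlib

with open("mydata.dat", "rb") as f:
    digest = hashlib.sha256(f.read()).digest()

# For a single-part put-object this matches the ChecksumSHA256 value S3 reports.
print(base64.b64encode(digest).decode())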

sir-sigurd commented 5 months ago

It's also bigger than 16MB to make sure it's a multipart upload, which potentially could also trigger different hashing.

But put-object doesn't use multipart upload, for multipart upload you have to use create-multipart-upload, upload-part and complete-multipart-upload.

IMO the easiest way to try S3 multipart checksums is to upload a file using the S3 console, enabling "Additional checksums" under "Properties". I think the file should be >= 32 MiB so it starts using multipart upload. After that you can inspect the resulting checksum and the checksums of individual parts with get-object-attributes.

vmx commented 5 months ago

Thanks @sir-sigurd for the information, I don't have much AWS knowledge.

Let me try it again, making sure it's a multipart upload.

Allocate a file > 32MiB:

$ fallocate --length 34567890 mydata.dat

Upload the file through the AWS console in the browser (as it's easier than from the CLI). Then get the SHA-256 checksum of it:

$ aws s3api get-object-attributes --bucket <your-bucket> --key mydata.dat --object-attributes 'Checksum'
{
    "LastModified": "2024-02-16T10:27:10+00:00",
    "Checksum": {
        "ChecksumSHA256": "eS1aSUoSnbLv53dDOSSjmhilAUkzfJsEiZKg3+lCjBc="
    }
}

Check what the SHA-256 hash of the local file is:

$ sha256sum mydata.dat | cut --bytes -64| xxd --revert --plain | base64
kcqYuUA3ieUC27HjC5ikS8/J5Av3dPbqjhqveNtOWXs=

They don't match. Can we reproduce that hash somehow? Let's try the multipart upload with the CLI again.

Start a multipart upload, with SHA-256 checksums enabled:

$ aws s3api create-multipart-upload --bucket <your-bucket> --key mydata.dat --checksum-algorithm SHA256
{
    "ServerSideEncryption": "AES256",
    "ChecksumAlgorithm": "SHA256",
    "Bucket": <your-bucket>",
    "Key": "mydata.dat",
    "UploadId": "yUqjwbmY8qCHkR6vkIK3rnebBpcg0xpQOCS5oypnbXxIS.Z9UC2UbFWgUEZL6dbSFjlX4_1k30pg6yFrJU5P3jPUKuZbIYH352Ws9FfYBHr4oeDn4hIzPB9ORD24omGL"
}

Split the original file into two pieces named mydata_01.dat and mydata_02.dat:

$ split --bytes 23456789 --numeric-suffixes=1 --additional-suffix '.dat' mydata.dat mydata_

Let's check the SHA-256 hashes of the individual parts:

$ sha256sum mydata_01.dat | cut --bytes -64| xxd --revert --plain | base64
34PAmeNpBEbY7kQBvFK+JY2mtu2BkoieWIKF0zz7vrY=
$ sha256sum mydata_02.dat | cut --bytes -64| xxd --revert --plain | base64
JYEm3isfHSWZVlgLxgQyp670tv7Zq6e1OGC41HVxOgY=

Upload the first part:

$ aws s3api upload-part --bucket <your-bucket> --key mydata.dat --part-number 1 --body mydata_01.dat --upload-id 'yUqjwbmY8qCHkR6vkIK3rnebBpcg0xpQOCS5oypnbXxIS.Z9UC2UbFWgUEZL6dbSFjlX4_1k30pg6yFrJU5P3jPUKuZbIYH352Ws9FfYBHr4oeDn4hIzPB9ORD24omGL' --checksum-algorithm SHA256
{
    "ServerSideEncryption": "AES256",
    "ETag": "\"47b18dae3b998a0e8ea25df8556e90e2\"",
    "ChecksumSHA256": "34PAmeNpBEbY7kQBvFK+JY2mtu2BkoieWIKF0zz7vrY="
}

And the second part:

$ aws s3api upload-part --bucket <your-bucket> --key mydata.dat --part-number 2 --body mydata_02.dat --upload-id 'yUqjwbmY8qCHkR6vkIK3rnebBpcg0xpQOCS5oypnbXxIS.Z9UC2UbFWgUEZL6dbSFjlX4_1k30pg6yFrJU5P3jPUKuZbIYH352Ws9FfYBHr4oeDn4hIzPB9ORD24omGL' --checksum-algorithm SHA256
{
    "ServerSideEncryption": "AES256",
    "ETag": "\"b677de10e5b49d46d81948168383c3dc\"",
    "ChecksumSHA256": "JYEm3isfHSWZVlgLxgQyp670tv7Zq6e1OGC41HVxOgY="
}

In order to complete the upload we need to create a JSON file which contains the parts:

echo '{"Parts":[{"ChecksumSHA256":"34PAmeNpBEbY7kQBvFK+JY2mtu2BkoieWIKF0zz7vrY=","ETag":"47b18dae3b998a0e8ea25df8556e90e2","PartNumber":1},{"ChecksumSHA256":"JYEm3isfHSWZVlgLxgQyp670tv7Zq6e1OGC41HVxOgY=","ETag":"b677de10e5b49d46d81948168383c3dc","PartNumber":2}]}' > mydata_parts.json

Now finish the multipart upload:

$ aws s3api complete-multipart-upload --multipart-upload file://mydata_parts.json --bucket <your-bucket> --key mydata.dat --upload-id 'yUqjwbmY8qCHkR6vkIK3rnebBpcg0xpQOCS5oypnbXxIS.Z9UC2UbFWgUEZL6dbSFjlX4_1k30pg6yFrJU5P3jPUKuZbIYH352Ws9FfYBHr4oeDn4hIzPB9ORD24omGL'
{
    "ServerSideEncryption": "AES256",
    "Location": "https://<your-bucket>.s3.eu-north-1.amazonaws.com/mydata.dat",
    "Bucket": <your-bucket>,
    "Key": "mydata.dat",
    "ETag": "\"ad31d1c7112c160438ff0ea063ec2c75-2\"",
    "ChecksumSHA256": "XtqASA/PuE6k06Ccd7uxnoykSwypoulhrxFkXi2Y1qY=-2"
}

Get the checksum again to double-check:

$ aws s3api get-object-attributes --bucket <your-bucket> --key mydata.dat --object-attributes 'Checksum'
{
    "LastModified": "2024-02-16T11:19:25+00:00",
    "Checksum": {
        "ChecksumSHA256": "XtqASA/PuE6k06Ccd7uxnoykSwypoulhrxFkXi2Y1qY="
    }
}

That is different from the SHA-256 hash of the file:

$ sha256sum mydata.dat | cut --bytes -64| xxd --revert --plain | base64
kcqYuUA3ieUC27HjC5ikS8/J5Av3dPbqjhqveNtOWXs=

So the individual parts match, but the hash of the whole file does not. Let's see if the hash of the file changes if we upload the parts in reverse order.

$ aws s3api create-multipart-upload --bucket <your-bucket> --key mydata_reverse.dat --checksum-algorithm SHA256
{
    "ServerSideEncryption": "AES256",
    "ChecksumAlgorithm": "SHA256",
    "Bucket": "<your-bucket>",
    "Key": "mydata_reverse.dat",
    "UploadId": "LDBFP0WvRixFCm9mcwgWUa7xlcG87hRgg2GhmQKwEPqQJHfmmzjtzgOmi7kY5YFqqFHLbOAmr01LbUV4oE38DWyWWafJGS6e5bblaedhNLvDvmpyFwQjbGNdpg8A_jwI"
}
$ aws s3api upload-part --bucket <your-bucket> --key mydata_reverse.dat --part-number 1 --body mydata_02.dat --upload-id 'LDBFP0WvRixFCm9mcwgWUa7xlcG87hRgg2GhmQKwEPqQJHfmmzjtzgOmi7kY5YFqqFHLbOAmr01LbUV4oE38DWyWWafJGS6e5bblaedhNLvDvmpyFwQjbGNdpg8A_jwI' --checksum-algorithm SHA256
{
    "ServerSideEncryption": "AES256",
    "ETag": "\"b677de10e5b49d46d81948168383c3dc\"",
    "ChecksumSHA256": "JYEm3isfHSWZVlgLxgQyp670tv7Zq6e1OGC41HVxOgY="
}
$ aws s3api upload-part --bucket <your-bucket> --key mydata_reverse.dat --part-number 2 --body mydata_01.dat --upload-id 'LDBFP0WvRixFCm9mcwgWUa7xlcG87hRgg2GhmQKwEPqQJHfmmzjtzgOmi7kY5YFqqFHLbOAmr01LbUV4oE38DWyWWafJGS6e5bblaedhNLvDvmpyFwQjbGNdpg8A_jwI' --checksum-algorithm SHA256
{
    "ServerSideEncryption": "AES256",
    "ETag": "\"47b18dae3b998a0e8ea25df8556e90e2\"",
    "ChecksumSHA256": "34PAmeNpBEbY7kQBvFK+JY2mtu2BkoieWIKF0zz7vrY="
}
$ echo '{"Parts":[{"ChecksumSHA256":"JYEm3isfHSWZVlgLxgQyp670tv7Zq6e1OGC41HVxOgY=","ETag":"b677de10e5b49d46d81948168383c3dc","PartNumber":1},{"ChecksumSHA256":"34PAmeNpBEbY7kQBvFK+JY2mtu2BkoieWIKF0zz7vrY=","ETag":"47b18dae3b998a0e8ea25df8556e90e2","PartNumber":2}]}' > mydata_reverse_parts.json
$ aws s3api complete-multipart-upload --multipart-upload file://mydata_reverse_parts.json --bucket <your-bucket> --key mydata_reverse.dat --upload-id 'LDBFP0WvRixFCm9mcwgWUa7xlcG87hRgg2GhmQKwEPqQJHfmmzjtzgOmi7kY5YFqqFHLbOAmr01LbUV4oE38DWyWWafJGS6e5bblaedhNLvDvmpyFwQjbGNdpg8A_jwI'
{
    "ServerSideEncryption": "AES256",
    "Location": "https://<your-bucket>.s3.eu-north-1.amazonaws.com/mydata_reverse.dat",
    "Bucket": <your-bucket>",
    "Key": "mydata_reverse.dat",
    "ETag": "\"1f4348ad0ad0456aa78ae6e3d1717dc6-2\"",
    "ChecksumSHA256": "AonVPvSgK/alROsay8U8FU7XCNoccOG6CdgJObEgZ2E=-2"
}
$ aws s3api get-object-attributes --bucket <your-bucket> --key mydata_reverse.dat --object-attributes 'Checksum'
{
    "LastModified": "2024-02-16T12:47:27+00:00",
    "Checksum": {
        "ChecksumSHA256": "AonVPvSgK/alROsay8U8FU7XCNoccOG6CdgJObEgZ2E="
    }
}
$ cat mydata_02.dat mydata_01.dat | sha256sum | cut --bytes -64| xxd --revert --plain | base64
kcqYuUA3ieUC27HjC5ikS8/J5Av3dPbqjhqveNtOWXs=

This means that the checksum of a file uploaded in multiple parts depends on the way the file is split. The exact same bytes (of the final file) can have different hashes. It also means the result does not depend on any block size the AWS SHA-256 implementation uses internally; the SHA-256 hashes of the individual parts are exactly as expected.

The last open question left for me is: how are those hashes of the full files calculated then? I suspect it's a hash of the concatenation of the hashes of the parts; let's try that:

$ (sha256sum mydata_01.dat | cut --bytes -64| xxd --revert --plain; sha256sum mydata_02.dat | cut --bytes -64| xxd --revert --plain ) | sha256sum | cut --bytes -64| xxd --revert --plain | base64
XtqASA/PuE6k06Ccd7uxnoykSwypoulhrxFkXi2Y1qY=
$ (sha256sum mydata_02.dat | cut --bytes -64| xxd --revert --plain; sha256sum mydata_01.dat | cut --bytes -64| xxd --revert --plain ) | sha256sum | cut --bytes -64| xxd --revert --plain | base64            
AonVPvSgK/alROsay8U8FU7XCNoccOG6CdgJObEgZ2E=

Yes, that's exactly it.
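
(The same computation in Python, a sketch mirroring the shell pipelines above:)

import base64, hashlib

def sha256_file(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).digest()

# SHA-256 over the concatenated part digests, base64-encoded the way S3 reports it.
combined = sha256_file("mydata_01.dat") + sha256_file("mydata_02.dat")
print(base64.b64encode(hashlib.sha256(combined).digest()).decode())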

To conclude: I don't think we should introduce a new multihash code, as the AWS SHA-256 hashing is algorithmically equal to the existing SHA-256 hash we already have.

drernie commented 5 months ago

Wow, thank you for diving into this so deeply! I really appreciate it.

Let me explain our use case, and hopefully you can tell me if there's a better option.

Our product creates "manifests" that associate an S3 URI with a content-hash. We need to support different types of content hashes, which must be clearly labeled so customers can independently verify that the URI matches its hash.

We want to use multihash encodings so customers have a consistent way to interpret and execute hashing as a simple string (versus a complex struct with a separate, non-standard type field). Note that these manifests are archival documents that are shared throughout our ecosystem, e.g. potentially in FDA filings, so it is important that they be globally interpretable.

The specific algorithm we are using is probably better characterized as:

s3-sha2-256-boto, multihash, 0xb511, draft, Boto3-compatible implementation of Amazon S3's parallel SHA2-256 checksums

That is, we want to signal that this is encoded using the precise block-chaining strategy used by Amazon's boto3 Python library.

Can you suggest an alternate way to satisfy that use case?

nl0 commented 5 months ago

To conclude: I don't think we should introduce a new multihash code, as the AWS SHA-256 hashing is algorithmically equal to the existing SHA-256 hash we already have.

  1. it is a hash of hashes (in case of multipart uploads), so it's not algorithmically equal to plain sha256
  2. we'd like to encode the specific part size selection algorithm (the one currently used by default by botocore), so that the hashes can be easily reproduced

rvagg commented 5 months ago

Are you intending on making CIDs from these, or is this purely about identifying the hashing algorithm?

Unfortunately you're hitting on a subtle hinge point in the multiformats and IPLD definitions that has been causing pain for some time. It goes something like this (apologies for the length; this is an evolving philosophical point):

  1. A hash algorithm is a one-way function that takes input bytes and produces a digest
  2. Viewed from a content-addressing perspective, a hash function has a reverse, in that it's an "address" that we can use to identify, i.e. look up, those original bytes.

You can see how this two-way perspective on hash functions is driven by the way we use CIDs: they're a way to get back to the original bytes.

But in your case we have a bit of a conflict, one that also comes up in other places: the hash function in question is already defined, and we can identify the input bytes and output digest. It's SHA2-256 over a set of bytes which just happen to be a concatenation of other hash digests, but that can be seen as incidental; the hash function doesn't care.

But what's the reverse operation if we made a CID out of this thing? It would take you to a "block" (in IPLD terms) that yields 2 or more other CIDs, each of which also use SHA2-256. Following those CIDs we get to raw bytes, which are chunks of your original file.

A fairly standard merklisation. In fact, I'd be interested to see what AWS does when you expand the number of chunks: does it do a straight concatenation or do they start making a tree out of the hashes?

But here's the awkward part which you're bumping into: all you care about is inputs and outputs. The input is the raw original bytes and the output is a digest. As far as you're concerned it's a one-way hash function that isn't SHA2-256 but just happens to use SHA2-256, basically in the same way that IPFS uses DAG-PB and UnixFS to pick apart a file, make a merkle tree out of it and present a root that also happens to use SHA2-256. The default answer, that you're going to get from most of us, is that those details matter; it doesn't just "happen to ...", that's part of the reverse process of turning this thing into an "address".

A similar problem that I personally use as a reference to think about these things is the Bitcoin transaction merkle. It has a single root digest, that is placed into the block header, but it's a binary merkle of SHA2-256(SHA2-256(bytes)) all the way down to potentially hundreds (or thousands?) of transactions, each of which can be represented as a sequence of bytes. So from a content-addressing lens where we are always thinking about going backward we have two choices:

  1. Strictly binary merkle, the hash function is dbl-sha2-256 (but also note how this whole thing opens the can of worms about that as a hash function, maybe it's just a sha2-256 that is an address for another address that's a sha2-256???).
    • "Navigating" back down from the root to transactions means finding every single intermediate binary merkle node where you have two more addresses and you pick your path.
    • In this way, you can just address 1 or more transactions by providing a root and a path. Neat, but also really tedious because it's a long path through a binary merkle.
  2. A hash function itself, called binary-merkle, that addresses all of the transactions in one go (whoops, that was removed a couple of years ago https://github.com/multiformats/multicodec/pull/270, but maybe it shouldn't have been?).
    • "Navigating" back down from the root to the transactions means you get a sequence of bytes that represents the whole list of transactions, all of them!
    • In this way, you must always be addressing all the transactions, there's no in-between, your IPLD block is potentially huge and you get everything. You'd use within-block selectors to identify pieces but there's no slicing and dicing at decoding time.

There's also other subtleties that make this question complicated, like in the case of Filecoin where we opted for the first for "Piece CID" (CommP) but didn't really expect anyone to be doing this navigating, so neglected the fact that you can never identify when you're at the bottom layer of the stack, so you don't know what it means to be at the leaves! Maybe they're still hashes and you can't find them?? In the Bitcoin case you have one benefit of knowing that the leaves are not going to be exactly the length of two SHA2-256 digests! But that's a bit of a crappy termination clause tbqh, because as it turns out, there are cases where it is and you just have to try to decode to find out: https://github.com/ipld/js-bitcoin/blob/cf51aae657cbddef5074539496fc9a8dbad2ded9/src/bitcoin-tx.js#L186-L199. So, in the Filecoin case, there's a case being made to flip the definition and reinterpret the tree as a hash function itself: https://github.com/filecoin-project/FIPs/discussions/759, although this partly hinges on the desire to get the depth of the tree into the multihash somehow too.


In summary, I've become more sympathetic over time to the "it's just all a hash function of the base bytes" argument, even when viewing the digest as a navigable content address, as long as you're willing to accept the consequence of the address leading you to a really large blob (especially in your case!). But you only have one integer to play with, and the question then becomes whether you can squeeze the information you need into that signal. The question here would then center on chunk ordering, and also apparently something to do with chunk size? If those things are stable, such that the same operation is always going to result in the same digest regardless of upload network conditions, then maybe there's a case here.

But that's just me too, there's a few people to argue with here about this so we'd have to see what they think and what I might be missing.

dimaryaz commented 5 months ago

Let me see if I can answer at least some of the questions...

We don't have an explicit goal of making CIDs - though in some cases, we do store the content using its hash as the file name, so maybe that's the same thing? But the main goal is to identify the hashing algorithm.

Conflict from using a hash that's already defined: this is going to be a pretty weak argument, but AWS's algorithm actually produces a SHA256 hash concatenated with the number of parts used to generate it - so it's different from just a SHA256 hash. (Technically, it's a base64-encoded SHA256 hash, a dash, and the number of parts - e.g., AonVPvSgK/alROsay8U8FU7XCNoccOG6CdgJObEgZ2E=-2 - but of course we could encode that more efficiently.) That said, the number of parts is not particularly useful: it's neither necessary to avoid collisions, nor sufficient to reproduce the hash without knowing the exact part sizes.
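
(For example, splitting such a value back into its digest and part count could look roughly like this; a sketch, noting that the "-N" suffix appears on the multipart responses shown earlier in this thread:)

import base64

checksum = "AonVPvSgK/alROsay8U8FU7XCNoccOG6CdgJObEgZ2E=-2"
b64_digest, _, parts = checksum.rpartition("-")
digest = base64.b64decode(b64_digest)  # 32 raw SHA-256 bytes
num_parts = int(parts)                 # 2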

Expanding the number of chunks: it's always a straight concatenation. There's actually a limit of 10,000 chunks, so there's probably no need for a tree.

Chunk ordering, chunk size, network conditions: AWS itself allows using (almost) any chunk size, including different sizes for different chunks, which would result in different hashes. However, our goal is to standardize on boto3's default of 8MB chunks, which would make the hash always be the same for the same input. Chunks can be uploaded in any order (in fact, the whole point of this algorithm is to support parallel uploads). Each upload comes with a part number, which lets AWS reassemble it correctly on their end.
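
(A sketch of that standardized computation in Python, assuming a fixed 8 MiB chunk size as described above and leaving aside the small-input and 10,000-part questions discussed later:)

import base64, hashlib

CHUNK_SIZE = 8 * 1024 * 1024  # the 8 MiB default mentioned above

def chunked_sha256(path):
    part_digests = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            part_digests += hashlib.sha256(chunk).digest()
    # Hash of the concatenated per-chunk digests, base64-encoded like S3's ChecksumSHA256.
    return base64.b64encode(hashlib.sha256(part_digests).digest()).decode()

print(chunked_sha256("mydata.dat"))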

rvagg commented 5 months ago

We'll see what @vmx says but I buy this when viewed from "code for the function that makes a hash digest of my source bytes" where "source bytes" is from the user perspective. Especially if you're limiting it to boto and boto is predictable both in its chunk size and the order in which it sends it, and you have "boto" in the name already.

vmx commented 5 months ago

@dimaryaz thanks for the explanation of the problem you're trying to solve. I (just as @rvagg) also came to the conclusion that in some cases it makes sense to treat something that uses an existing hash function in a different way (like, in your case, hashing the hashes of the parts) as a new hash function that gets a new codec.

What I want to prevent is something like all those Skein or Blake entries, where you have a long list of things for different sizes. So if you would ask for a hundred possible different chunk sizes, it would be a no from me. But in your case it's really just a single one, which sounds good to me.

2. View from a content-addressing perspective, a hash function has a reverse in that it's an "address" that we can use to identify, i.e. look up, those original bytes.

This is the case I personally care about most. I'd like to be able to reproduce the hash by just knowing the identifier and the input bytes. Hence it would be great if you could write the algorithm down somewhere we can link to, so that people can implement it themselves and also verify their links. Maybe the description could also include more information, e.g. the fixed chunk size.

aschmahmann commented 5 months ago

What I want to prevent is something like all those Skein or Blake entries, where you have a long list of things for different sizes.

👍, the skein and blake entries are ridiculous.

So if you would ask for a hundred possible different chunk sizes, it would be a no from me. But in your case it's really just a single one, which sounds good to me.

It's two here (not so bad), although my understanding from this issue is that the number of codes is basically a function of how Amazon (not the entity requesting the codes) changes the defaults in their libraries. If they change the chunk size tomorrow to be 13MiB by default then we'll be getting another request for 13MiB chunk sizes.

Not sure what others think but I'm inclined to recommend something along the lines of my suggestion from here https://github.com/multiformats/multicodec/issues/342#issuecomment-1946310305 where only one code is needed and there's no need to come back for more if Amazon changes the defaults in their python library (or if someone makes a Java library with different defaults that becomes popular).

If others feel two codes is fine, then I can live with that 🤷

Especially if you're limiting it to boto and boto is predictable both in its chunk size and the order in which it sends it, and you have "boto" in the name already.

Note: boto is a specific library not controlled by the people who are applying for the codec. It'd probably be better to call it something more related to what it is and then have a link to a spec in the table metadata.

aschmahmann commented 5 months ago

Following up on some of @rvagg's comments on the general case:

This isn't the only example of this kind of merkle-tree hash function for files. For example, BitTorrent-v2 files could be treated as either trees with a SHA-256 root or as a BitTorrent-v2 hash. So this is unlikely to be the last time someone shows up requesting a code for a merkle-tree built of primitives already in the table.

I wonder if for now reasonable guidance would be:

Some downsides I see here are:

FWIW you could argue based on things like the bmt code(s) that we should, based on this request, already be specifying both chunking and branching factor (in this case 5/8MiB chunks and 10k branching) however skipping this seems fine given that AFAICT nobody is using bmt this way. It also allows us to kick the can a little bit down the road here regarding other hash functions (e.g. would we use recursive multihash codes to say something was 5MiB chunks, 10k branching, and used SHA2-256 rather than SHA1?). Given that Amazon supports more than just SHA-256 this could plausibly come up here, but delaying and following the guidance above seems reasonable to me.

vmx commented 5 months ago

Note: boto is a specific library not controlled by the people who are applying for the codec. It'd probably be better to call it something more related to what it is and then have a link to a spec in the table metadata.

Agreed. Something that describes that it's hashes of hashes with a certain chunk size.

drernie commented 5 months ago

Note: boto is a specific library not controlled by the people who are applying for the codec. It'd probably be better to call it something more related to what it is and then have a link to a spec in the table metadata.

Agreed. Something that describes that it's hashes of hashes with a certain chunk size.

Excellent point! Thank you all for the thoughtful feedback. That approach makes a lot of sense.

I've tried to formally document the exact algorithm we use here:

https://github.com/quiltdata/quilt/blob/s3_sha256/docs/PARALLEL_CHECKSUMS.md

And updated my registration request to:

sha2-256-parallel-8,            multihash,      0xb518,         draft,      Parallel hash of hashes starting with 8MiB chunks

Does that clarify things sufficiently?

I haven't seen links to specs in the table, but I can certainly add one (once we have a final URL for it).

rvagg commented 5 months ago

Sorry to be picky and stretch this out @drernie (thanks for being patient btw), but "parallel" seems to be an implementation detail that doesn't leak into the hash function itself? So maybe the name isn't quite right. concat, chunked, merkle, or something like that? The description could be something like "Hash of concatenated SHA2-256 digests of 8MiB source chunks".

Also please do consider @aschmahmann's suggestion above about combining length and hash function if you think you may end up needing extensibility, so we can avoid number explosion in the table later on. You may have better insight into the way Amazon is likely to change this so it's probably your call I'd say. But you could end up with an identifier that's two varints together: 0xb518 + 0x08, so you'd have 4 bytes as your identifier: 0x98ea0208, and your software just knows to always decode two in a row to figure out the hash function and length. Or if you're mixing hash functions in this scheme and this is the only one that has a length, then you just branch on that: read the first varint, and if you find a 0xb518 continue on to read a second. But also, perhaps it's the uniqueness that counts and you never need to decode them. Depending on your use-case.
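
(To make the varint arithmetic concrete, a quick sketch using the same unsigned-varint encoding as multiformats:)

def encode_uvarint(n: int) -> bytes:
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

# varint(0xb518) is 98 ea 02 and varint(0x08) is 08, i.e. 98ea0208 as four bytes.
print((encode_uvarint(0xb518) + encode_uvarint(0x08)).hex())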

drernie commented 5 months ago

Sorry to be picky and stretch this out @drernie (thanks for being patient btw),

As my officemate used to say, "abuse is a form of love" :-). Thanks for caring!

So maybe the name isn't quite right. concat, chunked, merkle, or something like that? The description could be something like "Hash of concatenated SHA2-256 digests of 8MiB source chunks".

Deal:

sha2-256-chunked, multihash, 0xb510, draft, Hash of concatenated SHA2-256 digests of 8 MiB source chunks

if you think you may end up needing extensibility

Ah. That is the key insight. WE don't need extensibility. S3 may choose to add or change implementations, but that is not OUR problem. Quilt has two implementations that use this, and if Amazon wants to stay compatible with it (and their past self), great, but that's a bonus not a requirement.

I've switched the code point to "0xb510" to avoid over-indexing on the 8 MiB. Ready to call this done and move on?

nl0 commented 5 months ago

sha2-256-chunked, multihash, 0xb510, draft, Hash of concatenated SHA2-256 digests of 8 MiB source chunks

nit: the chunks are not necessarily 8 MiB, so the description feels confusing

drernie commented 5 months ago

I originally had:

Hash of concatenated SHA2-256 digests of 8*2^n MiB source chunks (n>0 if number of chunks > 10K)

Is that too verbose?

nl0 commented 5 months ago

Personally I think that's better, though I'm not sure where this complexity/verbosity should ultimately end up (like, we could make the description in the table more concise and just point to our docs specifying the algo in detail).

Also, another nit: shouldn't we point out that if the input is smaller than the threshold (8MiB), then it's just a plain SHA digest of the input?

vmx commented 5 months ago

if you think you may end up needing extensibility

Ah. That is the key insight. WE don't need extensibility.

If that's the case, then I'd bake the 8MiB into the name, as it sounds like you'd always want to keep it at 8MiB; this way it stays reproducible and doesn't change arbitrarily.

sha2-256-chunked, multihash, 0xb510, draft, Hash of concatenated SHA2-256 digests of 8 MiB source chunks

nit: the chunks are not necessarily 8 MiB, so the description feels confusing

What about "Hash of concatenated SHA2-256 digest, 8 MiB chunk size"

For me the key detail is "chunk size", which implies that you try to chunk the data at 8MiB boundaries and in the end take whatever is left. Please let me know if that's just my understanding of "chunk size" or if others feel the same.

nl0 commented 5 months ago

For me the key detail is "chunk size", which implies that you try to chunk the data at 8MiB boundaries and in the end take whatever is left. Please let me know if that's just my understanding of "chunk size" or if others feel the same.

chunk size starts with 8MiB, but the number of chunks is limited to 10k, so if the total size is > 8MiB*10k, we double the chunk size until it fits, so that the chunks are uniform (except the last one) and don't exceed the selected chunk size.

also, if the input size is less than 8MiB, we don't chunk the data at all, but just hash it directly.
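
(Putting both rules together, a sketch of the selection logic as described at this point in the thread; note the small-input special case gets dropped later in the discussion:)

import base64, hashlib

BASE_CHUNK = 8 * 1024 * 1024  # 8 MiB
MAX_PARTS = 10_000

def select_chunk_size(total_size: int) -> int:
    # Start at 8 MiB and keep doubling until the data fits into at most 10,000 chunks.
    chunk = BASE_CHUNK
    while total_size > chunk * MAX_PARTS:
        chunk *= 2
    return chunk

def checksum(data: bytes) -> str:
    if len(data) < BASE_CHUNK:
        # Below the threshold: plain SHA-256 of the input.
        digest = hashlib.sha256(data).digest()
    else:
        chunk = select_chunk_size(len(data))
        part_digests = b"".join(hashlib.sha256(data[i:i + chunk]).digest()
                                for i in range(0, len(data), chunk))
        digest = hashlib.sha256(part_digests).digest()
    return base64.b64encode(digest).decode()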

vmx commented 5 months ago

chunk size starts with 8MiB, but the number of chunks is limited to 10k, so if the total size is > 8MiB*10k, we double the chunk size until it fits, so that the chunks are uniform (except the last one) and don't exceed the selected chunk size.

Thanks for the explanation (sorry I should've read https://github.com/quiltdata/quilt/blob/s3_sha256/docs/PARALLEL_CHECKSUMS.md carefully).

What about something like "Hash of concatenated SHA2-256 digests of 8*2^n MiB source chunks (n = data_size / 10000)"?

drernie commented 5 months ago

Sure:

sha2-256-chunked,               multihash,      0xb510,         draft,      Hash of concatenated SHA2-256 digests of 8*2^n MiB source chunks; n = source_size / (10000 * 8MiB)

We also changed our proposal to match the description (and also the name) by dropping the caveat of not double-hashing for small files :-)

vmx commented 5 months ago

To me it sounds good now (I left one more comment on the PR).

The name "sha2-256-chunked" works for me, though I'm not sure if it's too generic.

@rvagg @aschmahmann is anything missing?

drernie commented 5 months ago

One last question, since you've been so helpful. :)

What is the proper hash for an empty file? How do other "chunked" or tree algorithms handle that?

The original line of thinking is that if the input size is under the threshold, we take the whole input as the one and only chunk, and in the case of an empty file this chunk is empty (has length 0). However, this is different from what we would get from a naive implementation (chunks = [] vs chunks = [b'']), so we welcome an outside opinion...

rvagg commented 5 months ago

I'm fine with this; the name is fairly generic for something with such specific rules, but there's no competitor yet for the name so we can deal with that later if it comes up!

Regarding that last question, mostly I see hash('') in these kinds of places; basically, make few assumptions or special cases. Hash functions can (usually) deal with empty strings, and the 1:1 relationship between the empty string and its hash is usually important and useful, so just go with it. I think, in the way you're expressing it, I'd just go with chunks = [b''] and treat the threshold range as [0,threshold) rather than (0,threshold) with a special case for 0.
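
(Concretely, one reading of that suggestion: the empty input becomes a single empty chunk, so under the "hash of concatenated chunk digests" convention adopted above the checksum would be SHA-256 over one SHA-256 of the empty string. A sketch, not necessarily what the final spec settles on:)

import base64, hashlib

# chunks = [b''] rather than chunks = []
empty_chunk_digest = hashlib.sha256(b"").digest()
checksum = hashlib.sha256(empty_chunk_digest).digest()
print(base64.b64encode(checksum).decode())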

vmx commented 5 months ago

This issue is closed by https://github.com/multiformats/multicodec/pull/343. Thanks to everyone involved for being patient enough to really get to the root of this.

drernie commented 5 months ago

Yes, you guys rock. One of my favorite PRs of all time!
