multiformats / multihash

Self describing hashes - for future proofing
https://multiformats.io/multihash/
MIT License
887 stars 113 forks source link

Size limit of identity hash #130

Open vmx opened 4 years ago

vmx commented 4 years ago

As I've been working on the Rust implementation of Multihash, it came up that the identity hash currently doesn't specify any limits. From an optimization perspective (this is why it came up in Rust), but also from a security perspective I think it would make sense to specify an upper bound for its size.

I personally would take a quite low limit which is similar to what current hash functions have as length. So perhaps something around 64 bytes?

mikeal commented 3 years ago

We have recommendations for block size limits but ultimately it’s up to different protocol implementations to enforce them. I don’t see a compelling reason to recommend or try to enforce any limits beyond that. Identity CIDs live in blocks and can accumulate quickly to exhaust block limits. Given that we have no real enforcement mechanism let’s just leave it as is.

Winterhuman commented 2 years ago

Adding to the discussion, why limit the size of identity multihashes when you could limit the size of the things that contain identity multihashes? Here's my thinking:

Beyond DNS, DAGs, and IPFS records, identity multihashes don't see much use, and from there the size restrictions should be determined by the developer that intends to use them outside these typical set of use-cases.

As for developers making multihash parsers, the spec should include a minimum identity multihash size they must support (maybe make it equal to the largest non-identity multihash available?), however, it should be up to the developer to decide on an upper limit depending on the performance/security considerations that are being taken for the particular implementation.

What do you think?

rvagg commented 2 years ago

Hey, I forgot about this discussion (thanks for reviving it @Winterhuman) but I've had my head in identity CIDs of late so this is relevant.

We're dealing with some Filecoin retrieval issues involving identity CIDs which have been made by various tooling, most notably Lotus itself when creating UnixFS data for you with an import! https://github.com/filecoin-project/lotus/blob/088bf56f2a9a9ad6ae031b9af867cefec8ed4245/lib/unixfs/filestore.go#L31-L42

I've seen this code copied in a couple of places, so we must have a cultural meme that's stuck with some folks that identity CIDs are a good idea when the blocks are <= 126 bytes. So this is happening, in the wild, and any limit should at least take that into account.

The retrieval problem with Filecoin is interesting because it involves DAG root CID that are identities and are skipped by various parts of the storage and indexing layer because they are identity CIDs. But because they're overlooked, when a user shows up wanting to retrieve a DAG that starts as an identity CID (typically with multiple links in it), the retrieval fails because it can't find a stored piece with that CID in it.

So a top-layer solution (there are other solutions to be explored focused on indexing, but for now we just want to make these retrievals work) is to accept identity CIDs and handle them as if they are a request for a piece containing all of the links in that CID and then start a retrieval based on that from a matching piece.

But in the process of implementing that we have to make some decisions about limits, because we're not just dealing with the <=126 byte case, we're essentially opening up another retrieval mode where a user can craft their own identity CID containing lots of links and be requesting that a storage provider find a piece that happens to have all of those links in it. Having it open ended isn't really ideal because that could be a lot of work and from a client perspective just making a query is cheap and we don't want to make it expensive for storage providers.

I've come up with some limits, but they're kind of arbitrary, but they at least feel like a compromise between "don't make me do too much work" and "this could be a useful mode for multi-DAG or just multi-block retrieval": https://github.com/filecoin-project/go-fil-markets/pull/747

You can pack 32 average sized CIDv1s into a dag-cbor identity CID for around 1320 bytes.

Too much? Too little? :shrug: But this is a specific path, so the limits here don't necessarily reflect what a reasonable limit might be for other uses.

Winterhuman commented 2 years ago

@rvagg It's definitely a use-case to keep in mind, and it also helps demonstrate one of my points.

It would be much easier to just give a minimum max size of identity multihash that all multihash parsers must support than giving a definitive maximum size. For instance, your use-case is using identity multihashes for DAG roots, and so it has a large limit, however, something like Kubo would almost certainly support only up to the largest non-identity multihash to mitigate attack vectors.

Just to discuss your idea further, you want to impose a 32 link maximum per identity CID, however, what would stop people from using identity CIDs as links, with further links in them, to go past the 32 link limit? I guess the 2048 bytes maximum would prevent it going further, but, it's something I wanted to bring up.

rvagg commented 2 years ago

Just to discuss your idea further, you want to impose a 32 link maximum per identity CID, however, what would stop people from using identity CIDs as links, with further links in them

Well, I added support for recursive identity CIDs here: https://github.com/filecoin-project/go-fil-markets/pull/747/commits/ffdf335bb7660864086eac682cebc4cff725ecbd

Because the recursive has to be inside the parent identity CID, you hit the byte limit pretty quick, and the total 32-link max is taken over the parent + recursives it contains. Any identity CID that an identity CID contains must be included in it, so the case is covered fairly neatly here.

But to be clear - recursive identity CIDs are kind of insane, and even supporting them seems a little insane.

There's some additional weirdness when it comes to Filecoin re identity CIDs that make it a tricky topic, mainly focused around the relationship between a DAG, the blocks in that DAG and the "piece" it's stored in. When you have an opaque blockstore that just returns blocks to you when you ask for them then a lot of those problems go away. But with Filecoin there are other factors that come in to play, so it's not opaque - do I have to unseal the piece to get you this block/DAG? do I have to charge you to send you this block/DAG? And identity CIDs essentially become identifiers—even if Filecoin were to just treat them as free retrievals, (a) do we want to force storage providers to answer random queries for "give me this identity CID"? and (b) what about DAGs that hang off such identity CIDs when you want to do a graph-based retrieval or when you're using an identity CID to identify some piece that you want to do a complete piece retrieval of.

Ugh. It's a mess and leaves me believing identity CIDs were a mistake.

ribasushi commented 2 years ago

But to be clear - recursive identity CIDs are kind of insane

This is objectively incorrect: any competently written dag assembler can naturally produce recursively-stacked identity CIDs: the input into the "decision function" of whether to wrap an identity is nothing but the size of the "block" based on the immediate linklist.

leaves me believing identity CIDs were a mistake.

None of your current troubles are specific to identity CIDs as a concept: everything stems from the fact that folks were/are trying to be clever about handling them in a special way through the stack. That is: identity CIDs are not special in any way. Treating them as such outside of transport is the actual grave mistake we should stop perpetuating.

Winterhuman commented 2 years ago

I do have one question about inline CIDs. What's the difference between setting a link to bafkqai vs having a "bytes": section in the DAG object? For example:

"Links" : [
   {
      "Hash" : {
         "/" : "bafkqailumvzxi2lom4qgm2lmmvsho2dbmrug6ylxnfugicq"
      },
      "Name" : "file1",
      "Tsize" : 25
   },
   {
      "bytes" : {
         "/" : "B43512VB124G42B4YT14GV14TGF1422 ... "
      },
      "Name" : "file2",
      "Tsize" : 36
   }
]

Is the difference practical? Conceptual? It's the one thing I've not quite understood when reading the whole conversation

rvagg commented 2 years ago

@Winterhuman in your example you have what looks like a dag-pb conforming block printed in dag-json. One reason to do bytes in a CID is that dag-pb is so strict with what it can include, it only has one place for bytes, the Data field, and everything else should be a link inside the Links list. So if you were to try and pass your data through a dag-pb encoder it'll fail. But you could sneak those bytes in via an identity CID and have them resolve to what would essentially be a raw block that appears to be a separate entity.

But outside of dag-pb, the question still stands—why use an identity CID to smuggle bytes if you could just include bytes? I suppose it comes down to your desired schema, like with dag-pb, perhaps you need something to be a link in that position, but instead of linking to a tiny raw block, you just inline it and save some space. The same could be true with other data forms too—maybe your application's schema wants a link to a block containing a list, but you have an empty list, encoding that as a dag-cbor empty list would be a very small identity CID and you'd end up saving space doing so (even with the de-dupe you might get from needing to do it multiple times).

There are other cases too, like this one that's always slightly puzzled me: https://github.com/filecoin-project/specs-actors/blob/d8d9867f68a3c299295efdc6d1b3421c9b63df57/actors/builtin/codes.go - the identifiers for the builtin actors for Filecoin are identity CIDs containing the versioned name of the actor as bytes in a raw block. My guess is that we've always anticipated needing to identify actors by the hash of their code (wasm), so since we're going to need links in that position, let's just make them links from the start.

rvagg commented 2 years ago

This post is for extrapolation of one of the reasons why I think we should discourage the use of identity CIDs in Filecoin (so I can link to this from a lotus discussion). This is mainly relevant for online deal flow, and mainly influenced by the historical precedent we set by including inline CIDs as their own block entries in CARs, although there are other reasons why it's not necessarily a great idea.

When we include CIDs as separate blocks in CARs, the block data we're inlining shows up 3 times: in the original block with the link, then as the CID for a CAR section and then as the data for that same CAR section. It'll show up another time for a CARv2 index, which will be generated for SPs actively serving retrievals, but I've not factored that in below.

We get de-duplication efficiencies if multiple blocks use the same inline CID but we'd get similar efficiencies if we were linking to the same block anyway, and beyond the size of a standard CID (~36 bytes) we rapidly end up losing any benefit.

Ignoring de-duplication effects, the "savings" can be visualised like this (only CIDv1), going up to the 126 bytes that Lotus will currently inline:

Screenshot 2022-09-05 at 8 54 58 pm

Storage cost increases ~linearly for storing the bytes as a separate linked block, with the linking CID and the CID:bytes CAR entry; i.e. the overhead is ~2 CIDs. For identity CIDs the increase is ~3x, with an overhead of 9 bytes. Break-even is at 32 bytes.

rvagg commented 1 year ago

Is anyone in this thread going to squeal if we put a hard limit in both js-multiformats and go-cid, and also a suggestion in the spec, that would error if they are asked to decode (or encode) an identity CID greater than 2048 bytes? Try and ignore for a moment that special-casing one specific hash function is a bit weird, would you actually care of this was done? If not, what alternative limit would make you happier?

ribasushi commented 1 year ago

As a frequent user of 0x00 I would not mind, and this is the limit (2k) I originally proposed to @masih when they were doing the indexer inclusion thing. My only OCD request is to limit the payload ( the "hash" part ) to 2k instead of the entire MH/CID: i.e. make sure that no encoded varint gets to count towards these 2048 bytes.

ianopolous commented 1 year ago

Yep no problem. We've long since migrated off identity hashes for anything but small public keys, and we hard limit them to 36 bytes there.