Size limit of identity hash

vmx commented 4 years ago

As I've been working on the Rust implementation of Multihash, it came up that the identity hash currently doesn't specify any limits. From an optimization perspective (this is why it came up in Rust), but also from a security perspective I think it would make sense to specify an upper bound for its size.

I personally would take a quite low limit which is similar to what current hash functions have as length. So perhaps something around 64 bytes?

ribasushi commented 4 years ago

64bytes is most definitely a non-starter, as it already won't fit the prefix/varinting of a 512bit hash.

More generally: any data that you know won't ever be repeated is a good candidate for inlining. Also an upper limit already exists: the limit of a network block itself ( 1MiB soft, 2MiB-1 hard ).

It would make me sad if we put an arbitrary small limit bound by the ability to print out a CID. Also note that other visible system parts, e.g. dirnames have no limits at present.

Last but not least: there are currently production systems that do inline ~2k or was it ~4k of data on the dht today ( /cc @ianopolous )

vmx commented 4 years ago

Also an upper limit already exists: the limit of a network block itself ( 1MiB soft, 2MiB-1 hard ).

That's only a current IPFS limitation. It's not a general limitation if you look at Multihash in isolation.

To me identity hashing makes sense to save space if the hashing output it bigger than the input. If you use it for something else, then to me it has the smell of some hack around some limitations that could/should be solved on a better way.

ianopolous commented 4 years ago

There are already limits in the go code. We've been using it to inline stuff in Peergos <= 4096+16 bytes (after discussing with core ipfs devs, but I can't find the issue right now @Stebalien, @whyrusleeping ? ). There are earlier discussions on this: https://github.com/ipfs/go-ipfs/issues/4918 https://github.com/multiformats/cid/issues/21

cbor-gen also seems to have just added a limit of 100 for this: https://github.com/whyrusleeping/cbor-gen/pull/24

ribasushi commented 4 years ago

@vmx 2MiB-1 is a libp2p limit - it permeates everything ( but you are correct: multihash in isolation is limited by uvarint alone )

It is generally correct that inlining is there to realize savings, but they come in more forms than just space:

absolute space saving ( the one you had in mind ): anything shorter than cid-prefix+hashbits is a win if you get to inline it
storage/index space saving: multiple-logical-blocks packaged as a single physical block by recursive inlining aids storage-page utilization/packing
transport space/time savings: a recursively-inlined dataset can be seen as a "mini-graphsync optimization": i.e. you ask me for node X => you automatically and invariably get its contained nodes Y Z Q and R, enjoy!
operation savings: if you have a construct with multiple nested known-random entries ( i.e. a set of public keys ), having each entry addressable by CID but not having to do an encode/decode pass is likely valuable in contexts I can't think of right this moment. Perhaps @ianopolous can go further into their 4096 + 16 use-case...?

TLDR: We definitely should define a formal limit. I definitely think this limit needs to be in the high kilo byte range, way beyond the proposed 64 bytes.

vmx commented 4 years ago

way beyond the proposed 64 bytes.

I care more about having an actual limit, than what the limit actually is.

ianopolous commented 4 years ago

I found the discussion I was referring to: https://github.com/ipfs/go-cid/pull/88

Stebalien commented 4 years ago

We generally assume that CIDs are small and cheap to handle. CIDs are designed to be identifiers and inline CIDs are designed to allow inlining objects that are smaller than the identifier. Keeping this assumption valid is extremely important for performance and usability.

Allowing for large (more than 100-128 byte) CIDs makes it very difficult to reason about sizes.
CIDs show up in paths. Ridiculously long garbage in paths is not user friendly.
CIDs now show up in subdomains. The current plan here is to re-hash large CIDs in subdomains to make them fit (resolve the object, then create a new CID with a smaller hash digest) but that will impact content routing.

Last time we discussed this, we set the default limit to 32 bytes to be safe and marked the relevant options in go-ipfs as experimental.

Really, it sounds like the need here is for a way to pass around blocks with type information, right? We may need to distinguish between the two cases.

ribasushi commented 4 years ago

The current plan here is to re-hash large CIDs in subdomains to make them fit (resolve the object, then create a new CID with a smaller hash digest) but that will impact content routing.

Afaik this part got entirely nixed, and the current implementation will not support over-long CIDs ( /cc @lidel )

mikeal commented 4 years ago

Where/how would we enforce the limit? We’ve had very similar threads about block size limits and ended up punting any hard limits to the network and storage layer, which in this case isn’t really an option since they rarely decode the block data.

Also, just as a matter of fact, many of our existing libraries don’t handle identity multihashes in CID’s correctly and will hand the Link/CID type off to the storage layer to get as if it were any other CID. Similarly, our Block interfaces do not all support data encoded to/from the identity multihash. The strategy for representing and considering “inline blocks” is still in the experimental phase when it’s handled at all.

This is a good time for this discussion given that inline block support is still under development, but I want to make sure that we’re setting the right expectations about where this might land.

Stebalien commented 4 years ago

Where/how would we enforce the limit?

When decoding/validating CIDs?

Also, just as a matter of fact, many of our existing libraries don’t handle identity multihashes in CID’s correctly and will hand the Link/CID type off to the storage layer to get as if it were any other CID. Similarly, our Block interfaces do not all support data encoded to/from the identity multihash. The strategy for representing and considering “inline blocks” is still in the experimental phase when it’s handled at all.

Actual support in block stores is just an optimization. Inline CIDs are still usable as normal CIDs.

vmx commented 4 years ago

We identified two cases:

Data is smaller than the hash would be

I'd call this one the original use case. There seems to be agreement that for this use case, there should be a really low limit, in the bytes, rather than KiB range.

Inlining blocks/blocks with type information

This is the Peergos use case. It's more of an optimization.

Now the question is. Could we come up with a solution for the Peergos use case, which Peergos could upgrade to, while limiting the identity hash to a small size as some libraries already do (and also I'd be in favour of).

ianopolous commented 4 years ago

It seems clear to me that there is a breaking change coming here. So, for what it's worth, I've come up with a backwards compatible solution and we're stopping using identity multihashes to inline small blocks in Peergos now. We can't migrate existing users (we don't have their keys), but I've realised that those that are on our servers are actually ok now because we are using the S3 datastore and managing pins ourselves (indeed if they're logging in through our server they're not even using ipfs at all). Any new users, new data written, or old data modified will not use identity multihashes. I think we can write a gradual migration that runs when they log in. It's painful, but the risk is too high for us.

The largest identity multihash we use now is 36 bytes, for public keys (which includes the type of key and key material all encoded as cbor). And we now enforce this as a hard limit for new data. In the future we'd like to use identity multihashes for CSIDH public keys (64 bytes + multikey header), but that is a future discussion.

mikeal commented 4 years ago

multihash is a general purpose standard with general purpose libraries. Given how young the project is, we should assume that the use cases we currently understand are not a complete set and we need to make sure we stay open and accessible to being used for things in the future that we haven’t even thought of yet.

With that in mind, I don’t think that we should:

Create and enforce limits based on our own opinions about what is “good practice.”
Adopt and enforce limits on behalf of specific use cases.

If there’s a universal reason to have a limit on the size of a multihash that we’re confident is always going to be true, then we should adopt it, but that’s not at all what I’m seeing.

If, as we already know, IPFS wants to use CID’s for subdomains and therefor needs to enforce a size limit on CID’s which in effect limits the size of a multihash, that’s fine, but that’s IPFS’s decision to make and their limit to enforce. That doesn’t belong in the core of multihash because it’s not universally representative of all the use cases someone might build on multihash.

vmx commented 4 years ago

With that in mind, I don’t think that we should:

Create and enforce limits based on our own opinions about what is “good practice.”

I think we should exactly do that. This will prevent systems to randomly blow up when some implementation impose arbitrary limitations. The nice thing about having a limit in multihash is, that you can highly optimize it. You would always know the upper bound of the supported hashes, hence e.g. do things with stack allocations only. This is not possible if you want to support the Identity Hash with maximum compatibility, which would mean 8EiB.

mikeal commented 4 years ago

The nice thing about having a limit in multihash is, that you can highly optimize it. You would always know the upper bound of the supported hashes, hence e.g. do things with stack allocations only. This is not possible if you want to support the Identity Hash with maximum compatibility, which would mean 8EiB.

Then don’t “support the Identity Hash with maximum compatibility” ;)

Users and implementations are free to make domain specific decisions about these limits, the right decision for one user will not be the same for another. It’s not the job of the underlying primitive to make these decisions on your behalf because we don’t know what each user’s requirements are.

Look at the block limit, to my knowledge only one transport has a real block limit and yet pretty much every user imposes block limits at half the current transport limit because we called it out as a good practice. It’s not a hard requirement in the spec and it’s not enforced by our codec libraries, but it’s a functioning limit everywhere that it matters.

I’m not saying we shouldn’t define good practices, and even document what we think is a reasonable target limit for multihash, but we shouldn’t impose that limit in these libraries at that layer or call it out in the specification as a hard requirement.

If @ianopolous wants to have big multihashes in his CID’s, he shouldn’t have an issue at the multihash layer, even if he will have issues at the IPFS layer. In the same way that I can create 5MB blocks for a one-off use case knowing that if it ever needs to be used in Bitswap it’s going to break.

mikeal commented 4 years ago

Now the question is. Could we come up with a solution for the Peergos use case, which Peergos could upgrade to, while limiting the identity hash to a small size as some libraries already do (and also I'd be in favour of).

I struggle to see how these inline use cases would exist in dag-cbor and newer codecs. From what I can tell, this looks like a workaround for some limits in dag-pb or perhaps in unixfsv1 (I haven’t gone deep enough to know for sure).

You can “inline” node data into the block using any codec that supports the full IPLD data model without hacking it into the CID. I can’t see the utility here other than “we forgot to make this part of the data structure a union” and it seems like the right thing to do there would be to fix the data structure to support that because it’s a lot more complex to deal with data that has been inlined into the multihash. I understand that in the case of dag-pb we may not be able to change the data structures, but that doesn’t mean we should port this practice over to users that have access to the complete IPLD Data Model.

It’s not that inlining data isn’t a common and necessary feature, it is, that’s why we fully considered it in dag-cbor and in the IPLD Data Model and have a compelling feature set. If this pattern is common enough we could even consider adding syntax to IPLD Schemas to make kinded unions on links easier, similar to how we have syntactic affordances for making links in general easier.

But, across a lot of our code, data inlined into the multihash throws a wrench in our layer model and is difficult to support across different Block, Link, and storage interfaces. Most code thinks of a CID as a key and its data living somewhere that it can retrieve by that key. If you put the data in the key, the representational pairing of [ key, value ] is lost and there’s not a very clean way to maintain the interfaces without pushing this to users (which is what happens currently, if you put data in the multihash you’re going to be pulling it out and working with it very manually).

ribasushi commented 4 years ago

If you put the data in the key, the representational pairing of [ key, value ] is lost

why do you say that? Nothing in e.g. ipld-prime land would change:

you encounter a link with CID \x01\x71\x00\x2A{{ 42 bytes of cbor }}
you go to the ~blockstore~ link loader ask for that
the ~blockstore~ link loader turns around instantly and gives you back {{ 42 bytes of cbor }}
profit

ianopolous commented 4 years ago

I struggle to see how these inline use cases would exist in dag-cbor and newer codecs. From what I can tell, this looks like a workaround for some limits in dag-pb or perhaps in unixfsv1 (I haven’t gone deep enough to know for sure).

Everything we do is dag-cbor - this has nothing to do with protobuf or unixfs.

It is much much more elegant to do it our way (although elegance wasn't our motivation, speed was) because then the same object class always maps to the same ipld structure. The way I've worked around it is to now have two distinct cbor ipld encodings for the exact same type (class) of object, and handle both types explicitly in the deserialization. So the same type of object now has two totally different ipld strutures.

it’s a lot more complex to deal with data that has been inlined into the multihash.

It was trivial to support this for us (4 lines of code globally). There's nothing "manual" to it.

mikeal commented 4 years ago

link loader

That’s specific to Go, where the link loader is an abstraction between the storage layer and the decoded node layer. We don’t have that in every language, often the node layer just talks directly to the storage layer, which means every storage API needs to handle this or every line that asks for data by CID needs to handle this.

Also, @warpfork will need to weigh in, but I recall him mentioning that there are plenty of things in go-ipld-prime that won’t work well when inlining data this way.

It is much much more elegant

The solution we spent considerable time working through to this problem is unions (mostly kinded unions using Link as a kind). It’s a core feature of IPLD Schemas and translates nicely into every programming language and all the abstractions we’ve built.

It’s problematic to have multiple approaches to inlining data and a kinded union provides a much cleaner approach that keeps the type differences clear to everyone. It sounds like you actually want to blur the line a bit on the type differences so I can see how that approach with be more attractive, but as we build out generic libraries it’s rather difficult to have a single type mean very different things.

That said, we’re not going to break or disallow anything that is valid CID/multihash, we just may not have a very nice interface for you to use when you inline data this way, which you probably don’t care about since you have your own libraries ;) And as I’ve already stated, I’m rather opposed to setting a hard limit on multihash size in the specs or core implementations. Some libraries and consumers may set limits you’ll have to contend with and I suspect languages or libraries that want to optimize memory allocations will set a configurable limit, none of which are an issue if you were to take the kinded union approach instead.

ribasushi commented 4 years ago

It’s problematic to have multiple approaches to inlining data and a kinded union ...

I think this is where the disconnect is. Identity CIDs operate on a layer below where a "kinded union" would exist, they are strictly in "codec-land".

To put it differently: from link-traversal perspective there is no practical difference between:

a 4096-byte identity CID
a 32768-bit output from keccak's shake functions

We currently support both. The proposal is to limit only one of them, on account of one of them being special. How do these 2 examples differ?

mikeal commented 4 years ago

I think this is where the disconnect is. Identity CIDs operate on a layer below where a "kinded union" would exist, they are strictly in "codec-land".

Once you recognize that links in a node graph are transparently traversed, that the link is resolved to a node that replaces the link representation to become the node representation for that property in the parent node, they are functionally equivalent. Both put the node data in the same place and stored in the same block.

Conceptually, this never exists in the decoded node graph:

ParentNode -> Link -> ChildNode

Instead, this is what happens:

// before resolution
ParentNode -> Link
// after resolution
ParentNode -> ChildNode

You can observe this in our pathing, where named properties that are links get resolved to their decoded node value. There’s actually no way to return a link from a fully resolved path.

let value = 1234
let link = Link( value )
{ property: link }

If you resolve the path /property of this block you’ll get 1234, there’s no pathable reference to the link itself.

vmx commented 4 years ago

Some libraries and consumers may set limits you’ll have to contend with and I suspect languages or libraries that want to optimize memory allocations will set a configurable limit.

That's exactly the issue, when we don't specify a limit. Your application would work on one implementation, but not by default on some other. Finding out that one of your identity hashes is to long, sounds like a very hard to find bug if you are not aware that there might be a limit.

The option might be configurable, but it might not even be configurable at runtime, but at compile time only.

Having a limit makes building systems that work everywhere easier, not having a limit makes it harder with little benefit.

Storing data in Multihash to me sounds wrong, that is not what Multihash means to me. Though it's kind of a neat hack, so if we want to support large data (still with a limit, but a high one) stored within a Multihash, I suggest we introduce a new codec for that. This way codecs can still support the "the hash of my data is bigger than the data itself" use case, while not being forced to also support "i store data in my multihash" use case.

This way also the error reporting will be clear, as you are now see that your codec is not supported instead of having it fail in weird ways.

warpfork commented 4 years ago

I think a lot of the important things have already been said here, but I've been called out, so I guess I feel I ought to weigh in on the record, heh. I'll mostly just re-highlight things that have already been said that I agree with and want to boost, though:

It is true that we have typically eschewed hard limits in IPLD, while acknowledging that they exist in systems we're likely to be used together with (such as block size limits from IPFS / libp2p), and advising that people probably want to stay within those limits unless they're confident their use-case means they aren't worried about this. I think we have no regrets on this approach so far!
It is simultaneously true that we may want to note a size boundary recommendation, per the same logic as the previous point.
I'd probably vote for having length limits be a parameter of which multihash we're talking about.
- FWIW: I've actually always been a little surprised we have length parameters on all of the multihashes to begin with. I see why it's a parameter for e.g. blake2 variants; it was originally quite a shock to me the first time I saw it was also present for, say, plain ol' sha256.
I strongly agree with @Stebalien 's observations: "We generally assume that CIDs are small and cheap to handle. Keeping this assumption valid is extremely important for performance and usability. Allowing for large (more than 100-128 byte) CIDs makes it very difficult to reason about sizes." Yep; yep; and yep. I am not a fan of large values appearing in CIDs.
Inline CIDs... okay. Let's break this down into a few sub-bullet points:
- I don't think inline CIDs were a good idea, and I don't think we'll be encouraging anyone to use them in a year's time. (I don't think I'd encourage anyone to use them now, in fact; and I might place a bet on the table that we'll be actively and overtly regretting them in a year's time.) The following points are why:
- We have a mechanism for talking about a choice between "inline" data and links now! In the IPLD Data Model, you can do this freely; it's so innately up-to-you that it's barely possible to talk about a world in which it's... not. (This is what @mikeal is getting at in some of his comments. Inline CIDs might've made sense in some applications based around dag-pb, I think, maybe? But that world is... something we should regard as fading into the past very quickly: the IPLD Data Model is much more expressive than dag-pb.)
- We have a well documented and explicitly tool-supported way to choose between inline and linked data, as well: IPLD Schemas have copious support for unions, and various forms of indicating them: including some which simply differentiate based on whether the serial data is a link or some other Data Model kind. (These are the same concepts as you can use without the Schema layer; I highlight it in terms of the Schema system only to emphasize how very, very supported it is.)
- These mechanisms for choosing whether you're using links or "inlining" the data (by just... not putting it into a "link" format at all) work without causing additional serialization calls just to put the data into bytes for an "inline CID", which is... superior.
- These mechanisms for choosing whether you're using links or "inlining" the data are exposed very clearly at the library layer: you just... manipulate your data using the standard Data Model semantics, which resemble an AST. These continue to work and compose even when you're doing deep traversals, etc.
- Contrariwise: we do not have particularly good and clear ways to do high level operations on large graphs of data, whilst constantly deciding which links are going to be encoded as inline or not. It's possible... but I wouldn't say it's smooth, or a corner that we get a better developer experience by regularly asking people to shine a light into that crevasse.
- Given all the other caveats about inline CIDs discussed above (e.g., big sizes cause various major usability questionmarks)... really, there's just... I struggle to think of much good at all that comes from inline CIDs. We have clear alternatives; I can't imagine we won't grow to prefer them.
- n.b. I don't want this screed to be mistaken for a claim we should drop support for inline CIDs either; that's a stronger case than I'm willing to make. But I think it's also important to make remarks on how much energy we want to put behind supporting them nicely. IMO: a very limited amount; and I definitely don't think we should let them have predominating influence on any other design choices we wrangle with.
"There’s actually no way to return a link from a fully resolved path." -> okay, this is true, but I do actually regard that as probably a design bug, fwiw :) I'd like to design our way out of this. The blind spot has hurt us surprisingly little so far, but it does bother me.
- The rest of the arguments made in that comment still stand though! Even though I want to design our way out of this blind spot, my current suspicion is that we'll make "advanced pathing" variants that allow introspecting links, but they'll not be the default path we lead users down in most cases.

@ianopolous -- I'd be happy to talk to you more synchronously about this if you'd like, but in essence,

The way I've worked around it is to now have two distinct cbor ipld encodings for the exact same type (class) of object, and handle both types explicitly in the deserialization. So the same type of object now has two totally different ipld strutures.

... that sounds less like a "workaround" and more like "exactly the right thing" to me :)

Maybe there's some different way to organize the type definitions in your language of choice that would make it more natural? Dunno; I'd be willing to look at it with you though, if you'd like. I'll also say: goodness knows golang hasn't been making it exactly easy on me to represent unions either! But it's been a logically sound path to pursue, even when my host programming language hasn't made it frictionless. So far, every time push has come to shove, I've been very happy with the outcomes stemming from our pursuit of unions.

warpfork commented 4 years ago

(Perhaps that comment would be easier to read if we introduced some term other than "inline" for when we put data in the same block rather than a separate block -- I used it describe that general practice in the same comment as discussing "inline CIDs", and that's probably confusing. Forgive me, reader. A better phrasing has not, at this moment, yet occurred to me.)

warpfork commented 4 years ago

To follow up a little more concretely on that allusion to "we have a way to talk about links vs ~inline~ embeded data now" --

This is a schema snippet that we could use to describe this common scenario:

type ThingSomewhere struct {
    foo String
    bar Bytes
}

type ThingHereOrNot union {
    | ThingSomewhere map
    | &ThingSomewhere link
} representation kinded

type EnclosingFwoop struct {
    couldBeEmbededOrBeLink ThingHereOrNot
    otherData String
}

A block containing one object matching the EnclosingFwoop type could have either zero or one links in it, and is still described by this schema in either case, and its transitive graph (if it does contain a link -- or just the block itself, if it doesn't) contains one ThingSomewhere.

At no point were inline CIDs ~harmed~ used in the description of this data; and yet, we have choice over whether or not the data is split into two blocks or not.

I find that this is a very straightforward way to describe this situation: and it works fine with arbitrarily complex structures of data; it works purely in ways that are easy to describe in terms of the Data Model (e.g., without needing to discuss using different variants of CIDs); and because of this simplicity, I think this is generally the sort of approach I would recommend essentially all new code to take in preference to inline CIDs, if at all possible.

I've used the schema syntax here only to clarify and describe. It is not necessary to use schemas to do this; one can simply construct data and walk over the data model according to this convention.

I don't actually know what reasons there would be to prefer using inline CIDs over using this much simpler model of "embed if you want". If there are some very specific situations that really require using a CID for ${external opaque reason}, maybe that's something we should document in a short list of known situations? My suspicion is that list is going to be very short and have the general feeling of being enumerating "exception rather than the rule".

ianopolous commented 4 years ago

I can explain our case in detail if it helps. Note that we have a work around/"exactly the right thing to do" as mentioned above.

One of our fundamental objects has a field which is ciphertext, so a structureless byte[], which is limited to 5 MiB. This thing can be either a directory or a file, and indeed we explicitly hide this from the ipld layer (you have to decrypt to decide which). We represent this as a class with a field which is a list of cids. Whenever we want the actual ciphertext we just pass the list of cids to the datastore and get back the results (identity hash or not). However, as I mentioned above, this thing can represent a directory, and this means that when we are traversing a path we must retrieve many of these objects, decrypt the ciphertext and recurse. The critical point in our model is that we aren't assuming the datastore is local, normally it is on a remote server. This means that inlining the directory (or small file) data is critical to speed for us because it results in many fewer network round trips/ DHT retrievals.

I'm fine with not using identity mulithashes here, as we're about to migrate towards, but it does result in more complicated code, and I wish there had been a limit or indication that these were meant to be < 100 bytes when it was released in go-ipfs.

On the general topic of having a limit. I am very strongly pro having a well defined limit. The reason being that without such a limit we have a system with a primitive that is basically undefined. If some system imposes limits then the end result is a fragmented ecosystem of applications and language implementations that can't talk to each other.

ribasushi commented 4 years ago

@ianopolous so to make sure I understood you right: basically your use case is point 3 here: https://github.com/multiformats/multihash/issues/130#issuecomment-656009845 ? Your "bonus" of using inlined CIDs is the ability to not care for a predefined "schema" of the structure, so that "everything is a block" to your unwrappers/decryptors?

Or in other words: you use inline CIDs to logically separate the "transport/decrypt/low-level-decode" codepath from the "semantic high-level decode" one?

ianopolous commented 4 years ago

@ianopolous so to make sure I understood you right: basically your use case is point 3 here: #130 (comment) ? Your "bonus" of using inlined CIDs is the ability to not care for a predefined "schema" of the structure, so that "everything is a block" to your unwrappers/decryptors?

Or in other words: you use inline CIDs to logically separate the "transport/decrypt/low-level-decode" codepath from the "semantic high-level decode" one?

@ribasushi Yep that's right.

I should also add that the reason it is 4096 + 16 is that we pad all plaintext to a multiple of 4096 before encryption to protect the metadata around size. (And the 16 is the encryption overhead)

mikeal commented 4 years ago

Your application would work on one implementation, but not by default on some other.

This is true of pretty much every protocol. TCP, UDP, HTTP, etc, there’s no size limit in the protocol specification for the total data transferred, but every service provider and implementation sets one. These limits don’t seem to negate the benefits of agreeing on common protocols and clients learn to live within reasonable limits the ecosystem of service providers have set.

As an example: I have a script that does GraphQL queries to GitHub’s service and if the request takes too long the gateway kills the connection even though the query was well within the rate limit GraphQL and even their HTTP service set. Service limitations are application and provider specific, and they are applied to all the protocols you touch, we can’t enforce them for everyone or even hypothesize about what all the use cases are. For a lot of people, setting a block size limit of 1mb solves any concerns they might have about large CID’s as a side effect. For others, maybe not.

I’m very interested in recommending a reasonable size limit and would expect many consumers to adopt it (similar to how we handled block size limits) but to set a hard limit in the standard is too much for me.

Stebalien commented 4 years ago

The key difference here is that you're talking about "interactive" situations where limitations can be worked around after the fact:

If your program guesses the MTU wrong, it can try again with a smaller MTU.
If your program creates large GitHub queries, you can fix it to create smaller queries. Same for long running queries. Furthermore, you can readily test because there's a single centralized service.
If you're issuing too many queries and hit a limit, you can backoff and try again later.

If your application creates large blocks/CIDs that some applications support, it'll appear to work until someone tries to use it where such large blocks/CIDs are not supported. At that point, your data is literally incompatible with this other program so you have to either:

Change the data (changing all the hashes).
Change the program (changing important perf assumptions in the program).

Worse, many programs will support arbitrarily large CIDs/blocks, some user will start creating these arbitrarily large CIDs/blocks, then you'll try to forbid them for some performance/security reason, and you'll break end users.

MTUs usually don't have this problem because most systems assume a min MTU of around ~1KiB.
GitHub doesn't have this problem, because it can start warning devs about long-running queries, etc. before banning them.

On the other hand, we can't just go ahead and say "rewrite all your data and change all your CIDs". That's the real downside to content addressed data.

This is exactly the case we're running into with peergos right now and it's exactly why we need these clear limits. I guess we don't necessarily need true "maximums", but we:

Need a clear "if it's larger than X, it probably won't work in all cases".
Some way for users to avoid accidentally going over X. Unfortunately, the usual way to do this is to enforce at least a soft maximum.

aschmahmann commented 4 years ago

but to set a hard limit in the standard is too much for me.

I disagree, but beyond that can we agree that any implementation of the identity multihash will have to define a size limit?

If so then IMO the natural logic goes like this:

We have a specified thing called "multiformats: identity multihash" which has many things specified, but no size limit
For every application we have a specified thing "{appname} : multiformats: identity multihash" which has a specified size limit in order to make the application work
The multiformats project is effectively about future proofing (see all our website, github repo, etc.), and its main technique to do so is by using self describing values (i.e. the website says "Self-describing values for Future-proofing")
Having every application make their own spec without using their own codec effectively breaks this rule since it's no longer self describing. "IPFS : multiformats : identity multihash" looks exactly the same as " EmbedEverythingApp : multiformats : identity multihash"
Even more so this breaks future proofing by effectively preventing two applications from interacting with each other since their namespaces collide

Conclusion

Codecs must be specced out such that developers cannot accidentally create two incompatible versions of the same codec. This includes maximum size.

I'm fine taking @vmx's solution https://github.com/multiformats/multihash/issues/130#issuecomment-657433263 and allowing (if anyone ever wants this) the introducing of new codecs with as big sizes as your heart desires. You want a size 10GiB identity multihash go for it, but please use the identity-1EiB codec. As an added bonus since these multihashes will be so large we can feel free to allocate them slots in the codec table that are very long (i.e. what's a 1KiB prefix to a 10 GiB file).

vmx commented 4 years ago

We had a 25min discussion in the IPLD weekly call (watch the recording from 23:06 to 48:42).

Here are the things we agreed on (please correct me if I got those wrong):

Large CIDs, hence large Multihashes are a bad idea. You should keep them small.
Multihash size limits, should not only include the identity hash, but also other hash functions that can have an unlimited output size, like Blake3.
We add a documentation which is similar to the IPLD Library Design Guidance. Where we can talk about size recommendations. These recommendations can also include that library authors should make sure they document those limits well.
- A recommendation could be (but that's open for discussion when we add those), that implementers could restrict the size of all the hash functions (including identity hash) to the biggest fixed size hash they support.

One idea I had after the meeting. Could we put directly into the Multihash spec a note/warning that implementations may have size limits?

warpfork commented 4 years ago

+1. I think we all agreed that we should have something written somewhere. The only variations in opinion are about how precisely how sternly it should be phrased.

I think putting recommendations in Library Design Guidance, and also in the Multihash specs, both sound like a good idea.

aschmahmann commented 4 years ago

TLDR: Can we just allocate more codecs? It won't even be that many, and it will make data portability much more sane.

@vmx I agree that Large Multihashes are likely a bad idea. However, I think it's a big problem to have something that is effectively a spec (i.e. the IPFS blake3 multihash is at most 1KiB) but not have it expressed explicitly via the codec.

If you look at the multicodec table now we have over 100 codecs allocated to each of blake2 and 100s allocated to skein. Why not make life simpler by just adding more codecs? For example:

blake3-min256-max1k blake3-min1k-max4k blake3-min4k-max16k identity-min0-max512 ... identity-min-4k-max16k

Doing this has the advantage that it's trivial to figure out where the compatibility errors are between given applications since it's as simple as "oh I see, you don't support the X multihash, would you support it or do I need to write some conversion software that switches the multihash?".

As far as I can tell the only real downside of allocating many multicodecs is that they take up space in the table. However, there's this neat feature where the larger the multihash is the larger the prefix we can give them and have it be irrelevant. This means that by allocating codecs in exponentially increasing sizes that every codec beyond the first one or two may as well be free and that we end up with maybe 4 size codecs per hash function instead of 100s like we have now.

vmx commented 4 years ago

TLDR: Can we just allocate more codecs?

I think we could, but I don't see much value in it, but rather a theoretical exercise. Identity hashes shouldn't really be used and I can''t recall having seen hashed > 512 Bit in the wild yet.

It's good that we could add more codecs in the future, but I don't see a reason to do that now.

aschmahmann commented 4 years ago

Identity hashes shouldn't really be used and I can''t recall having seen hashed > 512 Bit in the wild yet.

So then we shouldn't have any problem today embedding in the specs for blake3 and the identity multihash that their maximum sizes are 512 bits. If someone comes along in the future and wants a bigger one we can just allocate a new codec for them.

I don't see a reason to do that now.

There's no reason to allocate the higher sized codecs since we don't need them. However, there is a reason to now mark the unbounded (or VarInt size bounded) mutlihashes as actually having small maximum sizes (e.g. 256 or 512 bits).

Stebalien commented 4 years ago

Multihash size limits, should not only include the identity hash, but also other hash functions that can have an unlimited output size, like Blake3.

I agree we should have some warnings about size limits, but I think the multihash size limits and the CID/IPLD size limits are going to look different. Multihash is just a general-purpose way to format self describing hashes. Ideally, we'd be able to turn any hash, no matter how big (within reason), into a self describing hash.

CIDs (with respect to IPLD, at least) are supposed to be reasonably short content identifiers. There I'd expect the size recommendations to be smaller.

mikeal commented 4 years ago

However, I think it's a big problem to have something that is effectively a spec (i.e. the IPFS blake3 multihash is at most 1KiB) but not have it expressed explicitly via the codec.

A surprising number of hash functions are of variable size so it’s not really practical to do this.

Something @ribasushi brought up a few months back is that we need to put a size limit in our code for pretty much every hash function. Since these recommendations vary per hash function, we may want to capture this in another table, or expand the multicodec table, because the expected/recommended max size for most hashing functions doesn’t vary by programming language.

We haven’t implemented this yet, but we intend to.

aschmahmann commented 4 years ago

A surprising number of hash functions are of variable size so it’s not really practical to do this.

What part is impractical about doing this? We may have a few (i.e. 1-4) hash functions, but the larger their sizes are the more prefix bits we can give them. Our key space is effectively unlimited if we're not focusing on giving out prefixes that are 1-2 bytes.

We gave away 100s of spots to other multihashes in the past many of which are probably unused (looking at you blake2s-72). What's a giving out a few per hash function where the prefix length assigned increases with minimum multihash size?

we need to put a size limit in our code for pretty much every hash function

If that needs to exist in code and we want code from different projects to be compatible then the code requires a spec

Since these recommendations vary per hash function ... expected/recommended max size for most hashing functions doesn’t vary by programming language

Then let's just assign the recommendations to a codec and we're done.

expand the multicodec table

How is this different then my proposal?

mikeal commented 4 years ago

Every table entry has a cost as it represents another barrier to compatibility between implementations, as each implementation must be configured with a hashing function that matches the identifier. If a hashing function has a variable length we can maximize the compatibility between implementations by leaving an affordance for the variability in length associated with only that one table entry and hashing function, making far more implementations compatible with it.

We cannot expect every multiformats implementation to have support for every hash function. We also cannot arbitrarily limit the number of new hash function entries in order to try and reduce proliferation because people who are using obscure hash functions have grounded reasons to do so and will likely sacrifice multihash compatibility if necessary. The only thing we can do is allow for optionality where we can, which is in the length. As far as I can tell that’s the only place we can, so we should.

mikeal commented 4 years ago

How is this different then my proposal?

I’m recommending that we add a column to the table to capture recommended max sizes. What I think you’re recommending is adding new entires for each hash+length.

aschmahmann commented 4 years ago

If a hashing function has a variable length we can maximize the compatibility between implementations by leaving an affordance for the variability in length associated with only that one table entry and hashing function, making far more implementations compatible with it.

How?

If you're talking about library compatibility, such as between go-multihash and js-multihash then what's the big deal? Either way we're going to end up with code like this to take into account the fact that we already allocate 100 blake hashes. Having libraries internally have some function like addCodecWithSizes(codec, []size) is IMO not a big deal, especially since we effectively require it already due to blake2 and skein.

If you're talking about application layer compatibility then IMO having a specified length instead of an apprehensive documentation note is much better. Without it the probability that two multihash based applications end up incompatible with subtle bugs seems quite high.

There's another benefit to using the multicodec table. When a new user shows up and says "I see you support the identity multihash up until 512 bits, but I'd like 10k please give me a codec" we can say "sure we're happy to assign a codec, but there's a reason nobody's using it so far do you want to discuss your use case".

ribasushi commented 4 years ago

@mikeal @vmx I think what is @aschmahmann proposed is actually quite practical ( even though I dislike it for a reason detailed below ), if we attach sizing to multiformats table position. Just like unicode has predefined "planes", we could ( without breaking too much ) define the following:

Any multihash id that has a length of 4 bytes or less when encoded as a varint, implies that the "hash/payload size" is <= 127 ( handwave )
Any multihash id that has a length of 5 bytes when encoded as a varint implies that the "hash/payload size" is <= 65535
... continue series until 9 bytes and a gigabyte of payload or whatever

I really dislike this because we are limiting stuff that was previously open, but if this will make all sides moderately happy: I think it's a compromise worth considering.

What I take issue with however is:

we can say "sure we're happy to assign a codec, but there's a reason nobody's using it so far do you want to discuss your use case".

This implies too strong "table ownership" dynamics. We are stewards at best. If a user wants to take the cost of a large CID, ad the penalty of a large multihash code: they should get one, no questions asked (asking as curiosity: sure, but not in "justify yourself" terms ).

Unless the conversation is shifting from "we want to give implementers reasonable assumptions to work with" to the more authoritarian "I think doing X is bad, and I don't want to see anyone do it using my table". If we are having the latter conversation: we already lost our way.

aschmahmann commented 4 years ago

Any multihash id that has a length of X bytes ... implies that the "hash/payload size" is <= Y

I don't know if we really need to define that as anything more strongly then a general policy (as opposed to a rule we could encode in a lang-multihash library). However, the general principle of leveraging the size of the multicodec table to allow specifying as many things as we want is indeed what I was getting at.

This implies too strong "table ownership" dynamics.

Fair enough. I agree we should assign codecs if people ask for them without needing a "reason" (as long as they're not doing things like asking for 1000 low bit codecs).

However, as people who are involved in the project it may serve as a useful heads up to let a project know that while they can happily use a new codec what some of the tradeoffs involved might be (especially if it's as simple a change as increasing the hash length).

vmx commented 4 years ago

I’m recommending that we add a column to the table to capture recommended max sizes.

@mikeal I'm against this. This would lead to discussions like this one. Which recommended size would you pick for the identify hash. It might also make implementations more difficult. I would then try to meet those recommendation. If we just say there are limits, I can keep it simple and e.g. limit all my hashes to a single maximum value. That's easy to explain and easy to reason about.

mikeal commented 4 years ago

@vmx we do not want to impose a cap for SHA2-256 of anything other than 256, and 512 for SHA2-512, and so on. Some hash functions have a known fixed length that should be imposed and some are variable, that’s why this needs to be another column, because it’s a bigger problem than just this generic max size.

whyrusleeping commented 4 years ago

Wow, long thread. I just want to add that the 100 byte limit I imposed on ScanLinks in cbor-gen (here) is fairly arbitrary, and aimed at protecting certain performance critical code from overallocations (note: the const is only used inside of ScanLinks, the rest of the codebase imposes a limit of 512 on cids, another fairly arbitrary limit to help protect against DoS).

I'm happy to change these limits (or make them configurable per use case) as people see fit.

vmx commented 4 years ago

some are variable

Yes I know. Though I guess it should follow the same logic as the identity hash, that we cannot really make good judgements about those limits. In case we want per hash recommendations (I don't), I wouldn't add another column (as it also doesn't apply to non Multihash Multicodecs), but just add it as a comments.

ribasushi commented 3 years ago

Re-pinging all participants, especially with the newly attained understanding of where identity CIDs really shine. What should we do here?

vmx commented 3 years ago

@ribasushi Can you please elaborate on where identity CIDs really shine.

Side-note: I'm still not sure how far the concept of CIDs should be stretched. I like the simplicity of having an identifier for content-addressed things. To me "identity CIDs (I'd call them "inline CIDs") is a different concept. It makes use of multiformats (which is great), but are they really CIDs anymore? In order to make that work, we not even need to stretch the concept of CIDs (by not addressing the content, but inlining it), but also the one of Multihashes. They are (according to the current spec) about "cryptographic hash functions", which the identify function certainly isn't (e.g. it misses the pre-image resistance). So perhaps there should be a new multicodec called "IID", an "Inline Identifier", which is just: <inline-identifier-multicodec><ipld-codec><length><data>. And perhaps there is then a "UID", the "universal identifier", which understands the multicodecs CIDv0, CIDv1 and IIDv1.

multiformats / multihash