multiformats / multicodec

Compact self-describing codecs. Save space by using predefined multicodec tables.

Qualifications for identification as a "codec" #204

Open rvagg opened 3 years ago

rvagg commented 3 years ago

This question comes up very regularly: essentially, what are the qualifications for being included in this table? I'm particularly focusing on the "codec"s in here, IPLD and related.

Some recent discussions that point toward challenges in definitions:

How about the fact that all of the codecs we present could also be interpreted as raw? I liked @Ericson2314's thoughts related to this kind of question: https://github.com/ipld/specs/pull/349#issuecomment-763867809

Some excellent and clear thinking from @aschmahmann about nominative typing using codecs: https://github.com/ipld/specs/pull/349#discussion_r559963156

There clearly exists a grey area here, and while we should avoid strong gatekeeping of the table where a contributor has greater expertise in their particular system than us, there's an educational role to play too because many people show up with requests that clearly don't fit the purpose of this table and the definitions of "codec" that we broadly share. It's valid to say "this is an incorrect use of multicodec / CID" where it clearly is. But what we need is better shared understanding of those "clear" boundaries.

Thoughts please!

(/cc @warpfork who isn't in the Assignees list)

vmx commented 3 years ago

Thanks @rvagg for digging into previous PRs and showcasing the discussions we had there.

For me, what those cases have in common is "I want to distinguish things without looking at the data itself or any context". Some can be identified from the data itself without any further context (e.g. Software Heritage); some can only be identified by context (e.g. Bitcoin).

Merkle Forest

My vision of the "Merkle Forest" is that data (or pieces of data) is stored "somewhere" (content-addressing) and might even be usable outside its original context. Software Heritage is a really good example of an implementation of that vision. They use Git objects for archiving. Those objects can be used outside the context of Git, but you could still import them into Git. If git-raw is used as the common codec, you can leverage that without any special software having to be written.

If you used a custom codec for Software Heritage, you could of course add it to your application as an alias for git-raw. But you would need to do that manually, having gained that knowledge from somewhere. Having the CID itself encode that these things are the same is beautiful, as such interoperability can then emerge without any coordination.

This "coordinationless re-use" is key for me. If every application/user starts having their own identifier, you would end up with silos. Those identifiers won't have a meaning outside of those systems unless you manually intervene and give them a meaning in your own system. If we manage to have codecs that have a more universal meaning, like just containing the information on how to get some structure out (more than just bytes), we might end up with systems that automagically interoperate in perhaps even unexpected ways.

I'm well aware that this is a vision and things might not even get there, but I'd like to keep that door open and don't want to help build data silos. Please note that I'm sure Software Heritage doesn't want to build a data silo (quite the opposite), but it might become one if we make the wrong choices.

Meaning of the CID

I can see the use case of being able to identify the data that was produced by, or should flow into or out of, your own application/system. I think Software Heritage and Likecoin are similar in this regard. This is especially so if you think about the nature of distributed networks and making sure the content is available (e.g. pinned), or if you want to grab it from data that flows through your peers. Being able to identify your own data by just the CID makes things easier.

But I'm not sure we even want to get there or support that. I'd prefer data to just be spread across the peer-to-peer network and cached by popularity, without the CID even revealing which system produced it (in a similar vein to net neutrality). For me the CID should be about the content, not about where it originated or who produced it.

Those use cases about identifying the data coming from a certain system certainly exist. Perhaps there should be something people can use instead of building their own systems or not using CIDs at all. I don't know what this would look like. It might be a CIDv2, or perhaps something completely different and not even a Multicodec code. I only know that for me that is not part of CIDv1.

Ericson2314 commented 3 years ago

@vmx I would say the fact that we have so many different codecs is because the first step is getting everyone to interoperate at all. Yes, it does seem like needless balkanization, but not if the alternative is simply no interop at all. That's strictly worse. IPLD is currently being the polite "big tent" format, willing to make the opening gesture of absorbing the sometimes-redundant complexity to facilitate collaboration. I think that's a commendable sacrifice.

Once everyone is participating in the same content-based internetwork, I think the economics alone can be relied upon to consolidate around future formats. That's how JSON and friends have defeated the line-oriented configs of old, after all.

Stebalien commented 3 years ago

I agree with everything @vmx says. In terms of multicodecs, there's no issue assigning as many as necessary. However, when it comes to IPLD codecs, creating new codecs for the same underlying formats harms interoperability.

On the other hand, people are looking for a way to distinguish between different higher-level systems. We handle this in IPFS by using path namespaces. Swarm handled this (last time I checked) by concatenating a swarm namespace codec with the actual CID: <swarm-namespace><cidv1><codec>.... Honestly, I think this may be the way to go in many of these cases:

  1. Within IPLD, use CIDs.
  2. Outside of IPLD, use namespaced paths, or namespaced CIDs if you need something shorter. E.g., an ENS record might refer to <ipfs-codec><cidv1><...> and/or /ipfs/CIDv1/...
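A minimal sketch of the namespaced-CID idea in the list above, assuming the binary layout is simply a varint namespace code followed by an ordinary binary CIDv1. The helper names (`wrap`, `unwrap`) are hypothetical, and the codes used (0xe3 for the ipfs namespace, 0x70 for dag-pb, 0x12 for sha2-256) are my reading of the multicodec table rather than something stated in this thread.

```python
import hashlib
from typing import Tuple

IPFS_NAMESPACE = 0xE3  # assumed table value for the "ipfs" namespace
DAG_PB = 0x70          # assumed table value for dag-pb
SHA2_256 = 0x12        # assumed table value for sha2-256


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a varint starting at pos; return (value, next position)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def wrap(namespace: int, cid: bytes) -> bytes:
    """Prefix a binary CID with a namespace code (hypothetical helper)."""
    return encode_varint(namespace) + cid


def unwrap(data: bytes) -> Tuple[int, bytes]:
    """Split a namespaced CID back into (namespace code, plain CID)."""
    namespace, pos = decode_varint(data)
    return namespace, data[pos:]


digest = hashlib.sha256(b"example block").digest()
# Version, codec, hash code and length all fit in one varint byte here.
cid = bytes([0x01, DAG_PB, SHA2_256, len(digest)]) + digest
namespaced = wrap(IPFS_NAMESPACE, cid)
namespace, inner = unwrap(namespaced)
```

The point of the layout is that the inner CID is untouched: anything that only understands IPLD can strip the namespace prefix and keep using the same codec and multihash.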

I'm now going to, again, plug https://github.com/multiformats/multiformats/pull/55, because it basically says "these are all multipaths".


Case by case:

Likecoin uses dag-cbor but wants to do content-routing with a new codec #200

I haven't read the full thread, but I agree with @vmx's proposal to just store this type information in the structured data, not the CID. This case looks a lot like the swarm case.

warpfork commented 3 years ago

Also big +1 to everything in vmx's comment. So much +1 that I don't even have any clarifying comments or quibbles at all. Just "yes".


There's another factor I think we should identify and include in our typical list of considerations (and comes up as significant in some of the examples needing decisions right now):

For systems that have a content-addressing based ID structure already, there's an interesting two-part question:

When we bring some data into the multiformats / IPLD universe, can we emit that data (and ideally, also slightly modified forms of it) back into the foreign document structure, and compute that same foreign ID for it?

... and that's a bit of a trick question, because the answer is always "yes" -- you can just wrap it in a {foreignID | foreignBlob} tuple. So then the critical second half of the question is:

Can we do it with just the content and the CID? Or do we need to introduce additional wrapper data to do it?

This is an interesting question because it significantly affects the amount of friction that will be experienced when moving data between these bridged systems. If wrapper data is required to be generated when data comes into our system, just to be stripped in the occasion that the data is consumed again by the bridged system family that it came from, are there significant overhead penalties involved? And, will the wrapper data make it significantly weirder to work with that data while it's within our ecosystem? If there are significant overheads, and the system we're bridging to is a large one that we are significantly concerned with low-friction integration with... then that may be a good reason to think deeply about where we can push the foreign ID entropy in order to keep it smooth. And if the presence of wrapper data would enweirden the bridged data in the PoV of our tools, that may actually inhibit the "coordinationless re-use" that vmx's comment identifies as a goal.

The multicodec indicator table is one obvious place where we can push that information in order to make a low-friction bridge. It's not the only option, but it's definitely an option.
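A rough sketch of the "just the content and the CID" case, using Git blobs as the foreign system. It assumes that a git-raw block holds the Git object bytes including the `blob <size>\0` header and that the CID carries a sha1 multihash, so the Git object ID is exactly the digest inside the CID. The function names are hypothetical, and the codes 0x78 (git-raw) and 0x11 (sha1) are my reading of the multicodec table.

```python
import hashlib

GIT_RAW = 0x78  # assumed table value for git-raw
SHA1 = 0x11     # assumed table value for sha1


def git_blob_object(content: bytes) -> bytes:
    """Build the bytes Git hashes for a blob: header, NUL, then content."""
    return b"blob " + str(len(content)).encode("ascii") + b"\x00" + content


def git_raw_cid(obj: bytes) -> bytes:
    """Binary CIDv1 for a Git object: version, codec, sha1 multihash."""
    digest = hashlib.sha1(obj).digest()
    # Every field before the digest fits in a single varint byte here.
    return bytes([0x01, GIT_RAW, SHA1, len(digest)]) + digest


def foreign_id_from_cid(cid: bytes) -> str:
    """Recover the Git object ID from the CID alone, with no wrapper data."""
    if cid[0] != 0x01 or cid[1] != GIT_RAW or cid[2] != SHA1:
        raise ValueError("not a CIDv1 with git-raw codec and sha1 multihash")
    length = cid[3]
    return cid[4:4 + length].hex()


obj = git_blob_object(b"hello\n")
cid = git_raw_cid(obj)
git_id = foreign_id_from_cid(cid)
```

If the same content were instead addressed with a different hash function or a different framing, the Git ID could not be read back out of the CID, and something like the {foreignID | foreignBlob} wrapper would be needed.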

aschmahmann commented 3 years ago

A big 👍 to @vmx's comment as well.

One thing I'd like to add here is that from what I can tell the plan of most/all of the proposed systems that would leverage new IPLD codecs to perform location addressing/hinting is to try and interoperate with go-ipfs by making, broadly speaking, a new implementation of Bitswap that notices a request for a special codec and does some recursive request elsewhere.

However, the fact that Bitswap asks for data by CID instead of by multihash is essentially a quirk related to data being indexed improperly in go-ipfs (the original Bitswap user) that is planned to be resolved in an upcoming release. Once that happens the next iteration of the Bitswap protocol may not even bother to send the extra CID bytes down the wire (i.e. the version + codec numbers) which would cause problems for any software relying on doing some magic by looking at the codec.
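To make "the extra CID bytes" concrete, here is a minimal sketch of splitting a binary CIDv1 into its version, codec, and multihash. It assumes the usual layout of varint version, varint codec, then multihash; the helper names are hypothetical, and the codes 0x55 (raw), 0x71 (dag-cbor), and 0x12 (sha2-256) are my reading of the multicodec table.

```python
import hashlib
from typing import Tuple

RAW = 0x55       # assumed table value for raw
DAG_CBOR = 0x71  # assumed table value for dag-cbor
SHA2_256 = 0x12  # assumed table value for sha2-256


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a varint starting at pos; return (value, next position)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def multihash_sha2_256(data: bytes) -> bytes:
    """Multihash: varint hash code, varint digest length, digest."""
    digest = hashlib.sha256(data).digest()
    return encode_varint(SHA2_256) + encode_varint(len(digest)) + digest


def make_cidv1(codec: int, multihash: bytes) -> bytes:
    """Binary CIDv1: varint version (1), varint codec, multihash."""
    return encode_varint(1) + encode_varint(codec) + multihash


def split_cidv1(cid: bytes) -> Tuple[int, int, bytes]:
    """Return (version, codec, multihash) for a binary CIDv1."""
    version, pos = decode_varint(cid)
    codec, pos = decode_varint(cid, pos)
    return version, codec, cid[pos:]


block = b"hello"
cid_raw = make_cidv1(RAW, multihash_sha2_256(block))
cid_cbor = make_cidv1(DAG_CBOR, multihash_sha2_256(block))
```

The two CIDs differ only in the codec field; the multihash, which is all that is needed to find and verify the block, is identical. A protocol that requests blocks by multihash would never see the codec.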

So by accepting IPLD codec identifiers that are used explicitly for the purposes of location addressing/hinting, we may be setting these ecosystem members up for future problems, because they would be using the codecs in ways that are not aligned with how CIDv1s are designed to work. As someone who works on go-ipfs, I want neither to break ecosystem members' software nor to commit to maintaining an old quirk/hack indefinitely.

Stebalien commented 3 years ago

I want to clarify this slightly. While it should be possible to ask for blocks with any or no codec, there's no plan to remove the codec from these requests (for now, at least), as it adds potentially useful metadata. But this is otherwise correct.

mikeal commented 3 years ago

I also agree with everyone, but I do think it’s worth pointing out that not every multiformat is a codec and that we may encounter use cases for a new multiformat that don’t observe the rules we set for codecs. This is worth mentioning because the table we use for all multiformats is actually in this multicodec repo and it’s easy to forget that the breadth of multiformats is much broader than codecs.

That said, the spec for CID says “multicodec” and not “any multiformat” so new multiformats that don’t observe codec rules would not be usable as CID multicodecs.

rvagg commented 3 years ago

@vmx suggested I drop this link in here as it relates to the somewhat squishy definition of a codec, but specifically framing it as telling you what glasses to put on in order to see the data inside the bytes, which can lead to having multiple glasses being able to see the same bytes but yield different forms (raw being an obvious one but there are others): https://gist.github.com/rvagg/1b34ca32e572896ad0e56707c9cfe289
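A small sketch of the "glasses" framing, assuming a codec is nothing more than a decode function selected by its code. The `DECODERS` registry and `decode` helper are hypothetical, and the codes 0x55 (raw) and 0x0200 (json) are my reading of the multicodec table.

```python
import json
from typing import Any, Callable, Dict

RAW = 0x55     # assumed table value for raw
JSON = 0x0200  # assumed table value for json

# Hypothetical registry: codec code -> function turning bytes into a value.
DECODERS: Dict[int, Callable[[bytes], Any]] = {
    RAW: lambda block: block,  # raw: the bytes are the data
    JSON: lambda block: json.loads(block.decode("utf-8")),
}


def decode(codec: int, block: bytes) -> Any:
    """Look at the same bytes through the glasses named by the codec."""
    return DECODERS[codec](block)


block = b'{"kind": "example", "links": []}'
as_raw = decode(RAW, block)
as_json = decode(JSON, block)
```

Both calls succeed on the same bytes and yield different forms: one a byte string, the other a map with two keys. The codec in the CID picks which of those views the reader is expected to use.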

aschmahmann commented 2 years ago

@Gozala asked this in #264 and this seemed like a better place to answer it. It includes some comments about ADLs, which are mostly missing from the comments here so far. It doesn't 100% match the context, but the general question was about using a single codec for two different serialized forms, because the existing codecs would have put the data into two different data model representations when they wanted a unified one.

This is a reasonable position; however, as far as I can tell it does not address the case where multiple underlying IRs [i.e. Data Model Representations] could be used under the hood of the same final representation, which is exactly the case for the dag-ucan library.

If I look at this abstractly the question being asked seems to be: "When should I use a Codec vs SomethingElse for describing some new transformation on top of particular data when a codec that could reasonably represent my data already exists?" (there is another version of this problem when no codecs are available).

My 2c is that codecs feel great for throwing everything you want in there. That is... until you realize that at the moment there is a bunch of ecosystem tooling that doesn't work well with unknown codecs and that you need to get everyone you care about to add support for your codec. Some of them will do it, some of them won't, some of them will do it quickly, some will update very, very slowly.

So then you back up and ask yourself "what do I gain/lose by writing on top of the data model + existing codecs vs bytes?". When does it become worthwhile to make a new codec and get people to start using it and caring about it (yes, things like WASM might help here in the future, as some folks have mentioned)?

Some thoughts:

So if you find yourself in a position where you could choose to use an existing codec or a new one to represent your data it's likely worth asking about the tradeoffs. Some seem easy (don't use codecs for system-addressing), but others are trickier (some of my data translates nicely into the codec but other pieces do not).

warpfork commented 2 years ago

https://ipld.io/docs/synthesis/gtd/ is also a relevant doc on this.

I really like your writeup and those bulletpoints, though, Adin.

Especially the bits about "If the codec is a poor fit for my data then many of the gains are nullified": I don't think that's well-covered in the docs pages yet. I'd love it if you'd throw those comments in there.