multiformats / multicodec

Compact self-describing codecs. Save space by using predefined multicodec tables.
MIT License

add 'solana-tx' ipld ID #299

Closed (riptl closed 2 years ago)

riptl commented 2 years ago

Solana transactions use a compact binary format (bincode). They don't contain content-addressable elements, but I am still requesting an IPLD ID for Merkle-DAGs that carry Solana transactions as leaves.

The 0x5B prefix is arbitrary ("Solana Beach"). Happy to pick any other range.

ribasushi commented 2 years ago

As far as I know the current general consensus regarding codecs is that if:

(the payload) don't contain content-addressable elements

then the correct way to encode is using the raw codec, like 0155<multihash>, indicating "this is an opaque, non-traversable structure".
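For readers unfamiliar with the 0155<multihash> notation, here is a minimal sketch of how such a binary CID can be assembled. It assumes sha2-256 as the hash function (the thread does not say which one is meant), and the helper name raw_cid_bytes and the sample payload are made up for illustration.

```python
import hashlib

CID_VERSION = 0x01  # CIDv1
RAW_CODEC = 0x55    # multicodec code for "raw"
SHA2_256 = 0x12     # multihash code for sha2-256 (assumed here, not stated in the thread)


def raw_cid_bytes(payload: bytes) -> bytes:
    """Hypothetical helper: build a binary CIDv1 with the raw codec for a payload."""
    digest = hashlib.sha256(payload).digest()
    # A multihash is <hash function code><digest length><digest>.
    multihash = bytes([SHA2_256, len(digest)]) + digest
    # Every value used here is below 0x80, so each varint is a single byte.
    return bytes([CID_VERSION, RAW_CODEC]) + multihash


cid = raw_cid_bytes(b"opaque transaction bytes")
print(cid.hex())  # the hex string begins with 0155, followed by the multihash
```

The leading 01 is the CID version and 55 is the raw codec, which is all a consumer learns about the payload: it is bytes with no links to follow.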

There have been various past discussions about (ab)using the codec as a form of "freeform label". I am not sure where current thinking stands on this; last I knew, the stance of the stewards was still "strongly discouraged".

@rvagg @vmx can you chime in?

vmx commented 2 years ago

then the correct way to encode is using the raw codec, like 0155<multihash>, indicating "this is an opaque, non-traversable structure".

That is correct. The "codec" of a CID is really meant to be an IPLD codec, i.e. it contains the information on how to decode the data, so that links can be extracted. If the data doesn't contain content-addressed links, it should use raw.

Some entries in the multicodec table may suggest otherwise, but those usually pre-date IPLD as we know it today.

Since such identifiers are a common request, work is under way on a proposal to solve that problem: https://hackmd.io/@vmx/HkoYAr64o#Application-context-proposal (the "application context" variant is the most promising). It would be great to hear whether that would cover your use case.

rvagg commented 2 years ago

@terorie can you expand a bit on this and what you're trying to achieve?

IPLD ID for Merkle-DAGs that carry Solana transactions as leaves

This:

If the data doesn't contain content-addressed links, it should use raw.

is not strictly true. We do have "serialization" type codecs that will never yield native link types, but the "codec" here still tells you how to decode the bytes you've found once you look up the blob corresponding to the hash. cbor and json are like this: they're effectively terminal (as is raw!), but it is still useful to know how to turn them into data model form.
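To illustrate the point about terminal codecs, here is a rough sketch of a consumer that reads the codec from a binary CIDv1 and picks a decoder for the block bytes. The codes 0x55 (raw) and 0x0200 (json) are from the multicodec table; make_cid, cid_codec and decode_block are hypothetical names, and sha2-256 is assumed for the multihash.

```python
import hashlib
import json

RAW = 0x55     # multicodec: raw
JSON = 0x0200  # multicodec: json


def encode_varint(n: int) -> bytes:
    """Encode an unsigned integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def read_varint(buf: bytes, pos: int = 0):
    """Decode a varint starting at pos; return (value, next position)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def make_cid(codec: int, block: bytes) -> bytes:
    """Hypothetical helper: binary CIDv1 with a sha2-256 multihash."""
    digest = hashlib.sha256(block).digest()
    return b"\x01" + encode_varint(codec) + bytes([0x12, len(digest)]) + digest


def cid_codec(cid: bytes) -> int:
    """Return the codec field of a binary CIDv1."""
    version, pos = read_varint(cid)
    if version != 1:
        raise ValueError("this sketch only handles CIDv1")
    codec, _ = read_varint(cid, pos)
    return codec


def decode_block(cid: bytes, block: bytes):
    """Turn block bytes into data, choosing the decoder from the CID's codec."""
    codec = cid_codec(cid)
    if codec == RAW:
        return block  # opaque bytes, nothing to decode
    if codec == JSON:
        return json.loads(block)  # terminal: never yields links, but still structured
    raise NotImplementedError(f"no decoder registered for codec {codec:#x}")


json_block = b'{"slot": 42}'
decoded = decode_block(make_cid(JSON, json_block), json_block)
opaque = decode_block(make_cid(RAW, b"\x00\x01"), b"\x00\x01")
```

Neither branch ever produces a link, yet the json branch still hands back a structured value, which is the distinction being drawn from raw.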

I'd like to formalise "IPLD" codecs being codecs that can potentially natively yield data model links, and "serialisation" codecs being those that have a self-describing encoding format (e.g. not generic protobuf) that will never yield native data model links.

I'm not sure what this one is but it doesn't seem unreasonable that you could have a CID pointing to it?

riptl commented 2 years ago

Thank you for the thorough reviews. Closing this as this PR is clearly the wrong approach. Indicating the data type in a CIDv2 would be useful indeed.

The main motivation is to be able to iterate IPLD blocks in a CAR, so that each CAR/IPLD block is self-describing.
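As a rough sketch of that motivation: a CARv1 body is a sequence of sections, each a varint length followed by a CID and the block data, so a reader can recover the codec of every block while streaming. The code below assumes CIDv1 only and skips the header without decoding it; iter_car_blocks and the other helper names are hypothetical, and the in-memory "CAR" uses a placeholder header rather than a valid one.

```python
import hashlib
import io


def encode_varint(n: int) -> bytes:
    """Encode an unsigned integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def read_varint(stream):
    """Read one varint from a binary stream; return None if the input has run out."""
    value = shift = 0
    while True:
        raw = stream.read(1)
        if not raw:
            return None
        value |= (raw[0] & 0x7F) << shift
        if not raw[0] & 0x80:
            return value
        shift += 7


def split_section(section: bytes):
    """Split a section into (codec, cid bytes, block data). CIDv1 only."""
    stream = io.BytesIO(section)
    if read_varint(stream) != 1:
        raise ValueError("this sketch only handles CIDv1")
    codec = read_varint(stream)
    read_varint(stream)               # multihash function code, not needed here
    digest_len = read_varint(stream)  # multihash digest length
    end = stream.tell() + digest_len  # the CID ends after the digest
    return codec, section[:end], section[end:]


def iter_car_blocks(stream):
    """Yield (codec, cid, data) for each section, skipping the header undecoded."""
    header_len = read_varint(stream)
    if header_len is None:
        return
    stream.read(header_len)
    while (length := read_varint(stream)) is not None:
        yield split_section(stream.read(length))


def make_section(codec: int, data: bytes) -> bytes:
    """Hypothetical helper: one section holding a sha2-256 CIDv1 and its data."""
    digest = hashlib.sha256(data).digest()
    cid = b"\x01" + encode_varint(codec) + bytes([0x12, len(digest)]) + digest
    return encode_varint(len(cid) + len(data)) + cid + data


header = b"\xa0"  # placeholder bytes, NOT a valid CAR header; it is only skipped
car = (encode_varint(len(header)) + header
       + make_section(0x55, b"tx-1")
       + make_section(0x0200, b'{"k": 2}'))
blocks = list(iter_car_blocks(io.BytesIO(car)))
for codec, cid, data in blocks:
    print(hex(codec), len(cid), data)
```

With the raw codec every Solana transaction block would report 0x55 here, which is why the codec field alone cannot label the data type.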