multiformats / cid

Self-describing content-addressed identifiers for distributed systems

Universal container format based on progressive specialization #23

Open rotemdan opened 6 years ago

rotemdan commented 6 years ago

[This is a work-in-progress draft design which has been heavily edited since it was first published]

This is an attempt at designing a highly flexible, yet compact, multipurpose container format that can function as a content/entity identifier, as a file header, as part of a protocol message, or even as a container for both metadata and data.

Basically there's a very simple underlying concept here: that successive type enumerations can be used to progressively "namespace" into more and more specialized contexts describing more fine-grained information. Note these type enumerations don't have to be limited to built-in fields (like entity domain or schema version) -- they can be dynamically inferred from fields whose semantics are progressively refined by the schema itself (somewhat like a state machine).

(This is mostly an illustrative example of how such a format could be designed, but I did put a lot of thought into it, so I think it's a worthwhile read.)

It starts with a message encoding identifier (1 character), which can be any one of raw-binary, base64, base32 etc:

<message encoding [1 char]>

Now that we're in binary, a version number for the container format (varint):

<container version [varint]>

Now a varint for an entity domain identifier (e.g. file, ipfs, ipns, https, bitcoin, ethereum etc.):

<entity domain [varint]>

And now a varint version number of the schema for the domain (each domain independently maintains its own schema versioning):

<domain-specific schema version [varint]>

Now the base payload (AKA required fields), whose schema is specialized for the particular domain and schema version. Note that the total length is included so that a client can segment the payload even if it is unfamiliar with that particular combination:

<base payload length [varint]>
<base payload [arbitrary binary layout - can be variable length]>

And now field data (AKA optional fields), in a simplified protocol buffer like encoding (roughly described below):

<field data [unspecified total length]>

That's all really. It's not bound to contain a hash of any sort, or to be associated with a particular category within a set of predefined codec types.
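The layout above can be sketched as a small parser. This is only an illustration under some assumptions the post does not state: "varint" is taken to mean an unsigned LEB128-style varint (7 payload bits per byte, high bit as continuation flag), the 1-character message encoding has already been stripped and decoded, and the names `decode_varint`, `Container` and `parse_container` are made up for the example.

```python
from typing import NamedTuple, Tuple


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode an unsigned varint starting at pos; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: this was the last byte
            return value, pos
        shift += 7


class Container(NamedTuple):
    container_version: int
    entity_domain: int
    schema_version: int
    base_payload: bytes  # layout depends on domain + schema version
    field_data: bytes    # undecoded remainder (optional fields)


def parse_container(body: bytes) -> Container:
    """Parse the binary part, i.e. everything after the message encoding char."""
    version, pos = decode_varint(body)
    domain, pos = decode_varint(body, pos)
    schema, pos = decode_varint(body, pos)
    length, pos = decode_varint(body, pos)
    if pos + length > len(body):
        raise ValueError("base payload truncated")
    end = pos + length
    return Container(version, domain, schema, body[pos:end], body[end:])


# Hand-built input: version 2, domain 5, schema 1, 3-byte payload, 2 bytes of field data.
parsed = parse_container(bytes([2, 5, 1, 3]) + b"abc" + b"\x00\x07")
```

Because the base payload length is explicit, the parser can separate the payload from the field data without knowing anything about domain 5 or its schema.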

Example: say we want to encode [raw-binary, container version 2, IPFS, schema version 1]. The first required field would be the resource type, say UnixFS File, which in turn would refine the schema further to expect <dag hash type [varint]> and <dag hash [binary string]> as the following fields.

The base document would look something like:

<encoding: "b" [1 character]>
<container version: 2 [1 byte]>
<entity domain: IPFS [1 byte]>
<domain-specific schema version: 1 [1 byte]>
<base payload length: 34 [1 byte]>
<resource type: UnixFS file [1 byte]>
<dag hash type: sha-256 [1 byte]>
<dag hash [32 bytes]>

(Total length: 1 char + 38 bytes)
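As a quick check of the byte counts, the base document can be assembled by hand. The numeric codes below (`DOMAIN_IPFS`, `RESOURCE_UNIXFS_FILE`, `HASH_SHA256`) are hypothetical placeholders, since the post does not assign any, and the SHA-256 of a short byte string stands in for a real DAG hash.

```python
import hashlib

# Hypothetical codes, chosen only for illustration. All are below 128,
# so each one fits in a single-byte varint.
DOMAIN_IPFS = 1
RESOURCE_UNIXFS_FILE = 1
HASH_SHA256 = 1


def build_base_document(dag_hash: bytes) -> bytes:
    """Build the binary part of the example (without the encoding character)."""
    base_payload = bytes([RESOURCE_UNIXFS_FILE, HASH_SHA256]) + dag_hash
    header = bytes([
        2,                  # container version
        DOMAIN_IPFS,        # entity domain
        1,                  # domain-specific schema version
        len(base_payload),  # base payload length (34, still a 1-byte varint)
    ])
    return header + base_payload


stand_in_hash = hashlib.sha256(b"example data").digest()  # 32 bytes
body = build_base_document(stand_in_hash)
message = b"b" + body  # the encoding character from the example, then the binary part
```

Here `body[3]` is 34 and `len(body)` is 38, matching the base payload length and total given above.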

Optional fields:

Each optional field is structured as:

<data type and field identifier [varint]>
<field payload>

Where the first bit of data type and field identifier represents the type and the rest the field identifier (specific to the particular schema). The identifier can grow indefinitely since it's a varint. A single-byte varint carries 7 payload bits, so after the type bit there are 6 bits left, which can support up to 64 different field IDs.

Data type can be:

0: varint 
1: length prepended binary string (where length is a varint)

(I'm not sure if there's a need for anything else, since booleans can be contained in bitfields and floats can be stored in binary strings)
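A minimal sketch of the combined key, assuming that "first bit" means the lowest-order bit of the varint value (the post leaves this open) and that the varint is an unsigned LEB128-style encoding. The helper names are hypothetical.

```python
from typing import Tuple

TYPE_VARINT = 0
TYPE_BYTES = 1  # length-prepended binary string


def encode_varint(n: int) -> bytes:
    """Unsigned varint: 7 payload bits per byte, high bit set on all but the last."""
    if n < 0:
        raise ValueError("varint must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_field_key(data_type: int, field_id: int) -> bytes:
    """Pack the type into the lowest bit (assumed) and the field ID above it."""
    if data_type not in (TYPE_VARINT, TYPE_BYTES):
        raise ValueError("unknown data type")
    return encode_varint((field_id << 1) | data_type)


def split_field_key(key: int) -> Tuple[int, int]:
    """Inverse of the packing above: return (data_type, field_id)."""
    return key & 1, key >> 1
```

Under these assumptions field IDs 0 to 63 produce a 1-byte key and IDs 64 to 8191 a 2-byte key, which agrees with the 64-ID figure above.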

So let's say for the example we wanted to add a file size, chunking algorithm and max chunk size optional fields to the base CID:

<data type: 0, field id: file size (#0) [1 byte]>
<field payload [6 bytes]>
<data type: 0, field id: chunking algorithm (#1) [1 byte]>
<field payload [1 byte]>
<data type: 0, field id: max chunk size (#2) [1 byte]>
<field payload [3 bytes]>

Totals (file size: 7 bytes, chunking algorithm: 2 bytes, chunk size: 4 bytes). Of course, if the information cannot be represented here (say, chunking is variable), it may simply not be included at all.
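The totals can be reproduced with a little arithmetic. The concrete values are invented to match the payload sizes in the example (a 1 TiB file, algorithm code 0, 256 KiB chunks), and the sketch again assumes 7 payload bits per varint byte with the type in the lowest bit of the key.

```python
def varint_len(n: int) -> int:
    """Bytes needed to store n as an unsigned varint (7 payload bits per byte)."""
    return max(1, (n.bit_length() + 6) // 7)


def varint_field_size(field_id: int, value: int) -> int:
    """Size of a type-0 field: key varint plus value varint."""
    key = (field_id << 1) | 0  # assumption: data type lives in the lowest bit
    return varint_len(key) + varint_len(value)


sizes = {
    "file size": varint_field_size(0, 2**40),                # 1 + 6 bytes
    "chunking algorithm": varint_field_size(1, 0),           # 1 + 1 bytes
    "max chunk size": varint_field_size(2, 256 * 1024),      # 1 + 3 bytes
}
total = sum(sizes.values())
```

With these values the three fields take 7, 2 and 4 bytes, 13 bytes in total on top of the 38-byte base document.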

Now let's say the user also wants to add a signature for the hash, which is not supported in the base schema, so they would need to use their own application-specific field identifier in a reserved range (for this example, say 4096+ is reserved [4096 is roughly midway within the range available for 2-byte identifiers]).

<data type: 1, field id: hmac-sha-256 hash signature (#4096) [2 bytes]>
<field payload [1 for length + 32 bytes for data]>

Even if the client doesn't understand this field, it can safely ignore and skip it since all the length information is available through the encoding itself.
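A sketch of how a client might walk the field data and skip what it does not recognize. Assumptions: unsigned LEB128-style varints, the data type in the lowest bit of the key, and a hypothetical `KNOWN_FIELDS` table standing in for whatever the schema defines.

```python
from typing import Dict, Iterator, Tuple, Union

FieldValue = Union[int, bytes]

# Hypothetical field IDs for the example schema.
KNOWN_FIELDS = {0: "file size", 1: "chunking algorithm", 2: "max chunk size"}


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode an unsigned varint starting at pos; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def iter_fields(field_data: bytes) -> Iterator[Tuple[int, FieldValue]]:
    """Yield (field_id, value) for every field, known to the client or not."""
    pos = 0
    while pos < len(field_data):
        key, pos = decode_varint(field_data, pos)
        data_type, field_id = key & 1, key >> 1
        value: FieldValue
        if data_type == 0:  # varint payload
            value, pos = decode_varint(field_data, pos)
        else:               # length-prepended binary string
            length, pos = decode_varint(field_data, pos)
            if pos + length > len(field_data):
                raise ValueError("field payload truncated")
            value = field_data[pos:pos + length]
            pos += length
        yield field_id, value


def read_known_fields(field_data: bytes) -> Dict[str, FieldValue]:
    """Keep only the fields this client understands; the rest are skipped."""
    return {
        KNOWN_FIELDS[field_id]: value
        for field_id, value in iter_fields(field_data)
        if field_id in KNOWN_FIELDS
    }


# file size = 5, then an unknown binary field (#4096), then max chunk size = 7.
sample = b"\x00\x05" + b"\x81\x40\x03abc" + b"\x04\x07"
known = read_known_fields(sample)
```

The unknown field #4096 is consumed using only its type bit and length prefix, so the field that follows it is still read correctly.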

Note that it's possible to standardize identifiers within the range 4096+ as application reserved globally for all domains. This would mean that application-specific fields could be added to a document even if its schema is not understood by the client.

rotemdan commented 6 years ago

I've made some major changes, especially to generalize the terminology:

  1. Flexible content descriptor -> Universal container format
  2. CID -> Container
  3. Protocol -> Entity domain
  4. Required fields -> Base payload
  5. Optional fields -> Field data

and removed resource type as a built-in field, since not all domains/protocols would need it.

It turned out to be a significant challenge to describe this in a clear manner, so I might come back to polish it a little more. Since I think it got to a reasonably stable form, I would be interested in getting some feedback. Any questions? Suggestions for improvement? Clarifications?

geoah commented 5 years ago

Is this effort part of cid/ipfs/ipld/multiformats or something different/new?

eikeon commented 3 years ago

@rotemdan, it's been a couple of years. Curious if you've continued down this path?

rotemdan commented 3 years ago

This is an idea I suggested several years ago with the purpose of potentially unifying content identifiers and IPLD documents.

Basically having one highly flexible data format that could describe anything, and that would be compact enough to be transmitted as a link (albeit possibly a long one).

This means that reasonably simple/small files would not require fetching an additional metadata (IPLD) file from the network. The link would contain all the hashing information and the extra metadata required to safely retrieve and verify the data (not just from IPFS, but also from http, bittorrent or potentially any other protocol). As long as you have the link, you'd still have a chance of safely acquiring the file from somewhere. This is in contrast to IPFS, where once the IPLD document becomes unavailable, the associated data cannot be retrieved or verified.

In a sense, it presents a vision that's quite different from the way IPFS was initially designed. It's not bound to a set of predetermined protocols, and is not "locked" to the IPFS ecosystem.

Since I never got any comment about this idea from IPFS team members, I'm assuming this either doesn't fit their business model or is too much of a departure from the founders' original design of the network, to the point that they may feel that going in this particular direction would diminish their sense of "ownership" of their own product.