ipld / specs

Content-addressed, authenticated, immutable data structures

Better separate codecs and blocks in documentation #244

Closed · warpfork closed this 4 years ago

warpfork commented 4 years ago

I move that we split "block layer" and "codec layer" apart, and do so consistently across all our specs and docs.

Fundamental reason: the work the codec code does is almost entirely about how to encode/linearize/serialize/whatever-you-call-it Data Model trees into flat bytes (and of course vice versa, unfurling flat bytes to DM). It has very little to do with blocks. There are two separate things going on here, and we should reserve space to talk about each of them.

Another proximate reason: the current block-focused naming makes people think that a "blockstore" interface is the centerpiece of this layer... And it's really, really not; we mislead people super badly in this regard. We should refine our terminology so that we don't impart this mis-impression.

Another, tertiary reason: putting codecs under the heading of blocks is likely to mislead people into thinking that our codecs can't do streaming operation, which just isn't true, and makes us sound more limited than we are (see the sketch below).

And as a quaternary concern: we should also redouble our efforts to make sure there's no possibility of confusing "blocks" in IPLD with "chunks" and "chunking" of files in IPFS / unixfs-applications. While we use the word "chunking" for the thing that's done to break up large byte ranges, imo neither we nor (just as importantly) other literature in the space at large use either term with 100% consistency, so the possibility of confusion is very real. At minimum, identifying "blocks" cleanly and separately from "codecs" should help with this. (At maximum, I wonder sometimes about the word "block" at all; but I don't come prepared with an alternative suggestion, so I'll park that thought.)
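
To make the codec-as-serialization point (and the streaming point) concrete, here's a minimal sketch of what a codec-layer interface could look like. The names here are hypothetical, not any particular library's API; the point is that a codec is a pure transformation between Data Model trees and byte streams:

```go
package codec

import "io"

// Node is a stand-in for a Data Model node (maps, lists, strings, etc.).
type Node interface{}

// A codec is nothing more than a pair of functions between Data Model
// trees and byte streams. Nothing here knows about blocks, hashes, CIDs,
// or storage -- and because it works on io.Reader/io.Writer rather than
// whole byte slices, streaming operation falls out naturally.
type Encoder func(n Node, w io.Writer) error
type Decoder func(r io.Reader) (Node, error)
```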


Here are some short phrases that describe this relationship:

Block->Codec->DataModel(<->Schema|ADL)

You have a codec that transforms the Data Model into a block and back. The block is the unit the content-addressed identifier is based on.
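
A minimal sketch of that pipeline, assuming encoding/json as a stand-in codec (a real dag-json codec does more) and sha2-256 as the hash; the CID bytes follow the CIDv1 binary layout, but this is hand-rolled for illustration, not any real library's API:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/json"
	"fmt"
)

func main() {
	// Data Model: a plain Go map stands in for a Data Model node.
	node := map[string]any{"hello": "world"}

	// Codec: linearize the Data Model tree into the block's bytes.
	blockBytes, err := json.Marshal(node)
	if err != nil {
		panic(err)
	}

	// Block boundary: hashing happens here, over the encoded bytes.
	digest := sha256.Sum256(blockBytes)

	// CID: version, codec code, then multihash (fn code, length, digest).
	// 0x0129 is dag-json's multicodec code; 0x12 is sha2-256.
	cid := []byte{0x01} // CIDv1
	cid = binary.AppendUvarint(cid, 0x0129)
	cid = binary.AppendUvarint(cid, 0x12)
	cid = binary.AppendUvarint(cid, uint64(len(digest)))
	cid = append(cid, digest[:]...)

	fmt.Printf("block: %s\ncid: %x\n", blockBytes, cid)
}
```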

What do y'all think?

warpfork commented 4 years ago

Phrases at the bottom are attributed to @vmx ; we did a little cooking on this thought earlier in the week. :) He also talked me down from "rename block to codec" to this idea of splitting them instead, which I think turns out much better.

vmx commented 4 years ago

Another benefit of separating block and codec is that we can make clear that IPFS has limitations on the block size; it's not a limitation of the codec. That has confused people in the past. It probably also makes it easier to state that this block limit is not inherent in IPLD, but only applies if you use Bitswap (or probably any other networking thing).

creationix commented 4 years ago

I've been thinking a lot about a storage backend I've used in the past that's essentially a giant virtual block device with some really nice properties[1] that make it work well in P2P environments. I've always been kinda sad that there was no place in our stack to use such an abstraction. But now I'm thinking that if codec were its own layer, an optimized filesystem for storing serialized data could be designed that uses the large block device as storage instead of storing and locating serialized 'blocks' on their own.

But if codec is tightly coupled with blocks and storage, then there is no place for this tool I want to use for my personal projects.

[1] really nice properties:

mikeal commented 4 years ago

Just for the benefit of this conversation, I’ll re-iterate a few of the reasons we created the Block layer.

  1. Simplicity. The layer model doesn't need to be strictly enforced in the abstractions; it's mainly there to help people learn the components and how they fit together. Prior models were much more complicated, and the last reduction left us with the Block layer.
  2. The Block layer holds codecs and CID. This was important when CID was an IPLD project, but we actually moved it to multiformats and even teach it in ProtoSchool as a multiformats technology, so we probably don't need to capture it the same way anymore.

I think that Block is a very important concept and term that we need to keep around, but I’m not entirely sure it needs to be a “Layer” any longer.

I’ve definitely been fighting against the “need a Blockstore” thing, but I’m not sure how much of that really comes from having a “Block Layer.”

That said, I'm not in favor of "splitting" the Block Layer; that just seems to complicate things. We could rename it to "Codec Layer" with few other changes.

rvagg commented 4 years ago

Based on this discussion, what would you change in this diagram?

[Screenshot: proposed layer diagram, 2020-03-10 13:07:35]
warpfork commented 4 years ago

Fwiw, I had tried to write something similar in an ascii diagram:

                           [User / Application logic]

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━^━━v━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                                                                         ┃
┃  D  M   ┌───────────────────┐  ┌─────────────────┐  ┌─────────┐   D  M  ┃
┃  A  O   │                   │  │                 │  │         │   A  O  ┃
┃  T  D   │ Advanced Layouts  │  │ bare Data Model │  │ Schemas │   T  D  ┃
┃  A  E   │                   │  │                 │  │         │   A  E  ┃
┃     L   └───────────────────┘  └─────────────────┘  └─────────┘      L  ┃
┃                                                                         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━^━━━━━━━━━━━v━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃              (unmarshalling) ─┘           └── (marshalling)             ┃
┃                                                                         ┃
┃                                                                         ┃
┃                                   Codecs                                ┃
┃                                                                         ┃
┃       [json] [cbor] [dag-json] [dag-cbor] [git] [eth] [etc...!]         ┃
┃                                                                         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━^━━━v━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃      (reading + (verifying) hash) ─┘   └── (writing + hashing)          ┃
┃                                                                         ┃
┃                                    Block                                ┃
┃                                                                         ┃
┃                                                                         ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

But I like yours entirely better, @rvagg.

You can see I put "basic" data model in its own box in my diagram... but I wasn't really happy with that. Yours is better.

One interesting thing to note from mine: trying to name the arrows between levels highlighted one thing that I think is better explained by this splitting of Codec from Block in the big-picture diagram: see how we now get to point out that hashing takes place at the scale of blocks?

I like being able to point out that hashing happens on the transition to and from Block, because it's accurate. For example, in go-ipld-prime, this line is the closest that hashing and codecs ever get to each other: and it's not close at all: neither package is directly importing the other, and that line is putting together mcDecoder (which is a MulticodecDecoder interface, nothing specific) with a hasher that pipes into CID.Sum (which is also effectively an interface rather than any particular hasher, of course, since that's one of the points of CID). The coupling between Codecs and any detail of hashing is very distant: aside from the multicodec byte appearing in CIDs, they're almost totally separate, with just a byte pipe between them. So having a big line in our layercake that helps us say that looks useful in my eyes.
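
As a small illustration of that "just a byte pipe" coupling (encoding/json standing in for a codec here; this is a sketch, not go-ipld-prime's actual wiring):

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/json"
	"fmt"
	"io"
)

func main() {
	node := map[string]any{"hello": "world"}

	// The codec and the hasher never import or see each other; they share
	// only a byte stream. Either can be swapped without touching the other.
	hasher := sha256.New()
	var buf bytes.Buffer
	w := io.MultiWriter(&buf, hasher)

	if err := json.NewEncoder(w).Encode(node); err != nil {
		panic(err)
	}
	fmt.Printf("block: %s", buf.Bytes()) // Encode appends a trailing newline
	fmt.Printf("digest: %x\n", hasher.Sum(nil))
}
```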

rvagg commented 4 years ago

oh yeah, I like those state transitions; might try to get them in, although I fear clutter

rvagg commented 4 years ago
[Screenshot: updated layer diagram, 2020-03-10 19:49:23]
vmx commented 4 years ago

In my mind, Schemas are on top of the Data Model. Application code might not interact with the Data Model directly, but with the more featureful Schemas. Putting the Schemas on top, though, makes them look like a layer, and it's hard to indicate that you can also use the Data Model directly.

So perhaps putting Schemas on the edge:

      Application Code 
       ^           ^
       |           |
       v           v
+---------+-----------------+
| Schemas |                 |
+---------+   Data Model    |
|                           |
| +-----------------------+ |
| | Advanced Data Layouts | |
| +-----------------------+ |
+---------------------------+

My diagram shows that Schemas operate within the Data Model's bounds, but can be used directly. The Data Model may be used without Schemas. Advanced Data Layouts behave like the Data Model, hence are completely subsumed.

warpfork commented 4 years ago

There's lots that's true about that, but I think I might want to play the "let's make didactically useful simplifications" card here in favor of @rvagg 's visualization.

One of the big things I wanna improve the communication on is that one can still treat schema-handled data with Data Model interfaces -- because istm it's super easy to lose this detail (lots of people porting their assumptions from other universes default to thinking that once schemas get involved, the game changes completely; we're not like that, and it's one of our major distinguishing traits); and also it's critical for library design to get this right so we can have generic traversals, selectors, etc. work over data handled with schemas.

I think it's comparatively easy to write a block of text saying "Using Schema features may also grant you additional, more specialized APIs for interacting with the data." and put that after or near the diagram. I suspect people will pick that up, go "oh. k", and move on easily. Therefore it's much more important to use the big visual real estate to emphasize the initially-more-surprising fact that schema data can still be treated 100% like Data Model.

vmx commented 4 years ago

> (lots of people porting their assumptions from other universes default to thinking that once schemas get involved, the game changes completely; we're not like that, and it's one of our major distinguishing traits)

That point convinces me. So people look at the diagram and wonder "why are the Schemas inside the Data Model?" They read one more sentence and are blown away.

mikeal commented 4 years ago

> is that one can still treat schema-handled data with Data Model interfaces

The one place that this is sort of not true is the rename feature in schemas.

mikeal commented 4 years ago

The description of "Block" in the new diagram is a bit at odds with how we've defined the term previously. In the past, Block has included the CID, at least in the terminology. It's not just an arbitrary array of bytes; it's an array of bytes with an address, and that address also includes the Codec.

When we’re first introducing IPLD, the faster we get to linking the better. It’s not clear why any of this stuff is useful or important until you understand how linking works. There’s very little difference between IPLD and something like Serde until you explore linking.

warpfork commented 4 years ago

> > one can still treat schema-handled data with Data Model interfaces
>
> The one place that this is sort of not true is the rename feature in schemas.

Nope, still true!

For some schema:

type Foo struct {
    bar String (rename 'zazz')
}

We should be able to do either of these two things (roughly; pseudocode):

n := Unmarshal(UntypedNodeStyle, `{"zazz":"zy"}`)
assert(n.Kind() == ipld.Kind_Map)
assert(n.MapKeys() == ["zazz"])
//assert(n.Type().Kind() == ipld.TypeKind_Struct)
// ^ this last one is false, of course: `n` doesn't even have a `Type` method!

and:

n := Unmarshal(codegen.Foo, `{"zazz":"zy"}`) // unmarshal follows the rename directive
assert(n.Kind() == ipld.Kind_Map)
assert(n.MapKeys() == ["bar"]) // uses the schema fields here!
assert(n.Type().Kind() == ipld.TypeKind_Struct)

So, one can still treat schema-handled data with Data Model interfaces.

Is it exactly the same data? No. The schema may be acting as a lens that affects how we see it. The rename directives are an example of this. So is basically any use of a non-default representation (such as the tuple representation on a struct). But it's close -- it's bidirectionally morphable to the data we see without the schema -- and both of them can be handled using the Data Model interfaces.
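
As a hedged Go/JSON analogy for that "lens" idea (this is not IPLD's actual machinery): a struct whose serial representation is a fixed-order tuple still presents named fields at the typed level, and the two views morph into each other losslessly:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Point has named fields at the typed level, but a tuple representation
// ([x, y]) at the serial level -- analogous to `representation tuple`.
type Point struct{ X, Y int }

func (p Point) MarshalJSON() ([]byte, error) {
	return json.Marshal([2]int{p.X, p.Y}) // write the tuple form
}

func (p *Point) UnmarshalJSON(b []byte) error {
	var t [2]int
	if err := json.Unmarshal(b, &t); err != nil {
		return err
	}
	p.X, p.Y = t[0], t[1] // map tuple positions back onto named fields
	return nil
}

func main() {
	b, _ := json.Marshal(Point{3, 4})
	fmt.Println(string(b)) // [3,4] -- the representation view

	var q Point
	_ = json.Unmarshal(b, &q)
	fmt.Println(q.X, q.Y) // 3 4 -- the typed view
}
```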

rvagg commented 4 years ago

new version with updated text for Schemas:

Schemas are a means of formalizing the shape of data structures within the bounds of the Data Model and may present the data in an altered form (e.g., a “struct” containing a fixed list of fields, serialized as an array with fixed ordering).

And Block:

Blocks are arbitrary arrays of bytes identified by a CID (content identifier, including hash and codec details). IPLD doesn’t concern itself with the source or nature of these bytes as long as its Codecs can read and/or write them. Limitations (size, location, availability, etc.) are concerns of the data source.

Does that get us closer?

[Screenshot: updated diagram with the revised Schemas and Block text, 2020-03-18 13:21:59]

Do we want to do something with this, or just keep it as a resource in our pockets for now? I could pass it on to the design folks to nicify it.

warpfork commented 4 years ago

I flirted with splitting the schema description into two sentences, and with splitting the mention of CIDs into a separate sentence about blocks and putting it towards the end. But I ended up still equally satisfied with your text.

I really think we should keep pushing this. We can keep hedging on whether it's perfect, but I think if we compare this to the three boxes on the very front of the ipld/specs repo right now, it's clear that this is vastly, vastly more informative.

Ericson2314 commented 4 years ago

I was really confused by the current docs. This looks great and is vastly closer to what I'd expect. Big :+1: from me.

I'd add that in addition to marshalling and unmarshalling, there is also validation, both of the packed, ready-to-be-hashed version (e.g., are these bytes beginning with "tree" a valid git tree object?) and of the unpacked version (e.g., does the JSON object look like {<name>: {hash: <hash>, mode: <mode>}}?).

Validation can sometimes be implemented by just doing a round trip, but it is usually better to implement it more directly, both for performance and because that code should serve as documentation / a "reference implementation" of the codec specification.
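
A small sketch of both approaches, with plain JSON in place of a real codec (note the caveat: round-trip checking is only sound for codecs with a deterministic, canonical encoding, which strict codecs like dag-cbor aim for):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// validByRoundTrip decodes and re-encodes, accepting only bytes that
// survive the trip unchanged -- i.e. only the canonical encoding.
func validByRoundTrip(raw []byte) bool {
	var node any
	if err := json.Unmarshal(raw, &node); err != nil {
		return false // not decodable at all
	}
	re, err := json.Marshal(node)
	if err != nil {
		return false
	}
	return bytes.Equal(raw, re)
}

func main() {
	// Direct validation: cheaper, and doubles as codec documentation.
	fmt.Println(json.Valid([]byte(`{"a":1}`))) // true

	// Round-trip validation: stricter; rejects non-canonical forms.
	fmt.Println(validByRoundTrip([]byte(`{"a":1}`)))  // true
	fmt.Println(validByRoundTrip([]byte(`{"a": 1}`))) // false (whitespace)
}
```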

rvagg commented 4 years ago

Iterating on the design with someone who knows what they're doing; here's the current incarnation, which is close. #266 needs to be resolved (maybe just by removing ADLs from here for now), and we also need to make it clear that "Data Model" is a thing in itself and not just a category for the things inside it.

Good point about validation, but I think I'd rather leave it implicit in here so as not to add clutter. We could add it in the textual section if we can do it without bloating it too much. This needs to be as simple as possible to communicate the core parts of IPLD without making people's eyes glaze over, or looking like the stereotypical conspiracy pinboard full of newspaper clippings with pieces of string connecting seemingly unconnected pieces.

Ericson2314 commented 4 years ago

Glad to hear this is still coming along!

> We could add it in the textual section if we can do it without bloating it too much.

Oh, that's totally fine with me. I wrote my comment with the basic ideas in mind rather than the specific diagram and introduction page which the conversation had shifted to discussing. Sorry I wasn't clear about that.

mvdan commented 4 years ago

Here's another :+1: from me. I admit I had been understanding codecs to only really exist as part of blocks.

I was also somewhat misled into thinking that schemas and ADLs are an entirely separate layer compared to the data model, when in fact they are very close. Moreover, the data model can be used directly for many use cases, but my first reading of the specs diagram led me to believe that the vast majority of users would be using schemas (since it's the highest level).

rvagg commented 4 years ago

ok, let's see if we can instantiate this: https://github.com/ipld/specs/pull/293