ipld / specs

Content-addressed, authenticated, immutable data structures
Other
592 stars 108 forks source link

feat: support inline bytes in a byte list #280

Closed mikeal closed 4 years ago

mikeal commented 4 years ago

This is a feature @ribasushi surfaced the need for recently. This comes up using smarter chunkers where you may have just a byte or two as separators and it’s smaller to inline them than to be forced to use a link.

@warpfork it seems a little weird to being using the & syntax in this union since it’s already kinded on the link, but I think it’s still correct right?

warpfork commented 4 years ago

I think this might want a little more poking.

(I'll try to write a concrete alternative proposal, but as I think you just mentioned during the weekly call, yeah, this is a very recursive structure; it sort of gives one headspins to handle it while trying to imagine all the user stories it could be applied to suit...)

warpfork commented 4 years ago

Okay, here's what I came up with:

type FlexibleByteLayout union {
  | Bytes bytes
  | NestedByteList list
  | &Kernel link
} representation kinded

type Kernel union {
  | Bytes bytes
  | NestedByteList list
}

type ShortOrBlock union {
  | Bytes bytes
  | &Kernel link
}

type NestedByteList [ NestedByte ]

type NestedByte struct {
  length Int
  part ShortOrBlock
} representation tuple

It has several more unions, but it's correspondingly a bit more structurally confined. (I also invented entirely placeholder names for the new unions. They don't make sense; please replace them...)

Examples of what this can match:

I think that's everything we want, and nothing we don't.

(I'm surprised it seems to have taken me three distinct unions to do this, though. Is there a shorter way?)

warpfork commented 4 years ago

The above has the weird feature of having a 'length' int in a NestedByte structure even if the part is about to be a Bytes embed, which isn't very helpful because bytes in our systems pretty much always have a length prefix at the codec level anyway, so maybe replacing those last couple of types with these would be better:

type ShortOrBlock union {
  | Bytes bytes
  | NestedByte map
}

type NestedByteList [ ShortOrBlock ]

type NestedByte struct {
  length Int
  part &Kernel
} representation tuple
rvagg commented 4 years ago

I think this adds back in the flexibility I suggested was missing at the bottom of https://github.com/ipld/specs/pull/211#issuecomment-645939662, so you could inline a bunch of bytes and keep the structure of them intact. There are also objections in that thread to such inlining capabilities (e.g. https://github.com/ipld/specs/pull/211#issuecomment-573753702) namely that it increases the possibilities for representing the same sequence of bytes in different ways. IIRC Steven raised this concern somewhere too.

Re @warpfork on the possibility of linking to links, FlexibleByteLayout -> FlexibleByteLayout -> FlexibleByteLayout is kind of like touch a; ln -s a b; ln -s b c; ... which isn't that terrible and maybe there's legitimate cases for it. This change already adds considerable additional flexibility in layouts so stopping short of adding top-level recursion seems a little arbitrary?

Also, minus the "lengths" stuff, this takes us back to near the beginning of FBL, e.g. https://github.com/ipld/specs/blob/f4e06c61eecbdde9907e748032591c3bcba07f75/schema-layer/data-structures/data.md, what does that mean?

warpfork commented 4 years ago

FWIW, I think I've previously also been on team "increasing the range of possibilities for representing the same sequence of bytes in different ways may be bad", but I'm coming off of that stance.

Something I want from ADLs in practice in our libraries still is: some defined behavior for constructing them so that I can use that behavior to back a NodeBuilder, and thus let people create ADL data using the regular Data Model interface, rather than needing to Know they're doing something with an ADL.

(And I want this also so that given some traversal I'm doing over a large IPLD graph, which may or may not have hit an ADL at any given point, and I do a "mutation" there: the library should be able to look at the current Node, ask it to create a new NodeBuilder that can be used to "replace" that Node... and if that Node as an ADL, the new NodeBuilder should also create an ADL and use the same strategies, whatever those may be.)

But what I've realized I don't need for this: is any strict binding between those builders and a schema. Those builders could carry various strategy code and configuration, and many variations of them can still produce data matching the same schema. That's sort of fine.

I've also been previously worried about it for holistic reasons around convergence-in-practice and dedup-emergent-in-communities (with-minimal-coordination), but I think I now retract my previous advocations that we try to enforce that anywhere in this particular technical arena.

warpfork commented 4 years ago

Linking-to-links is something we probably ought to have a bigger discussion about on its own. I think the utility of it is debatable in this particular context, but there's also some general questions about them.

(The bigger picture / general questions in brief: Does link-to-a-link even make sense in content-addressed immutable systems (it's not really the same as symlinks -- if none of them can be changed, there's no semantic purpose possible to adding hops that can always be statically resolved)? Given that it doesn't seem to be something that has useful semantics, but adds latency and blocks: do we want to support it? Symlinks in filesystems are also weird in that they caused stat and lstat to both be necessary -- is there a similar reckoning necessary for IPLD pathing (because if so, we haven't defined it yet)?)

chafey commented 4 years ago

+1 for embedded byte streams. One use case for this is when you want to store a file that uses a legacy file format where the actual binary content is chunked and delimited. The user just wants the chunked data, but not the delimiters. Being able to store the delimiters as embedded bytes and the chunks as separate blocks is perfect. This way the original file format is preserved, but clients can access the underlying data (chunks) using IPLD constructs (even linking directly to the underlying chunks)

mikeal commented 4 years ago

I hadn’t thought about this when I wrote it, but the infinite nesting is actually f’ing awesome.

It means that you can take any FBL and stick it in another FBL, basically you can take any linear byte representation and string it together into another linear byte representation without copying. That’s amazing!

Unless someone has a strong reason why this shouldn’t be allowed I’m inclined to leave it.

mikeal commented 4 years ago

Linking-to-links is something we probably ought to have a bigger discussion about on its own. I think the utility of it is debatable in this particular context, but there's also some general questions about them.

Agreed, we haven’t thought about this enough and we should put it on the list of things we’re actively considering. That said, I’d like to merge this in to the draft specification to aid that consideration, and we will hold off on moving the spec to a more mature state until we’re confident we’ve thought this through.

ribasushi commented 4 years ago

I’d like to merge this in to the draft specification to aid that consideration

Neutral from me, I have an inkling something isn't quite right here, but I don't have the headspace right this moment to get to it. Let's merge, I will try to jolt down thoughts before Fri-ish

warpfork commented 4 years ago

I thumb's up'd those last couple of comments, but to make it clear and make sure there's a notification emitted too:

Okay, I'm convinced. As long as we know there's some considerations to do about link-to-link.