feat: require deterministic map serialization

ipld / specs

Content-addressed, authenticated, immutable data structures

Other

592 stars 108 forks source link

feat: require deterministic map serialization #235

Closed mikeal closed 3 years ago

mikeal commented 4 years ago

In order to be “data model compliant” we should require deterministic maps.

This means that all the dag-{format} codecs must add deterministic key sorting during serialization. We should add the details for each codec in those codec specs, but there should be a general requirement that some form of determinism is defined.

vmx commented 4 years ago

I would specify the ordering the Data Model and having the codec store in which way it wants (I agree that it should be fixed per codec). When you interact with Selectors, which operate on a Data Model level, you want to have that one specific ordering, you don't care how it is actually stored.

mikeal commented 4 years ago

I would specify the ordering the Data Model

If a format already has deterministic ordering by other means it’s preferable to use that. If we specific it we could potentially exclude a format from being data model compliant if it doesn’t match what we specify.

vmx commented 4 years ago

If a format already has deterministic ordering by other means it’s preferable to use that.

This means that traversals (e.g. with Selectors) depend on the codec.

mikeal commented 4 years ago

Yes, but that’s already true.

If we allow for streaming parsers then the keys will come out in the encoded order, which means it relies on codec write-time determinism.

Similarly, when you have an advanced selector over a HAMT or other multi-block data structure the keys MUST be yielded in encoded order because you can’t re-order without reading all the keys from all the blocks into memory.

In other words, selector determinism for key ordering is “read them in the order they are encoded” and we are relying on consistent ordering from the data writers.

rvagg commented 4 years ago

Yes, that seems reasonable to me. Since the codec varies the hash and CID of an object then determinism at that level, as long as it's properly deterministic within that codec, should be enough. Then we just have to put some wording in selectors about leaning on this. :+1: good reasoning.

vmx commented 4 years ago

To me a strength of the IPLD Data Model is, that as long as your codec implements the full Data Model, it doesn't matter which codec you use, the system will behave the same. You can just switch to another codec. You can build libraries on top of the IPLD Data Model (like Selectors) and things will work. With this change, that's no longer true. We break that assumption (that I would guess some would expect, besides the spec saying it's not the case) for potential future codecs (we don't know if ever will happen) that might want to encode things different.

vmx commented 4 years ago

I would much prefer we specify the Data Model as having "stable order" and leave it at that.

Thanks @warpfork your comment makes a lot of sense (sorry to others I only now got the point (hopfully). So is the idea that we specify a stable order in our Data Model and we define which stable order it is in our Codecs. We can have as many Dag codecs as we like and there specify that it is one specific order. As long as you would use Dag codecs only, you'll have the same order, no matter which one you use.

In case what I'm saying is accurate I'm in full support for this.

warpfork commented 4 years ago

I really like the comment from the top of the issue:

We should add the details for each codec in those codec specs, but there should be a general requirement that some form of determinism is defined.

Idea: we should have a page (codecs/mixins.md? :laughing: ) that defines some of this stuff.

Then the page for dag-cbor could just have a short table:

property		comments
strict order	:heavy_check_mark:	using {link: ../mixins.md#sortingstyle-foo}

...etc

warpfork commented 4 years ago

Some of my thoughts on this are also shaped by the way I structured code in go-ipld-prime, so I should maybe talk about that a little.

In go-ipld-prime, the ipld.Node interface is essentially "the Data Model" as an interface. Accordingly, it has a MapIterator() method that returns an iterator, and so forth.

The first implementation of ipld.Node I wrote (still whimsically called ipldfree.Node) has no knowledge of codecs, at all. The encoding packages for dag-json and dag-cbor both interact with it purely through the iterator when serializing, and through ordered insertion to the corresponding NodeBuilder when deserializing.

There's an idea that we could build a cbor.Node implementation which has some optimized special paths that work particularly efficiently with CBOR encodings (at the cost of likely being trash, performance-wise, when used with literally any other codec). If such an implementation wanted to ignore insertion order and unconditionally internally store things in a canonically sorted order, I think that'd be fine... but it would still need to work without crashing or erroring when paired with another codec that simply reads it via the MapIterator. Such a pairing might mean the other codec needs to do a sort, yeah, and that's just a cost we'd accept in that situation, but it should be defined and it should work.

mikeal commented 4 years ago

I would much prefer we specify the Data Model as having "stable order" and leave it at that.

I like this language and will update the PR to reflect it. I’d like to define stable ordering in this doc as, and this is draft language: “consistent key ordering of a map regardless of how that map was created (does not vary by insertion order)”

rvagg commented 4 years ago

Just to focus in on one minor quibble @warpfork raised: are we prepared to rule out insertion order as the possible ordering heuristic a codec chooses and the current language seems to do that. Maybe a codec that's focused on encoding streaming data wants to retain insertion order and it has a way of ensuring that in a round-trip decode/encode too.

(key sorting should be consistent regardless of insertion order or language specific sorting)

How about just: "(key ordering should be consistent and stable, using rules specified by the codec and mechanisms in implementations as appropriate for each language)"

In most cases I think the codec will control the insertion order at the lowest layer. Like now, in (I think) all of our codecs in Go and JS, we hand off blobs for marshalling by the codec which picks it apart and feeds it through a coding mechanism. It's the "picking apart" step that we're trying to codify.

(Aside: I think it's worth noting that "insertion order" is much more of a JavaScript programmer obsession because of our Maps than it is for Go programmers who get YOLO ordering).

mikeal commented 4 years ago

are we prepared to rule out insertion order as the possible ordering heuristic a codec chooses and the current language seems to do that. Maybe a codec that's focused on encoding streaming data wants to retain insertion order and it has a way of ensuring that in a round-trip decode/encode too

There’s a tradeoff here no matter which way we go. Streaming a single map encode, to me, is not a high priority since it’s a difficult to use micro-optimization that only works on a single block which we already want to keep small for other reasons.

The problem with insertion order is that there’s no way to fully ensure determinism. Even if you do the work of preserving decode order for maps in languages that don’t preserve it, you still have the problem that two nodes who create the same new map would have to somehow agree on their insertion order in order to guarantee determinism. That’s a lot of heavy lifting, and is something we should try to ensure at the codec layer instead.

It’s hard for me to see insertion order ever being the easier thing to implement once you take into account the work across languages and these coordination problems between nodes.

I’m still in agreement that we shouldn’t define a specific ordering in the data model but I’m comfortable ruling out insertion order as “stable.” By definition it isn’t really stable between nodes without some other method of coordinating the insertion order.

rvagg commented 4 years ago

OK, I'm fine with that, just wanted to confirm that it had been properly considered. Will let @warpfork dissent if he still wants to.

warpfork commented 4 years ago

I still got a lotta beef. "regardless of insertion order" does not jive a single whit for me.

I don't think we've fully identified even half the relevant user stories here. We have a very fundamental desync on the meaning of "determinism" -- I don't think it can be well defined without defining "input" and "output" better than discussion here has; and, I think most of the time when we say "determinism" here, we've meant "convergence", which is a subtle but importantly distinct thing -- which we should get our heads around ASAP because it's super central and gonna keep coming up in this and other discussions. And @mikeal , I think you're throwing out some user stories that I'm not.

"single block which we already want to keep small for other reasons" is only true in some applications, such as the 1MB-ish limit in libp2p... and we do not, as a rule, want to admit those issues to core. For myself, I absolutely intend to also throw a gigabyte file of JSON at these libraries we're writing. And I do want to process that streamingly. A line that I am holding to here can be summed up as "if, given our core libraries, we can't build a jq-equivalent trivially, we have failed"; dismissing streaming operation is therefore a "goto fail", in my book.
"The problem with insertion order is that there’s no way to fully ensure determinism" I don't think this is the correct way to use the word "determinism": I'd like to create a wee little crowbar separation between "determinism" and "convergence". "Keep the order" is entirely deterministic. It's not convergent. Sorting is convergent and deterministic. Fostering convergence is good; but it's a gradient thing (more is better, less is less good -- less isn't death with spikes). Maintaining determinism is a hard line (less is death with spikes). But the logical implication "determinism is hard line"->"sorting is a hard line" does not hold, because sorting is sufficient but not necessary to determinism.
I think you might be missing the part of my argument where this is actually just plain the easiest way for the spec to encourage good implementations of the data model with reusable code: if the data model just transparently keeps order at all times, that makes codecs behaving correctly easier even if they're sorted, because the data model just stays out of it -- if something is handed to data model code as already sorted, it stays sorted, and you don't have to tell the data model code about that fact. By contrast, if data model code whimsically randomizes ordering, we need to sort every time we do output, which is not a joy. (And if the data model code does regard sort order, that can save maybe a little time when building new objects, but... well, no necessarily, and probably not in practice: unless you're using a linked list (there's enough literature on why those are Bad In Practice I'm just gonna handwave a 'no' to that), a single sort at the end will often go faster than a sort during insertion because all the memory is already allocated and filled and so the total numbers of swaps are less.)
Keeping "order is maintained" as the fallback definition is not optional! In case a codec doesn't define a strict ordering -- and if I define a codec for "json (any json, not dag-json)", it will indeed clearly not have a strict ordering; and like the 1GB file thing, I'm stating this as something I will do ;) -- then maintaining order is how we keep determinism even when convergence is off the table.
A fallback definition of order when codecs are not involved at all is also not optional! We can build data purely as a library in memory with no serialization involved, and libraries expose a concept of iterators. Iterators mean there will be some ordering. It must be defined. Therefore we need some definition. If the definition can't come from codecs, and we don't want to specify a single blessed sorting for the data model (already established), then... what? This needs to be defined, and the only option I see on the table is "maintain insertion order".

So, I'm sorry to be so staunchly unyielding here, but truly: I think we must not reject the idea of simply maintaining a handed-in order in the spec, or things collapse.

We can say "codecs should have sorted order" too at the same time and that's fine. But saying the data model must be sorted and may not simply persist a given order is just not going to end well for us. Let's use the words "stable". Stable is sufficient. Stable has all critical good properties, and permits all bonus good properties. Stable.

warpfork commented 4 years ago

I've been brewing on a longer exploration report-style writeup about this and the big-picture "determinism"-vs-"convergence" thing and I'll try to wrap that up and push it soon to give a little more zoomed out / principles perspective.

But to spoil some snippets:

Some User Stories

using IPLD for (human-written) config

reproducible builds indexes

filesystem ingesting

web page scraping and ingesting

stream processing and 'jq-like' usage

Extracted patterns from user stories

using the data model and a strict sorting codec

using the data model and a non-sorting codec

using the data model with two differently-sorting codecs

using the data model alone

using a schema (without codegen)

using a schema (with codegen)

using a schema plus user sourced map content

I identify at least seven different patterns of usage that have different constraints, some of which (as described in previous comment) staunchly reject any solution other than "maintain original order", and some of which provide interesting lenses for considering performance tradeoffs; and in the example user stories, there's a mix of "strictness is essential" and "sorting is irrelevant" and "strict sorting is murderously unusable".

I think we should be sure we've factored all these.

warpfork commented 4 years ago

Another thing I should note here, in furtherance to my concern that "determinism" can't be well defined without defining "input" and "output" better:

I don't consider golang's maps an "input", or at least certainly not a very good one that I'm going to define anything around. They're randomly ordered and there's nothing I can do about that. So, if those are my "input" then there's not much I can do, yeah.

So this is why the NodeBuilder interface exists, and it simply doesn't have any functions that would let you take a randomly-ordered/unordered thing and feed it directly in without making the caller make some choices here.

Similar issues probably exist in other languages but can be solved the same way.

When handling anything serialized, that kind of defacto ordering of a token stream always exists, so it's not a particularly unnatural thing to make a NodeBuilder interface consider.

mikeal commented 4 years ago

"single block which we already want to keep small for other reasons" is only true in some applications, such as the 1MB-ish limit in libp2p

I’ll try to be clearer.

As I see it, there’s really only one problem with block sizes and all the other problems descend from that problem. The problem is that a block is the smallest unit that is verifiably addressed.

A transport’s max block size is not some arbitrary limit they place for reasons we can discard when we don’t use that protocol. Those max sizes are how the transport deals with the fundamental problem. Even when you aren’t using that transport, you will still run in to situations in which you cannot have an arbitrarily large value and cannot simply accept data for an indefinite period of time without being able to verify it.

So yes, we should not design everything to the max block size of a particular transport but we can assume that all use cases will implement a max block size as they mature, we just can’t say with certainty what that size will be.

Given that some max size will be imposed, I’m skeptical that enough use cases will have a sufficiently large max size that their performance bottleneck will be streaming the serialization of keys. In the rare instance that this is your performance bottleneck, I think it’s reasonable for us to say that the way you should handle this is to use a custom non-native map in your code that inserts in stable order at insertion time so that you can then safely stream the key serialization.

"The problem with insertion order is that there’s no way to fully ensure determinism" I don't think this is the correct way to use the word "determinism": I'd like to create a wee little crowbar separation between "determinism" and "convergence". "Keep the order" is entirely deterministic. It's not convergent. Sorting is convergent and deterministic. Fostering convergence is good; but it's a gradient thing (more is better, less is less good -- less isn't death with spikes). Maintaining determinism is a hard line (less is death with spikes). But the logical implication "determinism is hard line"->"sorting is a hard line" does not hold, because sorting is sufficient but not necessary to determinism.

I think I clarified this in comments elsewhere but I’ll do it again here.

While it is certainly possible for two actors to find some way of aligning their insertion order, it’s difficult and problematic. If we do not specify insertion order as being unstable we will be deferring this problem to IPLD consumers and they will need to find their own consensus mechanism for aligning insertion order. That is what I’m trying to avoid. If you don’t agree and think it’s reasonable to push this up the stack then that is where our disagreement is. I have not been compelled enough by the possibility of use cases for insertion order as a stable ordering mechanism to change my view here.

if the data model just transparently keeps order at all times, that makes codecs behaving correctly easier even if they're sorted, because the data model just stays out of it -- if something is handed to data model code as already sorted, it stays sorted, and you don't have to tell the data model code about that fact. By contrast, if data model code whimsically randomizes ordering, we need to sort every time we do output, which is not a joy.

I think we may be lacking alignment on terminology.

We’ve stated a few times now that dag-cbor and, with caveats, dag-json are the only current “data model compliant” codecs. Our other codecs do their best to represent what is encoded as data model when de-serialized, but they aren’t fully data model compliant.

If you have a bunch of CBOR data encoded with whatever and you want to link to it, you can do so with a cbor codec and have little worry about ordering requirements. This order stability rule would only apply to data serialized with dag-cbor. I don’t think we even can require the de-serializers enforce this rule because it would end up baring streaming de-serialization.

I do not think that the codec should re-order keys it de-serializes. If possible, it should just fail if the ordering is not consistent with the spec. If it does not fail then the re-ordering would only happen during re-serialization. And before you say “that’s a mutation” let me say “so what?” If the ordering is indeterministic then nobody else can reproduce that block independently anyway!

I identify at least seven different patterns of usage that have different constraints

One thing that every user story for IPLD will have in common is that the data is addressed by hash. Any user story that is addressed by hash will have issues if the serialization is indeterministic, which is why we’re trying to handle these requirements at the serialization layer of IPLD.

There may be cases where you don’t care about consistency between languages or even applications written in the same language. But if you don’t care then doing this work won’t hurt unless the performance penalty is large enough to exclude your use case. The stable ordering that CBOR defines is certainly problematic in this regard but is it so bad that it would break or otherwise exclude a class of user stories?

mikeal commented 4 years ago

So, I'm sorry to be so staunchly unyielding here, but truly: I think we must not reject the idea of simply maintaining a handed-in order in the spec, or things collapse.

We can say "codecs should have sorted order" too at the same time and that's fine. But saying the data model must be sorted and may not simply persist a given order is just not going to end well for us. Let's use the words "stable". Stable is sufficient. Stable has all critical good properties, and permits all bonus good properties. Stable.

I just don’t find this argument compelling given the tradeoffs.

Will someone, some day, want to write their own sorting algorithm: sure. Is that important enough for us to allow for application specific sorting of our generic codecs and fail to provide these assurances to everyone else: absolutely not.

This falls far enough outside of the generic/default path that it’s perfectly acceptable to me that we say this needs to be an application specific codec. We’ve got a codec table with plenty of room, they are welcome to it ;)

warpfork commented 4 years ago

I have an application that does the following:

    nb := ns.NewBuilder()
    mb, err := nb.BeginMap(3)
    mb.AssembleKey().AssignString("z")
    mb.AssembleValue().AssignInt(1)
    mb.AssembleKey().AssignString("x")
    mb.AssembleValue().AssignInt(2)
    mb.AssembleKey().AssignString("xy")
    mb.AssembleValue().AssignInt(3)
    mb.Finish()
    return nb.Build()

Consider three cases:

I have no schema, and no codec, and am doing this only with the Data Model.
I have a schema, and it says struct { x int, xy int, z int }.
I have a codec, and it's dag-cbor.
Bonus: I have a codec, and it's something that's sorted like normal humans sort things, not like dag-cbor does.

When I use an iterator to iterate over key-value pairs, which order do you expect? When I serialize, which order do you expect? Are the answers the same in all cases? Are all cases even defined, if "maintain order" is outlawed? If one of these situations is undefined, how is that not a blocking problem?

mikeal commented 4 years ago

Consistent ordering requires that you prepare the keys to be serialized, either as you insert them or before you serialize. The ordering for dag-cbor would be defined by the codec.

If you’re taking an arbitrary Map-like object and serializing this means preparing the keys right before serialization, and this may involve an extra memcopy. If the API for inserting the keys is custom, and it knows the codec you’ll be using, you could keep the internal ordering consistent with the codec’s ordering rules as you insert them.

If you have a struct schema then it depends on the representation. If the representation is a map then the rules above apply.

This would actually be easier if we defined consistent ordering rules for all the data model codecs and didn’t have any variations, but I don’t know if we are prepared to do that.

rvagg commented 4 years ago

I've been trying to grok the difference here for the past few hours and just can't quite grasp it. It's making me feel very small-brained.

@warpfork perhaps you could propose an alternate paragraph in place of @mikeal's that expresses what you'd like to see re ordering?

Iterators mean there will be some ordering. It must be defined.

Why must it be defined? Same as with your ipld-prime example above, why do we need to define in the data model how the ordering is presented to the user? This is one point where I think language differences are getting in our way - Go's yolo []map ordering (which mean you're avoiding them entirely) vs JavaScript's just-throw-objects-at-it-bro approach to maps (which mean we hardly ever use proper Map, including with IPLD, it's all {} there and back again--which I think will need to change sooner than later fwiw). But even if JavaScript were to adopt the same style Node interface, what is the case for specifying ordering here and not making that a language concern? Perhaps it's implicit in Go that when interacting at the data model that insertion order is what you get when you iterate: on newly constructed Nodes you get the construction order of iterations, but if you do a cycle through dag-cbor and back again, the back again insertion order as it's instantiated into a Node applies which is the strict sorted algorithm order we already have there. In JavaScript it's whatever Object.keys() or Object.entries() gives you at either of those points, which will probably be the same as you have with Go (insertion order at creation and re-instantiation from a codec). And ..?

If we changed the paragraph to say "Data model compliant codecs must ..." would that change anything? We're really only talking about dag-cbor and dag-json here, what happens with json and cbor are a different matter and they are not data model compliant in themselves so wouldn't fall under this "must". But we could clarify that a little more.

We also need to work more on this convergence vs determinism business and clarify core principles that are guiding decisions like this so that when we make them, we can reference those principles that we already agree on. This might be a case for a team week sometime soon.

warpfork commented 4 years ago

Here's a fresh proposal for text I'd be happy with:

In libraries that expose iterators from data model interfaces, those iterators must have a stable order. No ordering is precisely prescribed, other than that it must be stable. This is consequential because we expect high level features like traversals and selectors to work in a predictable way, and this requires stable ordering.

When a map has been produced by a codec during deserialization, the codec provides an ordering. (A codec may either yield data in the serial order it encounters it; or, a codec may apply a sorting rule (this may or may not also include strict parsing; see TODO:link-to-codec-docs-for-this)).

When a map has an implementation with its own order preference (for example, a struct type implemented by generated code which is derived from an IPLD Schema has a natural order, and this order is necessarily determined at code generation time), it may use this ordering.

When a map has been produced in library code, using data model interfaces only (e.g. no codec has been involved; nor is there any other potential source of ordering such as an applicable schema), insertion order may be used as an ordering.

Note that codecs may also have sorting rules when serializing: TODO:link-to-codec-docs-for-this. This a concern limited in scope to the codec: the data model avoids specifying any particular sorting rules specifically so that we remain usable as an intermediary between multiple codecs which may have conflictory sorting rules. When a codec has sorting rules, it should enforce them at the codec boundary.

You'll notice an important effort I'm making here in this larger text is to identify what libraries are actually capable of doing in various situations.

(We could expand on this further with recommendations and remarks on efficiency and performance and how different implementation choices can relocate those costs to different parts of a pipeline of deserialization, operations, and serialization. Probably we'd want to do that in a larger document in another file, though, as the file we're currently in is supposed to be just a terse introduction to Kinds in the Data Model, not full advice for library implementors nor performance tips for users.)

I also think we should introduce as much as possible of this topic in documentation in codecs, as previously commented. Yes, even if that means introducing new (and unfortunately fairly skeletal) files. Accordingly, there are "TODO" links in my proposed text. If anything, I think my example here doesn't go far enough in moving the codec topics somewhere else -- it could be improved by further separating those.

We really need to get the distinction between Data Model concerns and Codec concerns right. To refresh ourselves on an important detail here, in case it's become glazed over or lost from focus, the path currently targeted by this diff is data-model-layer/data-model.md! -- and the sentences right above the current diff are talking about iteration in term of the Data Model. Everything around this is about the Data Model. A sudden change in topic to something that's not about Data Model semantics is problematically jarring at best; at minimum needs a massive amount of clarifying words about which concerns are relevant at which scopes; and ideally we should factor out as much as possible into other files connect with "seealso" links.

There's more that could be done here. There's minimal talk about streaming in the text I proposed, and there's minimal talk about how strictness should result in error handling, and there's minimal talk about support for deserializing-loosely-and-then-sorting... all of which absolutely also deserve discussion! But... all of which belong in text in the codec area.

warpfork commented 4 years ago

Interestingly, in a surprising bit of coincidental synchronicity, the Python community is also making a change to maps and ordering around the present.

They're going full order-preservation.

There's a plethora of comments on Hacker News about this (as well as elsewhere, I'm sure), but to cherry-pick two which resonate pretty strongly with me, as they contain language-independent user stories about the upsides of this:

https://news.ycombinator.com/item?id=22270131
- states the serialize(deserialize(serialize(obj))) case (! we're hardly the first to think of this)
- interesting user story about CLI args (order maint across transformation being a plus to systemic legibility)
https://news.ycombinator.com/item?id=22270167
- plain and simple legibility when debugging

Just food for thought. (I couldn't not link to it given the coincidental timing.)

mikeal commented 3 years ago

Closing for now, need to work on a better document about all of our determinism requirements.