
DagCBOR: tighten up spec, additional strictness requirements #236

Closed · rvagg closed this 4 years ago

rvagg commented 4 years ago

Ref: https://github.com/ipld/specs/issues/227

It's already pretty close to what we've been talking about. This is mostly just clarification, and some additional fluffing with a mind to publishing it on specs.ipld.io too.

rvagg commented 4 years ago

Pushed an update to clarify "small integer encoding" rule and added one for floats, but I'm much more uncertain about that one. @warpfork I wouldn't mind your thoughts on https://github.com/ipld/specs/pull/236#discussion_r370488893, whether we go for "shortest possible maintaining accuracy" or "everything in a single size (64-bit)".

rvagg commented 4 years ago

Quick survey: it looks like we already have a mismatch between the two languages:

vmx commented 4 years ago

whether we go for "shortest possible maintaining accuracy" or "everything in a single size (64-bit)".

I would go for "shortest possible maintaining accuracy". As the IPLD Data Model only specifies a Float kind, it makes sense to use the smallest representation; otherwise it would be more like a de-facto Float64 (for the DagCBOR encoding, which is currently the one we care about most). A fixed size would also be easier to express with some annotation, so I'd make that a possible Schema opt-in (though I'm not really sure we want that).

rvagg commented 4 years ago

So, the float thing: to figure out the size, an encoder first encodes a float at 16 bits, then decodes it and compares to the original; if it differs, the encoder moves to 32 bits and does the same check, finally landing on 64 bits if that still doesn't match.
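In Go terms, that loop would look something like this (a sketch, not code from either library; the binary16 check is hand-rolled since Go has no native float16 type):

```go
package main

import (
	"fmt"
	"math"
)

// fitsFloat16 reports whether f is exactly representable as an IEEE-754
// binary16 value. Hand-rolled: Go has no float16, so we test the exponent
// range and significand width directly.
func fitsFloat16(f float64) bool {
	if math.IsNaN(f) || math.IsInf(f, 0) || f == 0 {
		return true
	}
	frac, exp := math.Frexp(math.Abs(f)) // |f| = frac * 2^exp, frac in [0.5, 1)
	switch {
	case exp > 16: // beyond binary16's largest finite exponent
		return false
	case exp >= -13: // normal range: 11 significand bits available
		return frac*2048 == math.Trunc(frac*2048)
	case exp >= -24: // subnormal range: precision shrinks with the exponent
		scaled := frac * math.Ldexp(1, exp+24)
		return scaled == math.Trunc(scaled)
	default: // below binary16's smallest subnormal
		return false
	}
}

// shortestFloatWidth is the encode-decode-compare loop described above:
// try 16 bits, then 32, and land on 64 only if nothing shorter is exact.
func shortestFloatWidth(f float64) int {
	if fitsFloat16(f) {
		return 16
	}
	if float64(float32(f)) == f { // a 32-bit round-trip preserves the value
		return 32
	}
	return 64
}

func main() {
	fmt.Println(shortestFloatWidth(0.5))  // 16
	fmt.Println(shortestFloatWidth(1e10)) // 32
	fmt.Println(shortestFloatWidth(0.1))  // 64
}
```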

In theory, I think we're dealing with "proper" IEEE-754 in both JavaScript and Go, and likely on other platforms interacting with IPLD, so the calculations should work out the same in all of them. My faith in JavaScript's numbers is limited, though; there's always something surprising to find when you dig deep enough.

We need to lock something in here (I think... correct me if you see a reason we could be lax on floats but not on everything else?), and when we do, we'll have to change either refmt or borc to conform. Along with that we really should develop a solid test suite for floats, with edge cases testing the boundaries to make sure our assumptions hold.

I don't feel strongly either way tbh; it's a trade-off between wasted space and extra encoding time for every float.

rvagg commented 4 years ago

re map key sorting

borc does length-then-byte-order sorting of map keys: https://github.com/dignifiedquire/borc/blob/b6bae8b0bcde7c3976b0f0f0957208095c392a36/src/encoder.js#L351-L355 & https://github.com/dignifiedquire/borc/blob/b6bae8b0bcde7c3976b0f0f0957208095c392a36/src/utils.js#L139-L152

refmt does the same thing I believe: https://github.com/polydawn/refmt/blob/474b602855ed928fc5bb3448bd0dae4a6fe78d96/obj/marshalMapWildcard.go#L142-L148

So they're both following the suggestion in the CBOR spec; we may as well specify that here, even though it is ugly.
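For reference, the comparison both libraries implement boils down to something like this (a Go sketch over already-encoded key bytes, not code from either library):

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// sortCanonical orders encoded map keys shorter-first, breaking length
// ties with a byte-wise comparison -- the ordering RFC 7049 §3.9 suggests
// for canonical CBOR.
func sortCanonical(keys [][]byte) {
	sort.Slice(keys, func(i, j int) bool {
		if len(keys[i]) != len(keys[j]) {
			return len(keys[i]) < len(keys[j])
		}
		return bytes.Compare(keys[i], keys[j]) < 0
	})
}

func main() {
	keys := [][]byte{[]byte("bb"), []byte("a"), []byte("ba")}
	sortCanonical(keys)
	fmt.Printf("%s\n", keys) // [a ba bb]
}
```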

warpfork commented 4 years ago

On the floats thing: call me a paranoid old man, but I... don't trust anyone's float implementations. I would go so far as to say I'm not even convinced IEEE-754 was a good idea.

Floating point calculations have been known to vary on CPUs ~because of bugs in silicon that went undetected (or possibly, and what may even be worse if true: were detected but filed as unremarkable and "WONTFIX" at manufacture time) until too long after shipping.~ (wow, I did more research) because of arguably-defensible logic about some chips providing more precise numbers, at extra bit sizes, in intermediate calculations ("Extended Precision" -- apparently this was even recommended, though not specified, by IEEE-754; cool cool cool).

Fun anecdata: such divergent behavior in silicon ended up so widely deployed that for some years malware used it to detect whether it was running on a virtualized host: real hosts would do floating point math "wrong"; virtual machines would do it right. This worked on such a wide deployment of real hosts for so long that it was actually a viable strategy for malware to opt out of working on hosts that did FP math right, as a way of avoiding capture and study.

In Java, we can also see clear evidence of this bugaboo in the existence of the strictfp keyword built into the language itself, because Java's earliest designers and builders found the floating point reliability of CPUs to be a threat to the "write once, run anywhere" mantra. (They eventually relented and made the default loosey-goosey, for speed reasons, but then added the strictfp keyword as an opt-in.)

I bring up these anecdotes and examples because I find it's entirely too possible to go "ohhh, pfffff -- sure floats are sketchy in theory, but no one ever notices that in practice..." Yes. Yes we do notice it in practice. It's very real, and it's absolutely not a thing relegated to distant forgettable history. If only.

Now, I think, to the best of my understanding and reading of this, that comparison is the one operation that should be unequivocally safe from all this. But honestly? I'm not sure.

Floats terrify me.

warpfork commented 4 years ago

we really should develop a solid test suite for floats, with edge cases testing the boundaries to make sure we're right in our assumptions.

it's a trade-off between wasted space and extra encoding time for every float.

Yeah, we'd definitely need fixtures o'plenty for this.

It's also going to be extra decoding time for every float, I'm afraid. Per the same logic about strictness elsewhere, we should probably be rejecting floats encoded with too much precision.
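Decode-side, that check could be as simple as this (a sketch reusing the hypothetical shortestFloatWidth helper from the earlier comment; not actual refmt or borc code):

```go
// strictFloat rejects a float that arrived in a wider encoding than the
// "shortest possible" rule allows, e.g. a 64-bit header carrying a value
// that round-trips at 32 bits. shortestFloatWidth is the helper sketched
// in the earlier comment.
func strictFloat(f float64, encodedWidth int) (float64, error) {
	if w := shortestFloatWidth(f); w < encodedWidth {
		return 0, fmt.Errorf("non-canonical float: encoded at %d bits, representable in %d", encodedWidth, w)
	}
	return f, nil
}
```

The nightmare scenario is exactly two implementations disagreeing about what shortestFloatWidth returns for some value.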

Although I'll readily admit the idea of "rejecting floats encoded with too much precision" sounds like it's going to be an absolute usability nightmare in practice. If there does turn out to be the slightest disagreement about precision and rounding in code in the wild, the bugs will be beyond bizarre: only some libraries on only some physical machines on only some data samples may suddenly have show-stopping issues. Blegh; the lengths I'd go to in order to avoid this are substantial.

This is also something that's probably going to make reuse of existing serialization libraries in various languages harder. I'd bet a shiny nickel we'll more or less have to fork the CBOR decoder in many cases in order to express strictness on this -- look at the line numbers we're looking at already in borc and refmt; they're not at the shallow end of the libraries! Maybe this is just something we swallow and accept, but uffdah.

I... I don't know what to do with this.

There's so much "between a rock and a hard place" with floats that I actually want to tentatively float (heh) the idea that floats don't even belong in the Data Model, because they wreck so many things and are so dangedly impossible to make practical guarantees about. Or, if we keep them, we put incredibly loud warnings around them: convergence of hashes of objects containing floats is weakly guaranteed at best, etc. Essentially, we'd leave floats in so that they're there for rapid prototyping, but beyond that the answer to any float-related problem becomes "don't use them", and we try to get people to move towards more intentional stances around numbers instead.

rvagg commented 4 years ago

I did already raise the possibility of an advanced layout to represent floats using alternate mechanisms; that was only half a joke.

I share your concern with floats, to the degree that unless we're going to just say "all in 64-bits", I'm not entirely comfortable codifying "smallest possible" here until we get some solid tests in place to confirm that we can achieve what we think we can.

Next steps: I'll try and tighten this all up but may push the floats thing to "unresolved pending coding & testing work".

The CBOR recommendation for "smallest possible" on floats does make me scratch my head a bit given the length-first key ordering recommendation. That one seems like a performance optimisation, because length comparison is so cheap. But it's the opposite with floats: this nasty encode+decode+compare+repeat work is expensive. Maybe it's a reflection of how common they expect each data type to be in CBOR?

Also not touched on in the float discussion is JavaScript's nasty whole-number problem for round-trips: a 1234.0 will become a 1234. We can probably handle that once schemas get involved, but without metadata carried along with a decoded object there's no way to tell the two apart.

ribasushi commented 4 years ago

A 1234.0 will become a 1234

@rvagg @warpfork Have we considered not implementing floats at all on the wire, and instead have a struct with 2 integers? Similar to: https://github.com/ipfs/specs/blame/3816b05747e/UNIXFS.md#L99-L113

Yes it is wordy and convoluted. But there is no way to screw it up on the wire.
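Something like this, if we went with a mantissa/exponent pair rather than UNIXFS's seconds/fraction split (a Go sketch; the type and field names are hypothetical, and NaN/infinity handling is omitted):

```go
package main

import (
	"fmt"
	"math"
)

// ExactFloat carries a float on the wire as two integers:
// value = Mantissa * 2^Exponent.
type ExactFloat struct {
	Mantissa int64
	Exponent int64
}

func fromFloat64(f float64) ExactFloat {
	frac, exp := math.Frexp(f) // f = frac * 2^exp, |frac| in [0.5, 1)
	// frac * 2^53 is an exact integer for any finite float64.
	return ExactFloat{Mantissa: int64(frac * (1 << 53)), Exponent: int64(exp - 53)}
}

func (e ExactFloat) Float64() float64 {
	return math.Ldexp(float64(e.Mantissa), int(e.Exponent))
}

func main() {
	e := fromFloat64(0.1)
	fmt.Println(e, e.Float64() == 0.1) // {7205759403792794 -56} true
}
```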

rvagg commented 4 years ago

@ribasushi yep, considered, and as @warpfork said, maybe even remove them from the Data Model! A bit extreme, but maybe we'll keep banging our heads against this one enough that when we (inevitably) build our own coding format to evolve past CBOR, we ditch native floats entirely.

And also, an advanced layout could do this transparently (struct of two ints, a string, some other binary encoding) so maybe we just work on that and encourage its use for much win.

ribasushi commented 4 years ago

@rvagg so then... can't the initial DM spec just say "floats currently unsupported" ?

rvagg commented 4 years ago

@ribasushi I have a feeling that the majority of programmers will look at that and think we're not taking this seriously. Maybe people who have experienced significant float pain and have worked around it in novel ways will see through it, but mostly people are going to say "well, this is not usable for real-world data then!".

A large part of my background is in animal genetics, dealing with gobs of data that is primarily floating point, where increased accuracy yields marginal returns. Most of the test data I could come up with is float-heavy, and without a primitive or an existing off-the-shelf tool to encode them, I'm probably not going to bother!

@vmx do floats make much of an appearance in GS data or do they deal with that pain in other ways already?

My inclination right now is to:

  1. figure out the limits we're dealing with across languages with this CBOR question
  2. litter our specs (data model & codecs) with disclaimers about floats and what they can do to stability and determinism
  3. work on tools to give users a way out -- perhaps an ADL, perhaps just a spec for a codec-independent float representation format that has a little bit of extra user-work (perhaps just simple JS & Go libraries that take two ints and give you what you need).
  4. consider a future possibility of removing floats from the data model, or at least deprecating their direct use

ribasushi commented 4 years ago

@rvagg what you are saying re

majority of programmers will look at that and think we're not taking this seriously

is unfortunately somewhat true :/ More specifically: folks who think they need floats far outweigh the folks who know they need floats and can't help it any other way.

However:

consider a future possibility of removing floats from the data model, or at least deprecating their direct use

^^ this is unlikely to happen due to the nature of protocols in general, and our context of "promised immutable" in particular. There needs to be a clear understanding and acceptance that anything that is put in any of the PL specs is effectively there forever. Better options can appear in the future and be very strongly encouraged, but removal is not really possible.

Making this tradeoff (we ship something suboptimal now, knowing it will stick) is perfectly fine, and I'd encourage it myself. What worries me is a decision made under the false premise of "we can remove X later".

rvagg commented 4 years ago

I put together a JavaScript and Go DagCBOR side-by-side to look at edge cases and test differences https://github.com/rvagg/ipld-dagcbor-floats

Needs more experimentation; I'd love to test whether there are edge cases that push the limits of the IEEE-754 implementations. So far the only differences come down to the shortest-encoded vs everything-64-bit choice.
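The sort of boundary fixtures I have in mind (expected widths assume the "shortest possible" rule; values hand-derived, not taken from the test repo):

```go
package fixtures

import "math"

// Values straddling the binary16/binary32 representability limits,
// paired with the smallest IEEE-754 width that round-trips each exactly.
var floatFixtures = []struct {
	value float64
	width int
}{
	{0.0, 16},
	{0.5, 16},
	{65504.0, 16},               // binary16's largest finite value
	{65505.0, 32},               // one above it: not exact at 16 bits
	{65536.0, 32},               // overflows binary16's exponent range
	{3.4028234663852886e38, 32}, // binary32's largest finite value
	{0.1, 64},                   // not exact at any shorter width
	{5e-324, 64},                // float64's smallest subnormal
	{math.Inf(1), 16},           // infinities fit every width
}
```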

rvagg commented 4 years ago

@vmx it would be good if you could consider how difficult these strictness rules are going to be within Rust / serde. It looks like you have quite a bit of control over types and can make some of these decisions during encoding but the defaults have a way to go to get to strictness: https://github.com/vmx/rust-ipld/blob/master/dag-cbor/src/lib.rs

With the following strictness rules, how hard will it be with Serde to (a) encode as the right type and (b) reject encoded data that doesn't meet the strictness requirement:

If we have the opportunity to make it less painful for Rust while we're making these decisions then we should at least consider it before committing.

vmx commented 4 years ago

@vmx it would be good if you could consider how difficult these strictness rules are going to be within Rust / serde.

I haven't had a look, but all those things should be possible. I should also mention that I'd like to move away from Serde anyway, so this shouldn't really be a blocker on the decisions about floats.

vmx commented 4 years ago

@ribasushi I have a feeling that the majority of programmers will look at that and think we're not taking this seriously. Maybe people who have experienced significant float pain and have worked around it in novel ways will see through it, but mostly people are going to say "well, this is not usable for real-world data then!".

I share this concern. Even knowing the issues with floats, I probably also wouldn't look into a system for encoding data that doesn't easily support them; I would think of it as an "esoteric, theoretically nice system no one will use". I would argue that people who don't know much about floats implicitly think of IEEE-754 floats. They "just work" in every language, and if you hit issues, they're the same issues everywhere. I personally have rarely had issues with floats in the real world. IEEE-754 floats are the standard. Numbers in JSON are also very broken, yet we don't encode them as strings in DagJSON (probably to keep things practical); we should take a similarly practical approach for the IPLD Data Model.

Another point: I think IPLD should also be useful without Schemas, and without floats it isn't. Though we can (and should) discuss a decimal type at the Schema level, just as we do with non-string keys for maps.

@vmx do floats make much of an appearance in GS data or do they deal with that pain in other ways already?

In the geospatial world everything is floats. The most widely used format is GeoJSON, which doesn't go into detail about floats, and I haven't seen any GeoJSON libraries that do anything special with them. A very old de-facto standard is Shapefile, where floating point numbers are stored as strings (I didn't know that before). I should also mention GeoPackage, which is basically just a SQLite file and hence uses IEEE-754.

rvagg commented 4 years ago

I've split the floating point issue out into a separate section under "Strictness" titled "Floating Point Encoding (Unresolved)"; we'll figure that one out a bit later rather than block this PR on it. I've also added an "Implementations" section at the bottom with notes on where they're at. Let me know if you think this doesn't belong, I'm not attached to it.

Ready for final reviews (please).

warpfork commented 4 years ago

majority of programmers will look at that and think we're not taking this seriously

is unfortunately somewhat true :/ More specifically: folks who think they need floats far outweigh the folks who know they need floats and can't help it any other way.

Yup. This.

This is why they're in there. Both for some practical bridging reasons and for some *cough* practical cultural evolutionary truths.

It's just a question of how many and how loud of warnings we can put around them.


consider a future possibility of removing floats from the data model, or at least deprecating their direct use

^^ this is unlikely to happen due to the nature of protocols in general, and our context of "promised immutable" in particular.

I think an "eternally deprecated" (but never removed) approach is basically the tone to aim for.

(Just as an aside, because this word has come up in a couple of issues lately, and I'd like to clear some air around it -- the definition of deprecate is "discourage... without completely removing it or prohibiting its use" (from Wikipedia) or "to express disapproval of" (from dictionary.com, particularly the "usage note" section). The word "deprecate" does not denote removal, and it certainly doesn't imply a timeline; it may carry that connotation sometimes, but without More Words stating something about e.g. "slated for removal", it really shouldn't be assumed.)


also, an advanced layout could do this transparently (struct of two ints, a string, some other binary encoding)

Oooooooh. I'd... actually never thought of that! That's an adventurous idea and I'd love to see how it shakes out.

I was gonna say this is one where I'd be happy to see us just put a big block of docs with a worked example (e.g. doing the struct and all), but then leave people to it. But making an advanced layout for it is also a very interesting idea I'd kinda like to see.


Wholly on board with pushing this floats stuff to a separate section and marking it pending research. :+1: