ipld / specs

Content-addressed, authenticated, immutable data structures

DAG-JSON canonicalizing and whitespace #108

Closed warpfork closed 4 years ago

warpfork commented 5 years ago

We need a spec about handling of whitespace in dag-json encodings.

I recently ran into a fun situation. I have some code which emits JSON in pretty-printed format. This pretty printing includes a trailing line break, as is generally sensible for printing. This is hashed and stored.

With the trailing linebreak.

When loading this, my deserializer consumed all the JSON tokens, then reported it's "done" with the JSON. A hasher is fed via a TeeReader, so it sees exactly the same bytes that the deserializer consumed; and after the deserializer is done, the final hash is summed and then checked for integrity.

Which failed. Because the trailing linebreak wasn't consumed by my deserializer. So it wasn't rolled into the hasher. So of course the hash wouldn't match.

Ouch.
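
To make the failure mode concrete, here's a minimal sketch (not the actual code involved; it assumes Go's encoding/json and crypto/sha256, and a made-up payload). The decoder reports success at the closing brace while the trailing newline is still unconsumed, so a hash fed only the consumed bytes can never match the hash of the stored bytes:

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/json"
	"fmt"
	"io"
)

func main() {
	// Pretty-printed JSON with a trailing line break; this exact byte
	// sequence was hashed when the block was stored.
	stored := []byte("{\n  \"hello\": \"world\"\n}\n")
	storedSum := sha256.Sum256(stored)

	// Decode it. The decoder happily reports "done" at the closing '}'.
	dec := json.NewDecoder(bytes.NewReader(stored))
	var node map[string]interface{}
	if err := dec.Decode(&node); err != nil {
		panic(err)
	}

	// ...but the trailing "\n" was never consumed as a JSON token.
	leftover, _ := io.ReadAll(dec.Buffered())
	fmt.Printf("leftover: %q\n", leftover) // leftover: "\n"

	// A hash fed only the consumed bytes therefore can't match the hash
	// of the stored bytes.
	consumed := stored[:len(stored)-len(leftover)]
	consumedSum := sha256.Sum256(consumed)
	fmt.Println("hashes match:", consumedSum == storedSum) // false
}
```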

Probably the answer is simple: "don't emit with whitespace". But since I did, and now we're here...

What is our spec here? https://github.com/ipld/specs/blob/master/Codecs/DAG-JSON.md doesn't currently have much to say.

Should we be Postel'y: accept more inputs (e.g. with whitespace variation) than we emit? Perhaps also support a "--strict" mode (or, the contrary, a "--lax" mode) behind feature flags of some kind?

Can we phrase this in terms of some clear higher-level rules, like "The content as known by the Data Model layer plus the multicodec value must fully specify all bytes", or, the contrary, "Byte sequences are acceptable as long as they result in one full JSON object, and may contain leading or trailing bytes so long as they do not yield any JSON token (e.g. are whitespace)"?
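
As a sketch of what that second rule could look like in code (a hypothetical helper, not part of any spec or existing implementation; it assumes Go's encoding/json): accept the bytes only if they decode to exactly one JSON value and everything left over is whitespace.

```go
package dagjson

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

// decodeLax is a hypothetical helper illustrating the more permissive rule:
// the input must hold exactly one JSON value, but surrounding whitespace is
// tolerated -- and, crucially, still counts toward the bytes that get hashed.
func decodeLax(data []byte) (interface{}, error) {
	r := bytes.NewReader(data)
	dec := json.NewDecoder(r)

	// Leading whitespace is skipped by the decoder itself.
	var node interface{}
	if err := dec.Decode(&node); err != nil {
		return nil, err
	}

	// Anything after the value (whether buffered inside the decoder or left
	// unread in the source) must be whitespace only.
	rest, err := io.ReadAll(io.MultiReader(dec.Buffered(), r))
	if err != nil {
		return nil, err
	}
	if len(bytes.TrimSpace(rest)) != 0 {
		return nil, fmt.Errorf("trailing non-whitespace bytes after JSON value: %q", rest)
	}
	return node, nil
}
```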


(n.b. we probably want to keep this separated from A) CBOR and B) canonical ordering of map entries; and C) anything else at all; canonicalization is hard enough one spec detail at a time.)

Stebalien commented 5 years ago
  1. Yes, we should get rid of whitespace. We also need to fix the map ordering (see the sketch after this list).
  2. We should allow whitespace when decoding. That way we can put JSON objects into IPFS and it'll "just work" (assuming we swap the codec to "raw").
  3. We should almost-never (read never) actually hash dag-json. It's useful for interop but not something that should ever hit a datastore.
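
A sketch of what points 1 and 2 might look like together (hypothetical code, not the actual JS or Go implementation; it leans on the fact that Go's json.Marshal sorts map keys and emits no whitespace, and it ignores DAG-JSON's special handling of links and bytes): two differently-whitespaced encodings of the same Data Model content converge on the same canonical bytes, and therefore the same hash.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

func main() {
	// Two byte sequences describing the same Data Model content: one
	// pretty-printed with unsorted keys and a trailing newline, one compact.
	a := []byte("{\n  \"b\": 2,\n  \"a\": 1\n}\n")
	b := []byte(`{"a":1,"b":2}`)

	// Decode each, then re-emit canonically. json.Marshal sorts map keys and
	// emits no whitespace, so both inputs converge on the same bytes.
	canon := func(in []byte) []byte {
		var node map[string]interface{}
		if err := json.Unmarshal(in, &node); err != nil {
			panic(err)
		}
		out, err := json.Marshal(node)
		if err != nil {
			panic(err)
		}
		return out
	}

	ca, cb := canon(a), canon(b)
	fmt.Printf("%s\n%s\n", ca, cb)                      // identical output
	fmt.Println(sha256.Sum256(ca) == sha256.Sum256(cb)) // true
}
```
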
mikeal commented 5 years ago

I think we should hold off, and potentially remove our current dag-json specs. We need some more experimentation in this area before we try to write compatible implementations.

For instance, the current implementation uses a JSON serializer that avoids the issues you saw and enforces key sorting for consistency, but that’s not based on a spec so we’d have to document it. We’re also considering a completely different format that would give us much more flexibility and performance but would also make the block format not strictly a single JSON object.

warpfork commented 5 years ago

Fwiw, I'm perfectly on board with the idea of other encodings that look JSON-y but have other specifics. (Like, super on board, even.) But I don't think dag-json is going anywhere soon.

For example, if I'm reading right, Radicle is using dag-json quite concretely.

warpfork commented 5 years ago

We should almost-never (read never) actually hash dag-json

Isn't that literally what the multicodec bytes in CIDs are... for... tho

Stebalien commented 5 years ago

Isn't that literally what the multicodec bytes in CIDs are... for... tho

Not really. The multicodec byte exists so IPLD can support multiple (usually existing) IPLD formats. That doesn't mean we should encourage application developers to use a known-inferior format (e.g., DagJSON). It needs to work because we need a human readable IPLD format but that's about it.

warpfork commented 5 years ago

Okay, so, again,

Can we phrase this in terms of some clear higher-level rules, like "The content as known by the Data Model layer plus the multicodec value must fully specify all bytes", or, the contrary, "Byte sequences are acceptable as long as they result in one full JSON object, and may contain leading or trailing bytes so long as they do not yield any JSON token (e.g. are whitespace)"?

I'm writing this code^W^W^W have written this code, and it's going to have a behavior, so I'm not really interested in extemporizing that doesn't define any behavior. I'd like to discuss what that behavior should be.

vmx commented 5 years ago

Sorry for not answering your question, but I would also say that dag-json shouldn't even be a thing. I would rather specify a "JSON representation for the IPLD Data Model". That representation can be used to pretty-print things and as input, but it won't ever be hashed.

Stebalien commented 5 years ago

I'd like to discuss what that behavior should be.

If we hash it, we should strip all whitespace and sort all keys. However, I wouldn't mandate it.

mikeal commented 5 years ago

I would rather specify a "JSON representation for the IPLD Data Model".

This is what we were trying to do with the “canonical JSON representation”, but we threw it out for a bunch of reasons.

If we hash it, we should strip all whitespace and sort all keys. However, I wouldn't mandate it.

This is what the JS implementation did, but it’s not properly specified anywhere, it’s pretty much just an implementation decision.

Agree on not mandating; it’s hard to imagine we even could sufficiently mandate something like this. We can be diligent in our own implementations to ensure consistency, but given that codecs like this are going to be written against existing encoders/decoders, we shouldn’t expect 100% consistency unless there is already 100% consistency across those encoders/decoders. JSON is widely available and also somewhat inconsistent. We can advise people on best practices, but since nobody, not even us, is going to invalidate blocks based on these inconsistencies, there isn’t much more we can do.

but I would also say that dag-json shouldn't even be a thing

A few things.

  1. On a long enough timeline, someone will write a dag-json. If we have concerns about it we should resolve them by writing the initial implementations ourselves.
  2. In the browser, JSON has one key advantage other formats won’t be able to overcome: it doesn’t require any dependencies.
  3. The current CBOR implementation for JS is awful to use. The way the memory allocation works makes its usability much worse than something we could build with JSON. Either we’ll get a much better CBOR or we’ll get a dag-json people prefer 🤷🏽‍♀️
vmx commented 5 years ago

I would rather specify a "JSON representation for the IPLD Data Model".

This is what we were trying to do with the “canonical JSON representation”, but we threw it out for a bunch of reasons.

I have a different understanding of what we tried in the past. I had the impression we tried to find a way to make it easy to put JSON data you already have into IPLD. What I propose here is JSON as an exchange format: it might not be that nice to use and will have restrictions, but it should properly represent the data model. Think about it as "XML for IPLD".