
block-layer/codecs/dag-json: hash inconsistency possible via unicode escapes #289

Open · mvdan opened this issue 3 years ago

mvdan commented 3 years ago

Take this working example: https://play.golang.org/p/NxMbmIIa6yk

"15\u00b0C" and "15°C" are both valid JSON encodings of the string 15°C, and they are different byte sequences that would result in different hashes. The dag-json spec should probably force strings to always encode in a single canonical way, to prevent this kind of issue. That would likely require some form of utf-8 normalization, plus strictly defining when unicode escapes can be used.

rvagg commented 3 years ago

So the state of play with these codec specs is varying degrees of strictness, combined with whatever our personal inclinations are for dealing with the lack of it, and dag-json sits at the very liberal end. dag-pb is, I think, the most strict; it has essentially no room to move (it's also less useful as a generic codec, which is partly about that same lack of flexibility). dag-cbor is in the middle, toward the strict-ish end, and we've paid more attention to trying to make it stricter.

However, all of our efforts to date have been on the encode side, with no validation or strictness on the decode side. So we can write all the specs we want about how dag-cbor should be a subset of cbor and how dag-json should adopt very particular opinions about how to do json, but we'll still accept regular sloppy cbor and json in all our systems if we're told it's dag-cbor or dag-json. That's not great, but opinions vary on how much it matters.
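For what it's worth, one cheap way to get decode-side strictness is a round-trip check: parse the input, re-encode it canonically, and reject the block if the bytes differ. A sketch of that idea, where strictDecode is a hypothetical helper and stock encoding/json stands in for a real canonical dag-json encoder:

```go
package dagjson

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// strictDecode rejects any block whose bytes are not exactly what the
// canonical encoder would have produced for the decoded value.
func strictDecode(data []byte) (interface{}, error) {
	var v interface{}
	if err := json.Unmarshal(data, &v); err != nil {
		return nil, err
	}
	canonical, err := json.Marshal(v) // stand-in for a canonical encoder
	if err != nil {
		return nil, err
	}
	if !bytes.Equal(canonical, data) {
		return nil, fmt.Errorf("block is not in canonical form")
	}
	return v, nil
}
```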

I've been chasing down dag-cbor with the hope of making it more ideal, and I have strong opinions about strictness. I've got a half-complete JS CBOR encoder/decoder that deeply embeds these ideas of strictness (I'll finish that off one day ..). dag-json is in more of an "it's nice to have another codec that can represent the data model, but we don't recommend it for serious things" category, so we haven't been as focused on perfecting it. But we could put more effort in; I think we all just recognise how difficult a job it is, this issue being just one example.

I'd suggest that you go ahead and open a PR with what you think would be a good way to specify boundaries around string encoding for JSON, accepting that many (most?) instances of our code interacting with dag-json data are going to use third-party libraries that will do their own thing regardless.

mvdan commented 3 years ago

> I'd suggest that you go ahead and open a PR with what you think would be a good way to specify boundaries around string encoding for JSON.

How about http://gibson042.github.io/canonicaljson-spec/? We could piggyback off it, since it seems to cover everything we already specify, plus more, including strings.
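For strings in particular, the flavour of rule that spec implies looks roughly like the sketch below (a rough illustration, not a verbatim transcription of canonicaljson-spec; canonicalString is a hypothetical name): escape only what JSON forces you to escape, and emit everything else as literal UTF-8, so each string value has exactly one valid encoding.

```go
package dagjson

import (
	"fmt"
	"strings"
)

// canonicalString encodes s under one fixed rule set: escape only '"',
// '\', and control characters below U+0020, and emit everything else
// as literal UTF-8. Under such a rule, canonicalString("15°C") can
// only ever produce "15°C", never "15\u00b0C".
func canonicalString(s string) string {
	var b strings.Builder
	b.WriteByte('"')
	for _, r := range s {
		switch {
		case r == '"':
			b.WriteString(`\"`)
		case r == '\\':
			b.WriteString(`\\`)
		case r < 0x20:
			fmt.Fprintf(&b, `\u%04x`, r) // lowercase hex escapes for controls
		default:
			b.WriteRune(r)
		}
	}
	b.WriteByte('"')
	return b.String()
}
```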

rvagg commented 3 years ago

lgtm, although I don't know in what ways it varies from what we have now in our two primary implementations. We'd want to make sure we're not introducing rules that we're already breaking in common cases, because (a) there's data in the wild already and (b) we'll have to work to harmonise the implementations, and deviating from whatever harmony we have now will just be frustrating.

We did something similar with CBOR, piggybacking most of the strict-CBOR recommendations in the spec itself, so we're not doing anything particularly wild or divergent from others who care about canonical forms. Maybe a good exercise would be to check what the JS and Go implementations currently do, see how they diverge from that Canonical JSON spec, and work out what we can pull in wholesale and what we have to reject.
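As one concrete example of the kind of divergence to look for: Go's stock encoding/json HTML-escapes <, >, and & by default, which a minimal-escaping canonical rule would presumably forbid (this is about the standard library's defaults, not necessarily what the Go dag-json implementation does today):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

func main() {
	// Default behaviour: <, >, and & are escaped for HTML safety.
	out, _ := json.Marshal("a<b>&c")
	fmt.Println(string(out)) // "a\u003cb\u003e\u0026c"

	// Turning that off gets closer to a minimal-escape canonical form.
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	_ = enc.Encode("a<b>&c")
	fmt.Print(buf.String()) // "a<b>&c"
}
```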

mvdan commented 3 years ago

I agree. This will require a small amount of research, though I don't expect any real discrepancies. It feels non-urgent, so I'll just let the issue sit here for now.