kurrik / json

JSON parser in Go
Apache License 2.0

Failure in decoding UTF-16 surrogate pairs #1

Closed samuel closed 11 years ago

samuel commented 11 years ago
work:json(master) samuelks$ go test
--- FAIL: TestCases (0.00 seconds)
    compat_test.go:194: string
    compat_test.go:195: string
    compat_test.go:197: Decode: [239 191 189 239 191 189]
    compat_test.go:199: Expected: [240 157 132 158]
    compat_test.go:202: Problem decoding 'String with small-U encoded multibyte UTF-8' Expected: , Got 
FAIL
exit status 1

It seems that strconv.Unquote calls utf8.DecodeRuneInString, which returns RuneError ('\uFFFD') for each half of the surrogate pair, since it doesn't understand them. encoding/json avoids this by not using strconv.Unquote and handling the unquoting itself. There's a comment in json/decode.go, "The rules are different than for Go, so cannot use strconv.Unquote", which may just be about Unicode support.
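For illustration, here is a minimal sketch of what correct handling of the pair looks like. It is not this repo's code: combineSurrogates and decodeJSONString are hypothetical helper names, and the sketch relies only on the standard unicode/utf16 and encoding/json packages. Combining the two halves with utf16.DecodeRune gives U+1D11E, whose UTF-8 bytes match the "Expected" line in the test output above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"unicode/utf16"
)

// combineSurrogates is a hypothetical helper: it joins a high and a low
// UTF-16 surrogate into one code point and returns it as a UTF-8 string.
// utf16.DecodeRune returns U+FFFD if the two values are not a valid pair.
func combineSurrogates(hi, lo rune) string {
	return string(utf16.DecodeRune(hi, lo))
}

// decodeJSONString is a hypothetical helper: it decodes a quoted JSON
// string using encoding/json, which does its own unquoting.
func decodeJSONString(quoted string) (string, error) {
	var s string
	err := json.Unmarshal([]byte(quoted), &s)
	return s, err
}

func main() {
	// The pair from the failing test case.
	fmt.Println([]byte(combineSurrogates(0xD834, 0xDD1E)))

	s, err := decodeJSONString(`"\uD834\uDD1E"`)
	fmt.Println([]byte(s), err)
}
```

Both lines should print the byte sequence [240 157 132 158], which is the value the test expected.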

I would look into fixing this, but I haven't run into it in the wild, just in the unit tests :)

kurrik commented 11 years ago

Hm, I wonder if this changed in Go 1.1, as I'm pretty sure these all passed previously. Thanks for the heads up.

kurrik commented 11 years ago

Verified that this passes in go1.0.2

kurrik commented 11 years ago

I think I had grabbed the example from http://www.fileformat.info/info/unicode/char/1d11e/index.htm, which lists the C/C++/Java source code as "\uD834\uDD1E".

From http://en.wikipedia.org/wiki/UTF-8 "According to the UTF-8 definition (RFC 3629) the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) are not legal Unicode values, and the UTF-8 encoding of them is an invalid byte sequence and thus should be treated as described above."

So I think the test case itself is incorrect and Go 1.1 probably got stricter. I'll remove the test and close this issue.