Open LukeShu opened 1 year ago
@LukeShu note that the json.Unmarshal documentation mentions this:
When unmarshaling quoted strings, invalid UTF-8 or invalid UTF-16 surrogate pairs are not treated as an error. Instead, they are replaced by the Unicode replacement character U+FFFD.
As json.Unmarshal and json.Valid share underlying code, your example looks like expected behavior.
Perhaps there's room for improving the documentation?
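For reference, the documented Unmarshal behavior quoted above can be seen with a short sketch. The helper name decodeString and the input bytes are illustrative assumptions, not taken from the playground link in this issue.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeString is an illustrative helper: it unmarshals a JSON document
// that is expected to hold a single string value.
func decodeString(doc []byte) (string, error) {
	var s string
	err := json.Unmarshal(doc, &s)
	return s, err
}

func main() {
	// A quoted string whose contents (0xff 0xfe) are not valid UTF-8.
	doc := []byte{'"', 0xff, 0xfe, '"'}

	s, err := decodeString(doc)
	// Per the documentation, no error is reported; the invalid bytes
	// come back as the replacement character U+FFFD.
	fmt.Printf("%+q err=%v\n", s, err)
}
```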
I understand that Unmarshal and Valid share underlying code, but Unmarshal being permissive of invalid input doesn't mean that Valid should identify it as valid (at least without its own note in the documentation).
And I discovered this because of inconsistency between the various functions, when I expected that even if they have quirks they'd at least be consistent because of sharing the underlying implementation. I'd discovered that json.Unmarshal(any) → json.MarshalIndent() would convert the binary garbage to U+FFFD, while json.Indent() and json.Compact() would pass it through unchanged. I figured that differing behavior for invalid JSON was reasonable, but then was surprised to find that json.Valid() identified it as valid. (And that differing behavior, while surprising, I couldn't really say was a bug; while json.Valid() identifying it as valid is a clear bug to me.)
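A minimal sketch of the inconsistency described above, assuming a one-byte garbage payload. The helper names roundTrip and compact are made up for illustration; only the encoding/json calls are real.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// roundTrip decodes doc into an `any` value and re-encodes it with
// json.MarshalIndent, so the bytes pass through the decoder.
func roundTrip(doc []byte) ([]byte, error) {
	var v any
	if err := json.Unmarshal(doc, &v); err != nil {
		return nil, err
	}
	return json.MarshalIndent(v, "", "  ")
}

// compact reformats doc with json.Compact, which rewrites the raw
// bytes without decoding string contents.
func compact(doc []byte) ([]byte, error) {
	var buf bytes.Buffer
	err := json.Compact(&buf, doc)
	return buf.Bytes(), err
}

func main() {
	// A JSON string token containing the single invalid byte 0xff.
	doc := []byte{'"', 0xff, '"'}

	rt, err := roundTrip(doc)
	fmt.Printf("Unmarshal+MarshalIndent: %+q err=%v\n", rt, err)

	c, err := compact(doc)
	fmt.Printf("Compact:                 %+q err=%v\n", c, err)

	fmt.Printf("Valid:                   %v\n", json.Valid(doc))
}
```

The first path rewrites the byte as U+FFFD, the second leaves it untouched, and neither reports an error.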
CC @dsnet @mvdan
It's unfortunate that the json package allows invalid UTF-8, since RFC 8259, section 8.1 clearly states that JSON must be UTF-8.
Given that we already verbally promise handling of invalid UTF-8 in Unmarshal, we should also document how it operates in Valid.
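Until the behavior is documented or changed, callers who need the stricter reading could combine the two checks themselves. This is only a sketch: validStrict is a hypothetical helper, not part of encoding/json, and it does not catch unpaired surrogate escapes such as "\ud800" written in ASCII.

```go
package main

import (
	"encoding/json"
	"fmt"
	"unicode/utf8"
)

// validStrict is a hypothetical helper (not part of encoding/json).
// It reports whether doc is syntactically valid JSON and also
// consists entirely of well-formed UTF-8.
func validStrict(doc []byte) bool {
	return utf8.Valid(doc) && json.Valid(doc)
}

func main() {
	good := []byte(`"hello"`)
	bad := []byte{'"', 0xff, '"'} // invalid UTF-8 inside a string token

	fmt.Println(validStrict(good))
	fmt.Println(validStrict(bad))
}
```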
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (go env)?
What did you do?
I passed binary garbage wrapped in " quotes to json.Valid.
https://go.dev/play/p/rrtmrEL3Ipd
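A reproduction along these lines might look as follows. This is a sketch under assumptions, not the contents of the playground link: the garbage bytes and the helper names quoted and reportedValid are chosen for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// quoted wraps raw bytes in double quotes so they form a JSON string
// token. The raw bytes are assumed not to contain '"', '\\', or
// control characters below 0x20.
func quoted(raw []byte) []byte {
	doc := make([]byte, 0, len(raw)+2)
	doc = append(doc, '"')
	doc = append(doc, raw...)
	return append(doc, '"')
}

// reportedValid returns what json.Valid says about the quoted bytes.
func reportedValid(raw []byte) bool {
	return json.Valid(quoted(raw))
}

func main() {
	// Bytes that cannot be interpreted as UTF-8.
	garbage := []byte{0xff, 0xfe, 0xfd}
	fmt.Println(reportedValid(garbage))
}
```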
What did you expect to see?
As JSON is specified to be "a sequence of Unicode code points" (ECMA-404) or "encoded in UTF-8, UTF-16, or UTF-32" (RFC 7159), I would expect a JSON document containing bytes that cannot be interpreted as Unicode code points not to be considered valid.
What did you see instead?
It considered the document to be valid, even though it contains bytes that cannot be interpreted as Unicode codepoints.