golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.29k stars 17.58k forks source link

encoding/json: document Valid allows invalid utf8 #58517

Open LukeShu opened 1 year ago

LukeShu commented 1 year ago

What version of Go are you using (go version)?

$ go version
go version go1.20 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/lukeshu/.cache/go-build"
GOENV="/home/lukeshu/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/lukeshu/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/lukeshu/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.20"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/run/user/1000/tmpdir/go-build2160073961=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I passed a binary garbage wrapped in " quotes to json.Valid.

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    jsonStr := "\"0\x85\xcd\xc0\xf3\xcb\xc1\xb3\xf2\xf5\xa4\xc1\xd40\xba\xe9\""
    fmt.Println(json.Valid([]byte(jsonStr)))
}

https://go.dev/play/p/rrtmrEL3Ipd

What did you expect to see?

As JSON is specified to be "a sequence of Unicode code points" (ECMA-404) or "encoded in UTF-8, UTF-16, or UTF-32" (RFC 7159) I would expect that a JSON document containing bytes that cannot be interpreted as Unicode codepoints to not be considered valid.

What did you see instead?

It considered the document to be valid, even though it contains bytes that cannot be interpreted as Unicode codepoints.

artyom commented 1 year ago

@LukeShu note that json.Unmarshal documentation mentions this:

When unmarshaling quoted strings, invalid UTF-8 or invalid UTF-16 surrogate pairs are not treated as an error. Instead, they are replaced by the Unicode replacement character U+FFFD.

As json.Unmarshal and json.Valid share underlying code, your example looks like an expected behavior.

Perhaps there's a room for improving the documentation?

LukeShu commented 1 year ago

I understand that Unmarshal and Valid share underlying code, but Unmarshal being permissive of invalid input doesn't mean that Valid should identify it as valid (at least without its own note in the documentation).

And I discovered this because of inconsistency between the various functions, when I expected that even if they have quirks they'd at least be consistent because of sharing the underlying implementation. I'd discovered that json.Unmarshal(any)json.MarshalIndent() would convert the binary garbage to U+FFFD, while json.Indent() and json.Compact() would pass it through unchanged. I figured that differing behavior for invalid JSON was reasonable, but then was surprised to find that json.Valid() identified it as valid. (And that differing behavior, while surprising, I couldn't really say was a bug; while json.Valid() identifying it as valid is a clear bug to me.)

ianlancetaylor commented 1 year ago

CC @dsnet @mvdan

dsnet commented 1 year ago

It's unfortunate that the json package allows invalid UTF-8 since RFC 8259, section 8.1 clearly states that JSON must be UTF-8.

Given that we already verbally promise handling of invalid UTF-8 in Unmarshal, we should also document how it operates in Valid.