strconv: non-UTF8 bytes are being unescaped in strconv.Unquote

fffonion commented 2 years ago

What version of Go are you using (`go version`)?

$ go version
go version go1.17.2 darwin/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (`go env`)?

go env Output

$ go env
```
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/fffonion/Library/Caches/go-build"
GOENV="/Users/fffonion/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/fffonion/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/fffonion/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.17.2/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.17.2/libexec/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.17.2"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/fffonion/bbb/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug
```

What did you do?

package main

import (
    "fmt"
    "strconv"
)

func main() {
    // `"\147\\u0001"`
    in := []byte{34, 147, 92, 117, 48, 48, 48, 49, 34}

    str0, _ := strconv.Unquote(string(in))
    fmt.Println([]byte(str0))

}

It seems neither does UnquoteChar (https://github.com/golang/go/blob/go1.17.6/src/strconv/quote.go#L267) nor does its caller Unquote (https://github.com/golang/go/blob/go1.17.6/src/strconv/quote.go#L267) handles the error from utf8.DecodeRuneInString(s), so any single byte non-UTF8 sequence are being escaped to utf8.RuneError and stored in output.

In Python for example, such cases are handled properly:

a=bytes([147,92,117,48,48,48,49])
a.decode('raw_unicode_escape')
# '\x93\x01'

What did you expect to see?

[147 1]

What did you see instead?

[239 191 189 1]

robpike commented 2 years ago

This is how strings and UTF-8 work in Go. Every byte of illegal UTF-8 is converted to U+FFF8.

it's actually not all that clearly documented. Although we went to great pains to do this consistently between all the library routines and the range statement, and therefore by inclusion strconv.Unquote, the best documentation for this behavior is here: https://pkg.go.dev/unicode/utf8#DecodeRune. It also appears in https://go.dev/ref/spec#For_statements in the section about range clauses.

It should probably be made more visible.

But as far as this behavior is concerned, it is working as intended. I'll leave the issue open to see if anyone has ideas about the best way to make this easier to discover.

fffonion commented 2 years ago

@robpike Thanks for the reply! I came across this issue when trying to json unmarshal a part with non-printable bytes inside. Unquote is needed to unescape those unicode sequences like "\u0022", but those single byte character (like 0x99) are converted to utf8.RuneError. I have tested if those single byte are escaped to like "\x99" before passing to json.Unmarshal then strconv.Unquote is happy. So we are going to write our modified version of strconv.Unquote that just ignores such case.

fffonion commented 2 years ago

@robpike I noticed the case above might pointed to json package instead (https://github.com/golang/go/blob/go1.17.6/src/encoding/json/decode.go#L1306) where a similar approach is done as strconv.Unquote. However, per JSON RFC, putting unescaped characters like 0x93 unescaped, as shown in example, is allowed. Does it make sense to make the json package specificly allow this use case? Technically it could just be:

        default:
            rr, size := utf8.DecodeRune(s[r:])
            if rr == utf8.RuneError {
                b[w] = s[r]
                r += 1
                w += 1
            } else {
                r += size
                w += utf8.EncodeRune(b[w:], rr)
            }

robpike commented 2 years ago

The the error condition would need to be

size == 1 && rr == utf.RuneError

and that is certainly possible but I am not in a position to say whether it's the right thing to do. It would likely break some tests, but maybe not too many. Non-compliance with the RFC is arguably a reason to make such a change but this behavior would be unique within the Go environment and therefore would require discussion.

mvdan commented 2 years ago

When it comes to encoding/json, I'm afraid the current behavior is well documented:

When unmarshaling quoted strings, invalid UTF-8 or invalid UTF-16 surrogate pairs are not treated as an error. Instead, they are replaced by the Unicode replacement character U+FFFD.

Changing the default behavior would surely break programs out there in subtle ways. The general understanding with the Go standard library's compatibility guarantee is that we may be able to change undocumented behavior, as long as very few Go programs out there break because of the change. But changing documented behavior, especially if it could break programs, is almost always a breaking change we cannot do.

Note that the json encoder does allow disabling the replacement of invalid UTF-8 (see below), so one could imagine a proposal to also opt out of this feature in the decoder.

String values encode as JSON strings coerced to valid UTF-8, replacing invalid bytes with the Unicode replacement rune. So that the JSON will be safe to embed inside HTML Githubissues.
Githubissues is a development platform for aggregating issues.

golang / go