golang / protobuf

Go support for Google's protocol buffers
BSD 3-Clause "New" or "Revised" License
9.64k stars 1.58k forks source link

JSON Unmarshal of some Unicode escape sequence results in error #1576

Open denis-zyk opened 7 months ago

denis-zyk commented 7 months ago

What version of protobuf and what language are you using?

protoc --version
libprotoc 23.4

protoc-gen-go --version
protoc-gen-go v1.31.0

Golang lib:

google.golang.org/protobuf v1.30.0

What did you do? Steps to reproduce the behavior:

  1. Have a JSON string containing the following Unicode escape sequence \udca0, e.g. {"name":"'unsafe-eval'; object?src\udca0'none'"}.
  2. Unmarshal byte slice of JSON string in question to protobuf message using google.golang.org/protobuf/encoding/protojson package.
  3. Got error proto: syntax error (line 1:9): invalid escape code "'none'" in string.

Code snippet for reproducing issue:

package main

import (
    "encoding/json"
    "fmt"
    "google.golang.org/protobuf/encoding/protojson"

    "pkg.my.package/microservices/models"
)

func main() {
    j1 := []byte(`{"name":"'unsafe-eval'; object?src\uc2a0'none'"}`)    # unmarshal is OK
    j2 := []byte(`{"name":"'unsafe-eval'; object?src\udca0'none'"}`)    # unmarshal error in protojson

    // protojson lib

    // models.Client is a golang struct compiled using `protoc` binary, from Client protobuf message.
    // Protobuf message for the sake of reproducing has 1 string field:
    //   Name       string       `protobuf:"bytes,1,opt,name=name,proto3" json:"name"`
    outProto1 := &models.Client{}
    outProto2 := &models.Client{}

    unmarshaler := protojson.UnmarshalOptions{DiscardUnknown: true}

    err1 := unmarshaler.Unmarshal(j1, outProto1)
    err2 := unmarshaler.Unmarshal(j2, outProto2)

    fmt.Println("protojson lib")
    fmt.Println(string(j1), err1, outProto1)
    fmt.Println(string(j2), err2, outProto2)

    // core lib
    outCore1 := make(map[string]interface{})
    outCore2 := make(map[string]interface{})

    err1 = json.Unmarshal(j1, &outCore1)
    err2 = json.Unmarshal(j2, &outCore2)

    fmt.Println("core lib")
    fmt.Println(string(j1), err1, outCore1)
    fmt.Println(string(j2), err2, outCore2)
}

The example code snippet above produces the following output:

protojson lib
{"name":"'unsafe-eval'; object?src\uc2a0'none'"} <nil> name:"'unsafe-eval'; object?src슠'none'"
{"name":"'unsafe-eval'; object?src\udca0'none'"} proto: syntax error (line 1:9): invalid escape code "'none'" in string 
core lib
{"name":"'unsafe-eval'; object?src\uc2a0'none'"} <nil> map[name:'unsafe-eval'; object?src슠'none']
{"name":"'unsafe-eval'; object?src\udca0'none'"} <nil> map[name:'unsafe-eval'; object?src�'none']

What did you expect to see? Unmarshalled value name:"'unsafe-eval'; object?src�'none'". I.e. same unmarshalled string value as in Golang core JSON lib.

What did you see instead? Unpopulated message and an error proto: syntax error (line 1:9): invalid escape code "'none'" in string.

Anything else we should know about your project / environment? Reproduced & confirmed using the followed golang versions and environments:

Linux: go version go1.20.11 linux/amd64 macOS: go version go1.21.4 darwin/arm64

neild commented 7 months ago

U+DCA0 is a Unicode surrogate, and should not appear in valid UTF-8 text. A conformant UTF-8 decoder is required to reject input containing an unpaired surrogate; see "How do I convert an unpaired UTF-16 surrogate to UTF-8?" here: https://unicode.org/faq/utf_bom.html

I'm not sure about the history that led to encoding/json accepting this invalid input and protojson rejecting it, but I don't think protojson is doing the wrong thing here.

puellanivis commented 7 months ago

I think this is still generating an improper error message. It seems to be complaining that the invalid escape code is 'none' (which is not at all being used as an escape code), when it should probably be reporting the invalid escape code to be dca0, right?

neild commented 7 months ago

Yes, you're right; I wasn't looking at the error message. The error is definitely wrong, or at least confusing.

Also, the protojson parser expects a surrogate to be followed by \uXXXX. I'm not certain if we should be expecting the potential for input like \udca0 (the surrogate) followed by a single unescaped character with no \u prefix.

puellanivis commented 7 months ago

Yeah… not sure what the proper behavior should be in error conditions. But also, do we check at all that a Hi Surrogate is followed by a Lo Surrogate, and vice versa, or do we just check that surrogates occur in pairs?

I smell the familiar scent of a rabbit role into a pedantic reading of standards… 😩 At least the minimal solution here seems to be that if a surrogate isn’t followed by a surrogate, we report the escape itself as invalid? Or some sort of message that surrogates must be paired?

dsnet commented 7 months ago

The JSON mapping for protobuf follows RFC 7493, which requires strict checking of surrogate halves (see section 2.1 of RFC 7493). The "encoding/json" package only adheres to RFC 8259, which leaves split surrogate halves as undefined behavior (see section 8.2 of RFC 8259). As a side note, the "encoding/json/v2" draft proposal will target compliance with RFC 7293 by default.

I'm not certain if we should be expecting the potential for input like \udca0 (the surrogate) followed by a single unescaped character with no \u prefix.

A standalone surrogate half is invalid and should be rejected. The validity of a surrogate pair is determine using utf16.DecodeRune.

I don't see anything wrong with what protojson is doing. It correctly identified a surrogate half and was now expecting another escaped surrogate half. The error message says:

proto: syntax error (line 1:9): invalid escape code "'none'" in string

Strictly speaking, this error message is correct as 'none' is literally an invalid escape code.

dsnet commented 7 months ago

The proposed future "encoding/json/jsontext" package produces a better error message:

jsontext: invalid surrogate pair `\udca0'none'` within string

as it preserves the context that this is occurring within a surrogate pair.

puellanivis commented 7 months ago

Interesting, I see what you mean with the “'none' is an invalid escape code” message. It also just coincidentally has the same length one would expect a \uXXXX escape sequence to be.

I do like the error message from jsontext which makes it clear exactly what is actually going wrong.

lfolger commented 7 months ago

Improving the error message seems reasonasble to use and we are happy to accept a contribution.