jhump / protoreflect

Reflection (Rich Descriptors) for Go Protocol Buffers
Apache License 2.0
1.35k stars · 173 forks

protoparse: ParseFiles doesn't escape non-ascii characters in string values #441

Open WillAbides opened 3 years ago

WillAbides commented 3 years ago

Comparing ParseFiles and protoc for the file below shows this difference:

        "options": protocmp.Message{
            "@type":                s"google.protobuf.FileOptions",
-           "java_outer_classname": string("my\xbcvalue"),
+           "java_outer_classname": string("my�value"),
        },

protoc seems to replace characters above 0x7f in option values with their \x escape code, but ParseFiles does not.

syntax = "proto3";

package foo;
option java_outer_classname= "my�value";

The � above is a raw 0xbc byte

jhump commented 3 years ago

I think your example file may have been malformed when you pasted it into a GitHub comment. What I see is that � is the Unicode replacement character (U+FFFD), not 0xbc. (Which is expected and typical when you try to put illegal binary data into a UTF-8-encoded text document.)

Since protobuf source is supposed to be UTF-8, I'm surprised it would accept a literal 0xbc, since that is invalid UTF-8 (and, in fact, I would expect it to be turned into the replacement character, since that is the purpose of the replacement character).

Protoc definitely does not encode such values with an escape code. That is your diff library showing you the raw byte value as an escape sequence. The actual issue is that protoparse is seeing the invalid UTF-8 encoded value and silently converting it to the replacement character (because that is usually how UTF-8 handling is done). Apparently, protoc does not and allows unescaped data that is invalid UTF-8. (TBH, that smells like a bug in protoc's handling of input -- it should probably use the Unicode replacement character or outright reject the input file due to bad encoding.)

In any event, I think I know a reasonably simple fix to make protoparse match protoc. But I'll actually file a bug against protoc first, to see what they think the correct behavior should be.

WillAbides commented 3 years ago

You're right, it's the diff library doing the escaping. That explains some inconsistent behavior I was seeing.

Given that, it really does seem like it's protoc doing the wrong thing here and protoreflect shouldn't try to emulate it.

jhump commented 3 years ago

I've filed a bug with the protobuf project: https://github.com/protocolbuffers/protobuf/issues/9175

Depending on the response to that, I could update protoparse. I can see a couple of possible things I'd need to do:

  1. Reject source with invalid UTF-8 input, instead of silently converting invalid data to Unicode replacement characters.
  2. Even if the source is valid UTF-8 (e.g. it uses the escape sequence \xbc), the string option's data should be rejected if it is not itself valid UTF-8.
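A minimal sketch of what check 2 could look like, using a hypothetical validateStringOption helper (this is not protoparse's actual API, just an illustration of the UTF-8 validity test):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// validateStringOption is a hypothetical helper illustrating check 2:
// even when the source file itself is valid UTF-8 (the bad byte written
// as the escape sequence \xbc), the decoded string option value must
// also be valid UTF-8.
func validateStringOption(name, value string) error {
	if !utf8.ValidString(value) {
		return fmt.Errorf("option %s: string value is not valid UTF-8", name)
	}
	return nil
}

func main() {
	// "my\xbcvalue" is what the escape \xbc in a .proto file decodes to:
	// a string containing a raw, invalid 0xbc byte.
	if err := validateStringOption("java_outer_classname", "my\xbcvalue"); err != nil {
		fmt.Println(err) // rejected: not valid UTF-8
	}
}
```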

I think all of the runtimes reject string data if it is not UTF-8, so it's definitely strange that protoc accepts it. As for it accepting the bad byte directly in the source, I suspect that is because the tokenizer actually reads the input byte-by-byte instead of decoding UTF-8. The source is ostensibly UTF-8, since the only allowed byte-order-marker prefix is a UTF-8-encoded marker; the rest of the source is likely expected to be 7-bit ASCII...