Closed MioYamanaka closed 1 month ago
Do you know the behavior of the C++, Java, and Python implementations when fed this?

Does the Python implementation use encodebytes for encoding a proto message? Looking at the relevant Python code, it seems to add padding bytes according to (4 - len(encoded) % 4), so it's likely to add unnecessary padding in this situation… although using an online REPL, it seems this decodes "YQ\n==" to b'a', so it seems like that works. 🤔

PS: Also works for 1, 2, 3, 4, and 5 repeats of \n in the string.

PPS: Perhaps, in the alternative: if !strings.HasSuffix(s, "=") { enc = enc.WithPadding(base64.NoPadding) }? This wouldn't always be correct, but I feel like it's a safe assumption to try NoPadding if and only if there's no '=' at the end? I mean, we'll still get decode errors if it's applied inappropriately?
Thanks for the great bug report. I decided to verify this versus the "canonical" implementation (C++), by adding the following two tests:
TEST_P(JsonTest, NoNewline) {
  auto m = ToProto<TestMessage>("{ \"bytesValue\": \"YQ==\" }");
  ASSERT_OK(m);
  EXPECT_EQ(m->bytes_value(), "a");
}

TEST_P(JsonTest, Newline) {
  auto m = ToProto<TestMessage>("{ \"bytesValue\": \"YQ\n==\" }");
  ASSERT_OK(m);
  EXPECT_EQ(m->bytes_value(), "a");
}
The latter fails with:
[ RUN ] JsonTestSuite/JsonTest.Newline/0
protobuf/json/json_test.cc:495: Failure
Value of: m
Expected: is OK
Actual: INVALID_ARGUMENT: invalid JSON in proto3.TestMessage @ bytesValue: bytes, near 1:17 (offset 16): corrupt base64
I don't see any special newline handling in the repo either: https://github.com/protocolbuffers/protobuf/blob/77047d9cda0bd5478489d4b5a85f1907cfe7f4fa/src/google/protobuf/json/internal/parser.cc#L129-L139.
So I think that we should bring this to the upstream proto team to decide what to do (whether to accept this or not).
I've posted a message to https://groups.google.com/g/protobuf, but it's pending approval. Let's see what the team says.
@puellanivis in-lining responses below:
> Do you know the behavior of the C++, Java, and Python implementations when fed this?

I don't, though based on @aktau's message, this same behaviour does exist in C++ at least.
> Does the Python implementation use encodebytes for encoding a proto message? Looking at the relevant Python code, it seems to add padding bytes according to (4 - len(encoded) % 4), so it's likely to add unnecessary padding in this situation… although using an online REPL, it seems this decodes "YQ\n==" to b'a', so it seems like that works. 🤔

We ran into this issue because requests to our service can be made by passing the protobuf object we expect as a JSON object, and one of the fields we require is a base64-encoded byte string. The user has the flexibility to use whatever encoder they would like. I would think this is a pretty common way of making requests to services that use protojson?
> PS: Also works for 1, 2, 3, 4, and 5 repeats of \n in the string.
> PPS: Perhaps, in the alternative: if !strings.HasSuffix(s, "=") { enc = enc.WithPadding(base64.NoPadding) }? This wouldn't always be correct, but I feel like it's a safe assumption to try NoPadding if and only if there's no '=' at the end? I mean, we'll still get decode errors if it's applied inappropriately?

From my own testing, the change I proposed seems to work, but I'm open to different implementations - what would make the alternative you suggested here not always be correct?
@aktau thank you for verifying the C++ implementation!
One thing I should clarify - it's not that byte strings with newlines can't be decoded in general. The existing implementation is able to decode YQ\n\n\n\n== to a even though it has newlines, because the length of that string is divisible by 4.
I've submitted a code change request in Gerrit in case it's easier to talk about implementation there: https://go-review.googlesource.com/c/protobuf/+/597796
No-one has replied on the thread yet :( (https://groups.google.com/g/protobuf/c/oGiRW-ezToo).
I'm discussing this with the upstream developers internally now. Though no conclusion has been reached yet, I'm not confident this will be accepted by all implementations, and we favour consistency over features in this regard. While I agree it's not optimal from a UX perspective, this issue can be worked around fairly efficiently for those experiencing it:
package main

import (
	"bytes"
	"encoding/base64"
)

func main() {
	s := "YQ\n=="
	current([]byte(s))
	proposalExternal([]byte(s))
	proposalInternal([]byte(s))
}

// Current implementation of the first part of protojson.Unmarshal.
func current(src []byte) {
	enc := base64.StdEncoding
	if bytes.ContainsAny(src, "-_") {
		println("url?!")
		enc = base64.URLEncoding
	}
	if len(src)%4 != 0 { // <--
		println("Should have no padding")
		enc = enc.WithPadding(base64.NoPadding)
	}
	maxlen := enc.DecodedLen(len(src))
	dst := make([]byte, maxlen)
	if n, err := enc.Decode(dst, src); err == nil {
		print("Managed to decode - ")
		println(string(dst[:n]))
	} else {
		print("Failed to decode - ")
		println(err)
	}
}

// The first proposal handles this internally within [protojson.Unmarshal]
// using [strings.Count], so that it can handle spurious newlines in the input.
func proposalInternal(src []byte) {
	// See https://github.com/golang/protobuf/issues/1626#issue-239601714
}

// The second proposal handles input-with-newlines using a preprocessing step.
// The advantage is that it does not require changing the protobuf library. The
// disadvantage is that people hitting this problem must find out about it and
// add their own preprocessing step.
//
// WARNING: modifies the input byte slice.
func proposalExternal(src []byte) {
	src = stripNewlines(src)
	current(src)
}

func stripNewlines(src []byte) []byte {
	// NOTE: for long byte slices and long runs of non-whitespace, it may be more
	// efficient to use a loop over [bytes.IndexByte], which can use SIMD to skip
	// over large ranges of non-whitespace. But, ISTM that *if* the client sends
	// base64-with-whitespace, the whitespaces won't be separated by more than XX
	// characters. Hence we choose the simple approach.
	var i int
	for _, b := range src {
		if b != '\n' {
			src[i] = b
			i++
		}
	}
	return src[:i]
}
Output:
go run main.go
Should have no padding
Failed to decode - (0x49e1f8,0x4f9438)
Managed to decode - a
Managed to decode - a
Is such an approach feasible for the time being?
@aktau just want to reiterate my earlier comment that newlines aren't the issue in themselves - the issue only happens when there are just enough of them to make the string length not divisible by 4.

With that in mind, it feels like the current implementation is buggy. Based on @puellanivis's suggestion, I have made the code change request check for = in the suffix instead.
Hey @MioYamanaka
Thank you for your CL and report.
> We discovered this issue when a user of our service made a request where the bytes field was encoded using Python's encodebytes method which inserts a newline every 76 characters (as per RFC 2045).

RFC 2045 specifies MIME (Multipurpose Internet Mail Extensions); the newlines every 76 characters are for quoted-printable encoding, which in turn is designed to be safe when transmitted with SMTP.
The protojson package allows:
In Python, you should use the base64.b64encode function (not base64.encodebytes) to encode data accordingly.
> We ran into this issue because requests to our service can be made by passing a protobuf object we expect as a JSON object, and one of the fields we require is a base64 encoded byte string. The user has the flexibility to use whatever encoder they would like. I would think this is a pretty common way of making requests to services that use protojson?

I would argue the user should use base64 encoding as specified by RFC 4648, not quoted-printable email encoding.
I’m hesitant to accept your CL because I don’t see the use-case — once you switch the code from the wrong encoding function to the correct one, there isn’t a motivation for the change anymore.
We definitely can tell our users not to use base64.encodebytes, but I am wondering why it wouldn't make sense to make this change anyway - why is checking whether the string length is divisible by 4 a better indicator than checking for the padding character at the end of the string when determining whether a base64 string has no padding? Especially if valid base64 strings can be created where the length is not divisible by 4?
Also, is it documented anywhere that the protojson package or Protobuf standard only allows those encodings? That would be helpful as users of the package.
> We definitely can tell our users not to use base64.encodebytes, but I am wondering why it wouldn't make sense to make this change anyway - why is checking whether the string length is divisible by 4 a better indicator than checking for the padding character at the end of the string when determining whether a base64 string has no padding? Especially if valid base64 strings can be created where the length is not divisible by 4?

I'm not arguing for the current code — on a technical level, it's fine to change it, I think.

What I'm arguing against is supporting MIME encoding.
> Also, is it documented anywhere that the protojson package or Protobuf standard only allows those encodings? That would be helpful as users of the package.
The protojson package documentation links to https://protobuf.dev/programming-guides/proto3#json, which states:
> JSON value will be the data encoded as a string using standard base64 encoding with paddings. Either standard or URL-safe base64 encoding with/without paddings are accepted.
The term “base64” commonly refers to RFC 4648, and has nothing to do with MIME.
The fact that Python offers a MIME-related function in its base64 package is unfortunate, I think.
Yeah, I had similar reservations about accepting newlines just because it’s in the MIME encoding.
> Does the python implementation use encodebytes for encoding a proto message?

I was specifically asking this because what matters more is that we support what the Python protobuf implementation actually does. "Side-effect" support of certain deviations is almost guaranteed by the nature of complex systems, but deliberately supporting deviations can quickly become an exhausting task.
If the official Python proto implementation does that, it should certainly start using the non-MIME functions highlighted by @stapelberg above.
I should note that RFC 4648, section 3.3 says:
> Implementations MUST reject the encoded data if it contains characters outside the base alphabet when interpreting base-encoded data, unless the specification referring to this document explicitly states otherwise.
As I understand it, the protobuf documentation does not call out any specific non-alphabet characters that are to be ignored. Thus, newlines should technically be rejected by a strict interpretation of the spec.
It is an unfortunate artifact of the Go "base64" implementation that it does not strictly adhere to the RFC as it ignores newlines by default. See golang/go#53845.
However, given that the Go "base64" ignores newlines and "protojson" is already implicitly ignoring newlines by transitive dependency on "base64", then I guess we should at least implement it correctly in "protojson". As such, I believe this line:
if len(s)%4 != 0 { ... }
should really be:
if (len(s)-strings.Count(s, "\n")-strings.Count(s, "\r"))%4 != 0 { ... }
Note that the Go "base64" package ignores both newlines ('\n') and carriage returns ('\r').
As an aside, I think the Python implementation is quite strange. What is the utility of injecting newlines when it is always escaped in JSON? It's increasing the payload size for no benefit (as it doesn't help readability either).
This is not actually the behavior of the python implementation. From the original poster:
> The user has the flexibility to use whatever encoder they would like.

It seemed to me that they were at least implying that some client is encoding their JSON manually, rather than using the official Python JSON protobuf implementation. Indeed, as you mention, it seems odd that any implementation would inject the newlines: they have to get escaped, which increases the payload size without even breaking up lines, since the newline is now escaped.
As for:

> if (len(s)-strings.Count(s, "\n")-strings.Count(s, "\r"))%4 != 0 { ... }

As I've already mentioned, strings.Count() is going to run in O(n) time, whereas simply checking for a = at the end achieves the same behavior for all valid base64, padded and unpadded, and in O(1) time. As well, this seems like the way I would personally, as a human, check to see whether a base64 text were padded or not.

PS: This also defers any decisions on which characters need to be ignored or errored on to the base64 implementation itself. Really, we just care here about telling it whether to expect padding, not about gauging validity. Once we start counting \n and \r, there's the question of whether we should even be counting a \r that doesn't occur as part of a CRLF combo, etc.
We already do an O(n) check to detect whether to use StdEncoding or URLEncoding:

https://github.com/protocolbuffers/protobuf-go/blob/d4621760eaa24af1d915dd112919dbb53f94db01/encoding/protojson/decode.go#L478

So we could fold the check into that.
Alternatively, we could gather data on which representation is most common. For example, we could just try StdEncoding without padding first, and if it fails, fall back on the other representations.
🤔 Good points. Though, as I note, I'm still unsure why we need to do anything beyond checking for a = as a suffix. All valid and padded base64 will end with =, and all valid and unpadded base64 won't. 🤷♀️

Validity of the base64 encoding itself should otherwise be an implementation detail of the base64 library used.

The current len(s) % 4 != 0 was indirectly a true statement about valid, minimal, and padded base64 that had about the same computational complexity as checking for a = suffix — perhaps even less. So, it was a pretty reasonable solution. But now, if we're to switch to iterating through and counting up newlines and carriage returns… it just seems like a whole lot more work than the other true statement about any valid and padded base64.
> I'm still unsure why we need to do anything beyond checking for a = as a suffix
I think you are mostly correct. The correct logic would be:
if !strings.HasSuffix(strings.TrimRight(s, "\r\n"), "=") {
enc = enc.WithPadding(base64.NoPadding)
}
The reason is because "\r\n" are characters that are entirely ignored by the Go "base64" package (even if they appear in padding):
fmt.Printf("%x\n", must.Get(base64.StdEncoding.AppendDecode(nil, []byte("\rA\nA\n=\r=\n")))) // prints 00
I'm not sure the trim is even particularly necessary. There should never be a case where MIME rules would force a newline or carriage return after the =, right? Because if there were no following characters, the NL/CR would be unnecessary to add, right?

I grant that the Go base64 package ignores newlines and carriage returns regardless of where they appear, but maybe we should not be looking to support literally anything accepted by that package, right? That's a bit too much of an implementation detail specific to base64.

From our perspective, we don't expect anything to end with a newline or carriage return? I mean, we already weren't even expecting them in the first place. 😂
My interpretation of RFC 4648, section 3.3 is that if an implementation chooses to ignore certain non-alphabet characters, that's more or less equivalent to removing all such characters from the input, regardless of whether they appear in the padding or not. If we're going to ignore '\n' and '\r', then we should at least be that faithful to the RFC.
🤷♀️ Sounds reasonable. I was just trying to think of how we can do less work. :trollface:
So to confirm, should I update my change request in Gerrit to be the code in the earlier comment (by @dsnet) that trims \r\n?
If we want to go forward with a code change, then yes, that sounds like the best approach.
But I’m still unclear on what your motivation for changing the code is at this point. We already established that you were using the wrong encoding function and should switch over to the correct one. Once you switch to the correct function, what remaining motivation is there to change anything in Go Protobuf? (In other words: Could we not just close the bug and move on?)
Based on the conversation above between @dsnet and @puellanivis it seemed like the conclusion was that changing the code would be better. I just wanted to clarify if that will be done on the protojson side, or if I should be adjusting my change request to have that change.
To recap:
Notably, if we don’t change anything, other people making the same mistake (reaching for MIME encoding instead of base64 encoding) will encounter the same roadblock, which will lead them to switching to the correct encoding function.
I think not changing anything is the most appealing option, hence I’m closing this issue.
What version of protobuf and what language are you using?

Version: v1.34.1
What did you do? If possible, provide a recipe for reproducing the error. A complete runnable program is good, with .proto and .go source code.

This is a very simplified version of the code we were running:
What did you expect to see?
No output.
What did you see instead? Make sure you include information that can help us debug (full error message, exception listing, stack trace, logs).
"proto: (line 1:14): invalid value for bytes type: \"YQ\\n==\""
We narrowed it down to it being a result of this if-condition: https://github.com/protocolbuffers/protobuf-go/blob/d4621760eaa24af1d915dd112919dbb53f94db01/encoding/protojson/decode.go#L481-L483
The issue seems to be that if the byte string length is not divisible by 4, the current implementation trusts that there is no padding (=) in the byte string. However, if there is a newline in the byte string, it changes the evaluated length of the string, so the implementation assumes there is no padding even when there is.

In the above example, YQ== is a padded encoding, with length 4, of the character a. However, because there's a \n thrown in, the length is determined to be 5, and the decoder assumes there is no padding. As a result, when it tries to decode YQ\n==, it assumes there is no padding and believes the byte string to be invalid.

We discovered this issue when a user of our service made a request where the bytes field was encoded using Python's encodebytes method, which inserts a newline every 76 characters (as per RFC 2045).
Here's a fully runnable example that compares the current implementation with a potential change that would prevent those byte strings from being determined invalid.
The above program successfully decodes the string into a in the proposed change.

Anything else we should know about your project / environment?
Not that I can think of, but please let me know if I should provide any additional info.