Closed MioYamanaka closed 1 month ago
Do you know the behavior of the C++, Java, and Python implementations when fed this?

Does the Python implementation use encodebytes for encoding a proto message? Looking at the relevant Python code, it seems to add padding bytes according to (4 - len(encoded) % 4), so it's likely to add unnecessary padding in this situation… although using an online REPL, it seems this decodes "YQ\n==" to b'a', so it seems like that works. 🤔

PS: Also works for 1, 2, 3, 4, and 5 repeats of \n in the string.

PPS: Perhaps, in the alternative: if !strings.HasSuffix(s, "=") { enc = enc.WithPadding(base64.NoPadding) }? This wouldn't always be correct, but I feel like it's a safe assumption to try NoPadding if and only if there's no '=' at the end? I mean, we'll still get decode errors if it's applied inappropriately?
Thanks for the great bug report. I decided to verify this versus the "canonical" implementation (C++), by adding the following two tests:
TEST_P(JsonTest, NoNewline) {
  auto m = ToProto<TestMessage>("{ \"bytesValue\": \"YQ==\" }");
  ASSERT_OK(m);
  EXPECT_EQ(m->bytes_value(), "a");
}

TEST_P(JsonTest, Newline) {
  auto m = ToProto<TestMessage>("{ \"bytesValue\": \"YQ\n==\" }");
  ASSERT_OK(m);
  EXPECT_EQ(m->bytes_value(), "a");
}
The latter fails with:
[ RUN ] JsonTestSuite/JsonTest.Newline/0
protobuf/json/json_test.cc:495: Failure
Value of: m
Expected: is OK
Actual: INVALID_ARGUMENT: invalid JSON in proto3.TestMessage @ bytesValue: bytes, near 1:17 (offset 16): corrupt base64
I don't see any special newline handling in the repo either: https://github.com/protocolbuffers/protobuf/blob/77047d9cda0bd5478489d4b5a85f1907cfe7f4fa/src/google/protobuf/json/internal/parser.cc#L129-L139.
So I think that we should bring this to the upstream proto team to decide what to do (whether to accept this or not).
I've posted a message to https://groups.google.com/g/protobuf, but it's pending approval. Let's see what the team says.
@puellanivis in-lining responses below:
> Do you know the behavior of the C++, Java, and Python implementations when fed this?

I don't, though based on @aktau's message, this same behaviour does exist in C++ at least.
> Does the Python implementation use encodebytes for encoding a proto message? Looking at the relevant Python code, it seems to add padding bytes according to (4 - len(encoded) % 4), so it's likely to add unnecessary padding in this situation… although using an online REPL, it seems this decodes "YQ\n==" to b'a', so it seems like that works. 🤔

We ran into this issue because requests to our service can be made by passing the protobuf object we expect as a JSON object, and one of the fields we require is a base64-encoded byte string. The user has the flexibility to use whatever encoder they would like. I would think this is a pretty common way of making requests to services that use protojson?
> PS: Also works for 1, 2, 3, 4, and 5 repeats of \n in the string.
> PPS: Perhaps, in the alternative: if !strings.HasSuffix(s, "=") { enc = enc.WithPadding(base64.NoPadding) }? This wouldn't always be correct, but I feel like it's a safe assumption to try NoPadding if and only if there's no '=' at the end? I mean, we'll still get decode errors if it's applied inappropriately?

From my own testing, the change I proposed seems to work, but I'm open to different implementations - what would make the alternative you suggested here not always be correct?
@aktau thank you for verifying the C++ implementation!
One thing I should clarify - it's not that byte strings with newlines can't be decoded in general. The existing implementation is able to decode YQ\n\n\n\n== to a even though it has newlines, because the length of that string is divisible by 4.
I've submitted a code change request in Gerrit in case it's easier to talk about implementation there: https://go-review.googlesource.com/c/protobuf/+/597796
No-one has replied on the thread yet :( (https://groups.google.com/g/protobuf/c/oGiRW-ezToo).
I'm discussing this with the upstream developers internally now. Though no conclusion has been reached yet, I'm not confident this will be accepted by all implementations, and we favour consistency over features in this regard. While I agree it's not optimal from a UX perspective, this issue can be worked around fairly efficiently for those experiencing it:
package main

import (
	"bytes"
	"encoding/base64"
)

func main() {
	s := "YQ\n=="
	current([]byte(s))
	proposalExternal([]byte(s))
	proposalInternal([]byte(s))
}

// Current implementation of the first part of protojson.Unmarshal.
func current(src []byte) {
	enc := base64.StdEncoding
	if bytes.ContainsAny(src, "-_") {
		println("url?!")
		enc = base64.URLEncoding
	}
	if len(src)%4 != 0 { // <--
		println("Should have no padding")
		enc = enc.WithPadding(base64.NoPadding)
	}
	maxlen := enc.DecodedLen(len(src))
	dst := make([]byte, maxlen)
	if n, err := enc.Decode(dst, src); err == nil {
		print("Managed to decode - ")
		println(string(dst[:n]))
	} else {
		print("Failed to decode - ")
		println(err)
	}
}

// The first proposal handles this internally within [protojson.Unmarshal]
// using [strings.Count], so that it can handle spurious newlines in the input.
func proposalInternal(src []byte) {
	// See https://github.com/golang/protobuf/issues/1626#issue-239601714
}

// The second proposal handles input-with-newlines using a preprocessing step.
// The advantage is that it does not require changing the protobuf library. The
// disadvantage is that people hitting this problem must find out about it and
// add their own preprocessing step.
//
// WARNING: modifies the input byte slice.
func proposalExternal(src []byte) {
	src = stripNewlines(src)
	current(src)
}

func stripNewlines(src []byte) []byte {
	// NOTE: for long byte slices and long runs of non-whitespace, it may be more
	// efficient to use a loop over [bytes.IndexByte], which can use SIMD to skip
	// over large ranges of non-whitespace. But, ISTM that *if* the client sends
	// base64-with-whitespace, the whitespaces won't be separated by more than XX
	// characters. Hence we choose the simple approach.
	var i int
	for _, b := range src {
		if b != '\n' {
			src[i] = b
			i++
		}
	}
	return src[:i]
}
Output:
go run main.go
Should have no padding
Failed to decode - (0x49e1f8,0x4f9438)
Managed to decode - a
Managed to decode - a
Is such an approach feasible for the time being?
@aktau just want to reiterate my earlier comment that newlines aren't the issue in themselves - the issue only happens when there are just enough of them to make the string length not divisible by 4.

With that in mind, it feels like the current implementation is buggy. Based on @puellanivis's suggestion, I have made the code change request check for = in the suffix instead.
Hey @MioYamanaka
Thank you for your CL and report.
> We discovered this issue when a user of our service made a request where the bytes field was encoded using Python's encodebytes method which inserts a newline every 76 characters (as per RFC 2045).

RFC 2045 specifies MIME (Multipurpose Internet Mail Extensions); the newlines every 76 characters are for quoted-printable encoding, which in turn is designed to be safe when transmitted with SMTP.
The protojson package allows:
In Python, you should use the base64.b64encode function (not base64.encodebytes) to encode data accordingly.
> We ran into this issue because requests to our service can be made by passing a protobuf object we expect as a JSON object, and one of the fields we require is a base64 encoded byte string. The user has the flexibility to use whatever encoder they would like. I would think this is a pretty common way of making requests to services that use protojson?

I would argue the user should use base64 encoding as specified by RFC 4648, not quoted-printable email encoding.
I’m hesitant to accept your CL because I don’t see the use-case — once you switch the code from the wrong encoding function to the correct one, there isn’t a motivation for the change anymore.
We definitely can tell our users not to use base64.encodebytes, but I am wondering why it wouldn't make sense to make this change anyway - why is checking whether the string length is divisible by 4 a better indicator than checking for the padding character at the end of the string when determining whether a base64 string has no padding? Especially if valid base64 strings can be created where the length is not divisible by 4?
Also, is it documented anywhere that the protojson package or Protobuf standard only allows those encodings? That would be helpful as users of the package.
> We definitely can tell our users not to use base64.encodebytes, but I am wondering why it wouldn't make sense to make this change anyway - why is checking whether the string length is divisible by 4 a better indicator than checking for the padding character at the end of the string when determining whether a base64 string has no padding? Especially if valid base64 strings can be created where the length is not divisible by 4?

I'm not arguing for the current code — on a technical level, it's fine to change it, I think.

What I'm arguing against is supporting MIME encoding.
> Also, is it documented anywhere that the protojson package or Protobuf standard only allows those encodings? That would be helpful as users of the package.
The protojson package documentation links to https://protobuf.dev/programming-guides/proto3#json, which states:
> JSON value will be the data encoded as a string using standard base64 encoding with paddings. Either standard or URL-safe base64 encoding with/without paddings are accepted.
The term “base64” commonly refers to RFC 4648, and has nothing to do with MIME.
The fact that Python offers a MIME-related function in its base64 package is unfortunate, I think.
Yeah, I had similar reservations about accepting newlines just because it’s in the MIME encoding.
> Does the python implementation use encodebytes for encoding a proto message?

I was specifically asking this because what matters more is that we support what the Python protobuf implementation actually does. "Side-effect" support of certain deviations is almost guaranteed by the nature of complex systems, but deliberately supporting deviations can quickly become an exhausting task.
If the official Python proto implementation does that, it should certainly start using the non-MIME functions highlighted by @stapelberg above.
I should note that RFC 4648, section 3.3 says:
> Implementations MUST reject the encoded data if it contains characters outside the base alphabet when interpreting base-encoded data, unless the specification referring to this document explicitly states otherwise.
As I understand it, the protobuf documentation does not call out any specific non-alphabet characters that are to be ignored. Thus, newlines should technically be rejected by a strict interpretation of the spec.
It is an unfortunate artifact of the Go "base64" implementation that it does not strictly adhere to the RFC as it ignores newlines by default. See golang/go#53845.
However, given that the Go "base64" ignores newlines and "protojson" is already implicitly ignoring newlines by transitive dependency on "base64", then I guess we should at least implement it correctly in "protojson". As such, I believe this line:
if len(s)%4 != 0 { ... }
should really be:
if (len(s)-strings.Count(s, "\n")-strings.Count(s, "\r"))%4 != 0 { ... }
Note that the Go "base64" package ignores both newlines ('\n') and carriage returns ('\r').
As an aside, I think the Python implementation is quite strange. What is the utility of injecting newlines when it is always escaped in JSON? It's increasing the payload size for no benefit (as it doesn't help readability either).
This is not actually the behavior of the python implementation. From the original poster:
> The user has the flexibility to use whatever encoder they would like.

It seemed to me that they were at least implying that some client is encoding their JSON manually, rather than using the official Python JSON protobuf implementation. Indeed, as you mention, it seems odd that any implementation would inject the newlines: they have to get escaped, which increases the payload size without even breaking up lines, since the newline is now escaped.
As for:

> if (len(s)-strings.Count(s, "\n")-strings.Count(s, "\r"))%4 != 0 { ... }

As I've already mentioned, strings.Count() is going to run in O(n) time, whereas simply checking for a = at the end achieves the same behavior for all valid base64, padded and unpadded, and in O(1) time. As well, this seems like the way I would personally, as a human, check to see whether a base64 text were padded or not.

PS: This also defers any decisions on which characters need to be ignored or errored on to the base64 implementation itself. Really, we just care here about telling it whether to expect padding, not about gauging validity. Once we start counting \n and \r, there's the question of whether we should even be counting a \r that doesn't occur as part of a CRLF combo, etc.
We already do an O(n) check to detect whether to use StdEncoding or URLEncoding:

https://github.com/protocolbuffers/protobuf-go/blob/d4621760eaa24af1d915dd112919dbb53f94db01/encoding/protojson/decode.go#L478

So we could fold the check into that.
Alternatively, we could gather data on which representation is most common. For example, we could just try StdEncoding without padding first, and if it fails, fall back on the other representations.
🤔 Good points. Though, as I note, I'm still unsure why we need to do anything beyond checking for a = as a suffix. All valid and padded base64 will end with =, and all valid and unpadded base64 won't. 🤷♀️

Validity of the base64 encoding itself should otherwise be an implementation detail of the base64 library used.

The current len(s) % 4 != 0 was indirectly a true statement about valid, minimal, and padded base64 that had about the same computational complexity as checking for a = suffix — perhaps even less. So, it was a pretty reasonable solution. But now, if we're to switch to iterating through and counting up newlines and carriage returns… it just seems like a whole lot more work than the other true statement about any valid and padded base64.
> I'm still unsure why we need to do anything beyond checking for a = as a suffix
I think you are mostly correct. The correct logic would be:
if !strings.HasSuffix(strings.TrimRight(s, "\r\n"), "=") {
enc = enc.WithPadding(base64.NoPadding)
}
The reason is because "\r\n" are characters that are entirely ignored by the Go "base64" package (even if they appear in padding):
fmt.Printf("%x\n", must.Get(base64.StdEncoding.AppendDecode(nil, []byte("\rA\nA\n=\r=\n")))) // prints 00
I'm not sure the trim is even particularly necessary. There should never be a case where MIME rules would force a newline or carriage return after the =, right? Because if there were no following characters, the NL/CR would be unnecessary to add, right?

I grant that the Go base64 package ignores newlines and carriage returns regardless of where they appear, but maybe we should not be looking to support literally anything accepted by that package, right? That's a bit too much of an implementation detail specific to base64.

From our perspective, we don't expect anything to end with a newline or carriage return? I mean, we already weren't even expecting them in the first place. 😂
My interpretation of RFC 4648, section 3.3 is that if an implementation chooses to ignore certain non-alphabet characters, that's more or less equivalent to removing all such characters from the input, regardless of whether they appear in the padding or not. If we're going to ignore '\n' and '\r', then we should at least be that faithful to the RFC.
🤷♀️ Sounds reasonable. I was just trying to think of how we can do less work. :trollface:
So to confirm, should I update my change request in Gerrit to be the code in the earlier comment (by @dsnet) that trims \r\n?
If we want to go forward with a code change, then yes, that sounds like the best approach.
But I’m still unclear on what your motivation for changing the code is at this point. We already established that you were using the wrong encoding function and should switch over to the correct one. Once you switch to the correct function, what remaining motivation is there to change anything in Go Protobuf? (In other words: Could we not just close the bug and move on?)
Based on the conversation above between @dsnet and @puellanivis it seemed like the conclusion was that changing the code would be better. I just wanted to clarify if that will be done on the protojson side, or if I should be adjusting my change request to have that change.
To recap:
Notably, if we don’t change anything, other people making the same mistake (reaching for MIME encoding instead of base64 encoding) will encounter the same roadblock, which will lead them to switching to the correct encoding function.
I think not changing anything is the most appealing option, hence I’m closing this issue.
What version of protobuf and what language are you using?

Version: v1.34.1
What did you do? If possible, provide a recipe for reproducing the error. A complete runnable program is good, with .proto and .go source code.

This is a very simplified version of the code we were running:
What did you expect to see?
No output.
What did you see instead? Make sure you include information that can help us debug (full error message, exception listing, stack trace, logs).
"proto: (line 1:14): invalid value for bytes type: \"YQ\\n==\""
We narrowed it down to it being a result of this if-condition: https://github.com/protocolbuffers/protobuf-go/blob/d4621760eaa24af1d915dd112919dbb53f94db01/encoding/protojson/decode.go#L481-L483
The issue seems to be that if the byte string length is not divisible by 4, the current implementation trusts that there is no padding (=) in the byte string. However, if there is a newline in the byte string, it changes the evaluated length of the string, so the implementation assumes there is no padding even when there is.

In the above example, YQ== is a padded encoding, with length 4, of the character a. However, because there's a \n thrown in, the length is determined to be 5, and the decoder assumes there is no padding. As a result, when it tries to decode YQ\n==, it assumes there is no padding and believes the byte string to be invalid.

We discovered this issue when a user of our service made a request where the bytes field was encoded using Python's encodebytes method, which inserts a newline every 76 characters (as per RFC 2045).
Here's a fully runnable example that compares the current implementation with a potential change that would prevent those byte strings from being determined invalid.
The above program successfully decodes the string into a in the proposed change.

Anything else we should know about your project / environment?
Not that I can think of, but please let me know if I should provide any additional info.