golang / protobuf

Go support for Google's protocol buffers
BSD 3-Clause "New" or "Revised" License
9.64k stars 1.58k forks source link

Tracking issue: Fully Lazy Extension decoding #1609

Closed stapelberg closed 1 month ago

stapelberg commented 2 months ago

I’m filing this issue to provide some background for a series of changes I’m about to send. What follows are excerpts from a number of different internal documents and investigations, in the hope that they provide enough background to follow along at a high level. Let me know if anyone is super interested in some detail for some reason, and I can expand.

Background

Currently, Go Protobuf supports lazy decoding for extensions, but only when compiled with support for the feature: see internal/flags.LazyUnmarshalExtensions, which is enabled when building with the protolegacy build flag.

However, the current lazy decoding support is not lazy across proto.Size() and proto.Marshal() calls.

This means that programs such as logs analysis pipelines benefit from lazy extension decoding, as they typically don’t read many extensions (or read no extensions at all).

Programs like proxy servers, which read, maybe extend and then write the same proto message, do not benefit from lazy extension decoding, as they will need to lazily decode the extension just to encode it immediately afterwards.

Because we’re spending a significant cost in this code path (decoding just to re-encode), we’d like to change the code to support fully lazy extensions, meaning they will be encoded with the same bytes that they were read from the wire.

Non-minimal wire format

One complication with this effort is what we call non-minimal wire format:

When encoding Protobuf messages, there is one minimal wire format size and a number of larger non-minimal wire formats that decode to the same message.

Non-minimal wire format refers to scenarios like non-repeated fields appearing multiple times, non-optimal varint encoding, packed repeated fields that appear non-packed on the wire and others.

We can encounter non-minimal wire format in different scenarios:

While we believe that non-minimal wire format is rare in general, we do have at least one report of non-minimal wire format (the kind where a non-repeated field is present multiple times) encountered in our production systems (more investigation pending).

Shrinking messages

When lazy decoding is enabled, it might be surprising at first that messages can shrink: accessing a message (typically expected to be a read-only operation) needs to lazily decode the message and can thereby modify how Go Protobuf will re-encode the decoded message.

Consider the following sequence of events:

  1. Call proto.Size() on a message with lazy extensions that use non-minimal wire format.
  2. Access the extension, thereby causing lazy decoding, thereby causing normalization.
  3. Now, proto.Size() reports fewer bytes than before!

Encoding invalid wire format

When a sub-message has to be encoded, Protobuf’s Tag-Length-Value wire format requires that we put the correct size on the wire before the message contents. [Side note: Other implementations like upb follow a back-to-front approach which sidesteps this problem. We have thoroughly investigated adopting this approach for Go Protobuf but discarded the option because of the complexity of supporting back-to-front marshaling.]

In today’s implementation, the marshaler calculates the size of each sub-message before recursing into the message marshaling function. To avoid duplicate sizing costs, the size is cached inside the Protobuf message struct after the first calculation in the sizeCache field.

There are edge cases where a message can shrink between populating the size cache and encoding it, which would result in Go Protobuf producing invalid wire format. This would be a terrible outcome, so we’ll prevent producing invalid wire format by adding checks to Go Protobuf’s encoder that detect the situation and error out instead.

Plan

Out of scope, but maybe worthwhile at a later point: make lazy extensions available without the protolegacy build tag.

stapelberg commented 1 month ago

There are edge cases where a message can shrink between populating the size cache and encoding it, which would result in Go Protobuf producing invalid wire format. This would be a terrible outcome, so we’ll prevent producing invalid wire format by adding checks to Go Protobuf’s encoder that detect the situation and error out instead.

This check was added in commit https://github.com/protocolbuffers/protobuf-go/commit/94bb78c93b96f99652979c7f02fd5c8dcc765d3d

For readers from the future: If you see this error triggering (e.g. size mismatch (see https://github.com/golang/protobuf/issues/1609): calculated=36, measured=37), the most likely reason for it is accidental sharing and concurrent mutation of a Protobuf message or submessage.

We fixed one case where a submessage was shared (defined in the server struct) between requests, but then each request handler would also modify that same submessage with per-request data. When you concurrently process requests, this results in modifying a protobuf message while marshaling (encoding) it, which is not permitted.

Instead of producing invalid wire format, Go Protobuf will error out. The fix is to change your code to not mutate Protobuf messages while marshaling (encoding).

mehdipourfar commented 1 month ago

After running into this error, I feel like I don't know anything about how a program should work at all. It is so frustrating. My service was working perfectly and after updating it's protobuf version, some rpcs cannot handle concurrent requests anymore. By emptying some random inner message which I'm hundred percent sure won't be mutated during runtime, the errors disappear. I don't remember that I have read anywhere that one should not reuse protobuf messages.

stapelberg commented 1 month ago

Hey Mehdi. Sorry for the trouble you’re running into, I can understand it can be frustrating.

I have investigated a small handful of different programs running into this error, and in 100% of the cases it was a race / concurrency issue. They can be hard to spot, but in the end we always found the issue. I think most likely you’re also dealing with a hard-to-spot concurrency issue.

If the program you’re working on is Open Source and you could share a core dump (you can add a panic to internal/errors to get a good crash), I can take a look — often, the stacktrace already provides enough information. If the stacktrace isn’t sufficient, you can look at the output buffer to see how many bytes it contains and thereby identify where in the message hierarchy the problem is located.

Note that this new check prevents invalid wire format, so before this check, your program would occasionally produce invalid data. I think the new failure mode of erroring is better than silently producing corrupt data.

mehdipourfar commented 1 month ago

Hey @stapelberg. Thank you for your tips. I finally found that some inner messages were mutated in an external library and fixed that.

echarrod commented 1 week ago

Hello, I am experiencing an issue after upgrading google.golang.org/protobuf from v1.34.1 -> v1.34.2, and my search has led me here, hoping you can help since you mentioned sharing a stack trace above @stapelberg 🙏

I have a couple of tests which are failing. THey check that marshalling a custom struct into a protobuf message, and then unmarshalling the same message, produces the same result as the original.

But after this package update, it seems the objects aren't equal, as the object being serialized now has a sizeCache of 1. In the previous version it used to have 0 (and the object being unmarshalled on the other side still has 0).

My code is something like this (seems a bit tricky to create a gist/snippet of it, but I might have to):

// Marshalling
req, err := proto.Marshal(action.RequestPB)
if err != nil {
    return nil, err
}

// Unmarshalling
req2 := proto.Clone(def.Request)
if err := proto.Unmarshal(action.RequestPb, req2); err != nil {
    return nil, err
}

// Comparing
require.Equal(t, req, *req2)

But I get an error:

Diff:
                            --- Expected
                            +++ Actual
                            @@ -4918,3 +4918,3 @@
                               },
                            -  sizeCache: (int32) 1,
                            +  sizeCache: (int32) 0,

I've looked through the Protobuf docs, and since attempted modifying to use UseCachedSize when serializing:

opts := proto.MarshalOptions{UseCachedSize: true}
buf := make([]byte, 0, opts.Size(action.RequestPB))
req, err := opts.MarshalAppend(buf, action.RequestPB)
if err != nil {
    return nil, err
}

...but still get the same error. Do you know if there might there be an easy way to fix this?

I see from those same docs they say:

it is best practice to treat proto.Size results as an upper bound and never assume that the result matches the actually encoded message size. ... so I wonder whether I might have to re-structure my test if this is intended as an "upper bound" 🤔

puellanivis commented 1 week ago

But I get an error:

Diff:
                          --- Expected
                          +++ Actual
                          @@ -4918,3 +4918,3 @@
                             },
                          -  sizeCache: (int32) 1,
                          +  sizeCache: (int32) 0,

Please compare protobufs using github.com/google/go-cmp/cmp and protocmp which will test only the semantics of the protobuf rather than internal bookkeeping fields.

echarrod commented 1 week ago

Ah thanks for the pointer 🙏 I've got it working with:

require.Empty(t, cmp.Diff(req, *req2, protocmp.Transform()))