golang / protobuf

Go support for Google's protocol buffers
BSD 3-Clause "New" or "Revised" License
9.78k stars 1.58k forks source link

Nested composite fields sometimes not being marshaled #1287

Closed gudvinr closed 3 years ago

gudvinr commented 3 years ago

What version of protobuf and what language are you using? package: google.golang.org/protobuf v1.25.0 protoc-gen-go: v1.25.0 libprotoc: 3.6.1

What did you do?

II have a system which, among other things, passes messages to multiple receivers. There's simplified version of it, which should be sufficient to show how it works with protobuf messages:

schema.proto ```protobuf syntax = "proto3"; package package; option go_package = "proto.definitions/generated/package"; message Point { sint32 x = 1; sint32 y = 2; } message PointCommand { Point src = 2; } message PointEvent { Point src = 3; } ```
code.go ```go package main import ( "fmt" "io" "net" "google.golang.org/protobuf/proto" pd "proto.definitions/generated/package" ) var watchers map[uint64]chan<- proto.Message type Session struct { writer io.Writer } func (sess *Session) Listen(conn net.Conn) { sess.writer = conn for { // data []byte : read from conn var cmd pd.PointCommand if err := proto.Unmarshal(data, &cmd); err != nil { panic(err) } if cmd.Src == nil { panic(fmt.Errorf("nil src")) } evt := pd.PointEvent{ Src: cmd.Src, } for _, w := range watchers { w <- &evt } } } func (sess *Session) Watch(ch <-chan proto.Message) { for msg := range ch { data, err := proto.Marshal(msg) if err != nil { panic(err) } // send data to sess.writer } } func ListenAndServe(addr string) error { l, _ := net.Listen("tcp", addr) defer l.Close() for { conn, _ := l.Accept() sess := new(Session) { var id uint64 // generate unique id ch := make(chan proto.Message) watchers[id] = ch go sess.Watch(ch) } go sess.Listen(conn) } } func main() { go ListenAndServe(":xxxx") <-make(chan struct{}) } ```

While most of the wrapping code wiped out, the key part is still there: message passed without any modifications after creation and then marshaled separately for different receivers. However, wire data may not have src field despite explicit checking.

User code of course was the main suspect and I ran application with race detector without any error output. In controlled environment, when src set to Point with default values, proto.Marshal produces expected serialization result with zero-length field value.

What did you expect to see?

Empty (but not nil) nested structs should always be marshaled as empty ones.

What did you see instead?

Sometimes marshaled data written to Writer has no serialized src field even though it can't be nil.

puellanivis commented 3 years ago

Note that for scalar message fields, once a message is parsed there's no way of telling whether a field was explicitly set to the default value (for example whether a boolean was set to false) or just not set at all: you should bear this in mind when defining your message types. For example, don't have a boolean that switches on some behaviour when set to false if you don't want that behaviour to also happen by default. Also note that if a scalar message field is set to its default, the value will not be serialized on the wire. [emphasis added] https://developers.google.com/protocol-buffers/docs/proto3#default

Since this is expected behavior and required by the standard, there is not really any chance that we would consider changing this behavior. It is always important to remember that a scalar message field does not have presence detection any more than any other scalar does. You cannot rely upon any semantic difference between nil and &pb.Point{}, even though in Go they are two separate things. In protobuf, the two are semantically identical.

gudvinr commented 3 years ago

@puellanivis message is not a scalar type. Wire type of emdedded message is essentially same as string, byte array or any other repeated value and encoding reference confirms that:

embedded messages are treated in exactly the same way as strings (wire type = 2)

And page that you linked says:

For message fields, the field is not set. Its exact value is language-dependent. The default value for repeated fields is empty (generally an empty list in the appropriate language).

Also keep in mind, that page you linked says about parsing messages when field not found on the wire at all and not about encoding:

When a message is parsed, if the encoded message does not contain a particular singular element

puellanivis commented 3 years ago

While messages are not a “scalar value type”, any field that is not a repeated field is a “scalar message field”. So, the clause provided concerning “The default value for repeated fields…” does not apply in this case, as the message field is not repeated.

I do not know what to do but explain it again: the proto3 standard specifically defines that “not-present” and “default/zero value” are semantically identical. This applies for messages as much as it does for any other non-repeated field.

I understand that in Golang nil and &pb.Point{} are semantically different, and distinguishable. But from the perspective of protobuf, they are still identical. The x field of a (*pb.Point)(nil) message is still 0 and why there is the GetX() receiver method.

Since “a message does not appear” and “a message is empty” are semantically identical, and the former is a shorter encoding, the shorter encoding is done in the wire format.

gudvinr commented 3 years ago

the proto3 standard specifically defines that “not-present” and “default/zero value” are semantically identical

They are semantically identical for scalar types as per documentation.
Scalar types, as per documentation are "one of the following types". And there's no "message" type in this table.

This particular page says:

For message fields, the field is not set. Its exact value is language-dependent.

And this section of documentation is about decoding values from wire when field in question is omitted. This is not applicable to the issue.

But from the perspective of protobuf, they are still identical.

No, you can distinguish between 1A 02 0A 00 which is PointEvent{Point: {}} and 1A 00 which is PointEvent{Point: nil}. Which also means that "empty message", in fact, takes event more bytes on the wire.

Also, from documentation on Go generated code:

Message fields can be set to nil, which means that the field is unset, effectively clearing the field. This is not equivalent to setting the value to an "empty" instance of the message struct.

There's no difference here between proto2 and proto3.

in Golang nil and &pb.Point{} are semantically different

It is the same in generated code for different languages:

dsnet commented 3 years ago

I'm unable to follow the example in the initial post. If the problem is with proto.Marshal then you should be able to reduce the reproduction to a simple pair of proto.Marshal and proto.Unmarshal calls without channels are other other unrelated logic distracting from and obscuring what is expected.

neild commented 3 years ago

While messages are not a “scalar value type”, any field that is not a repeated field is a “scalar message field”.

FYI, repeated fields, map fields, and messages are not scalar. All other field types are scalar.

Both proto2 and proto3 implement presence for message-valued fields, preserving a distinction between an empty message (&pb.M{}) and an absent one ((*pb.M)(nil)) across marshal/unmarshal. (Neither implements presence for repeated or map fields; there is no distinction between an empty list/map and one which is not present.)

gudvinr commented 3 years ago

I'm unable to follow the example in the initial post. If the problem is with proto.Marshal then you should be able to reduce the reproduction to a simple pair of proto.Marshal and proto.Unmarshal calls without channels are other other unrelated logic distracting from and obscuring what is expected.

Honestly, I thought it would be better than answering obvious questions like "are you sure there's no r/w operations without mutexes", "are you sure client packets indeed contain empty message and not missing field" and so on but ok.

Basically, it reads protobuf package from tcp socket and calls proto.Unmarshal

var cmd pd.PointCommand
proto.Unmarshal(data, &cmd)

Then nested field checked for nilness and assigned to new protobuf-generated struct:

evt := pd.PointEvent{
    Src: cmd.Src,
}

Reference to this object wrapped in proto.Message then distributed through channels to receiving end and marshaled using proto.Marshal from multiple goroutines like this:

// var ch <-chan proto.Message
msg := <- ch
proto.Marshal(msg)

In the end, sometimes client gets 1A 00 instead of 1A 02 0A 00 under unknown circumstances.

dsnet commented 3 years ago

This still doesn't help us. What is the value of cmd.Src? You issue seems to be about the PointEvent.Src field being omitted sometimes, but that's entirely dependent on what's being stored into that field. If it's nil, then the fact that the field is omitted is entirely expected.

gudvinr commented 3 years ago

What is the value of cmd.Src?

cmd.Src has type *Point. Point is a struct generated from protobuf schema. It may not be nil since, as I pointed out, nested field checked for nilness before sending to receivers. So, it may contain any combination of two int32 values.

In controlled environment, when cmd.Src == Point{} it never ends up being nil on the wire. But in production we caught errors when nested field can be encoded as nil despite preliminary checks and without modifying operations in-between.

dsnet commented 3 years ago

Again, I'm going to ask that you provide a self-contained reproduction that we can actually run.

gudvinr commented 3 years ago

Again, I'm going to ask that you provide a self-contained reproduction that we can actually run.

I can't provide self-contained reproduction since I can't guarantee reproduction by myself because an issue seem to be concurrency-related and rarely occurs. Only thing I know for sure that there is PointEvent.Src field pointing to PointCommand.Src which is empty but not nil. proto.Marshal then called from multiple goroutines using &PointEvent{...} as proto.Message value.

If you just call proto.Marshal on &PointEvent{Src: &Point{}} it will write 1A 02 0A 00 as you would indeed expect.

dsnet commented 3 years ago

If you are not observing that behavior, then there is nothing we can do without a reproduction.

gudvinr commented 3 years ago

If PointEvent.Src is non-nil, then proto.Marshal will marshal it in such a way that proto.Unmarshal will unmarshal a PointEvent as a non-nil PointEvent.Src.

In my case Marshal and Unmarshal called for different objects of different types so yes, I am observing different behaviour:

I don't think that it's important though.

Is there some trickery inside proto.Marshal that involves usage of unexported fields which might modify struct fields or somehow alter behaviour of marshaling process?
Is it guaranteed that proto.Marshal is goroutine-safe when you call it multiple time in parallel on the reference to same object?

I don't mean to be rude, I just really haven't dug really deep into internals of marshaler but it is kind of important to know.

dsnet commented 3 years ago

In my case Marshal and Unmarshal called for different objects of different types so yes, I am observing different behaviour:

That is some sketchy behavior. I recommend never relying on being able to marshal type T and unmarshal it into a type R. Doing so is asking for compatibility issues left and right. That said, the fundamental problem in this situation is that the field numbers are different between PointComment and PointEvent:

puellanivis commented 3 years ago

I wrote a more minimal version, but it’s still outputing an empty message rather than an absent one. (It appeared that it was empty, because it was just a single message encoding. Which happened to be the sub-message not the top-level message, which is implicit.)

puellanivis commented 3 years ago

OK, so catching up, I find that my code is still relevant, because I get 1A 00 from just this code:

package main

import (
    "fmt"

    pb "github.com/puellanivis/empty-pb-message/proto"

    "google.golang.org/protobuf/proto"
)

func main() {
    src := &pb.Point{
    }

    evt := &pb.PointEvent{
        Src: src,
    }

    data, err := proto.Marshal(evt)
    if err != nil {
        fmt.Println("!!!", err)
        return
    }

    fmt.Printf("% X\n", data)
}

So, the 1A 02 0A 00 is probably the unexpected value?

gudvinr commented 3 years ago

So, the 1A 02 0A 00 is probably the unexpected value?

Not at all, since src is not nil, it should be 1A 02 0A 00 and not 1A 00 as per documentation.

puellanivis commented 3 years ago

Not at all, since src is not nil, it should be 1A 02 0A 00 and not 1A 00 as per documentation.

If PointEvent.Src = nil then it prints nothing. 1A 00 is: field 3, zero length. Which makes sense because it has no non-default values.

puellanivis commented 3 years ago

While messages are not a “scalar value type”, any field that is not a repeated field is a “scalar message field”.

FYI, repeated fields, map fields, and messages are not scalar. All other field types are scalar.

Both proto2 and proto3 implement presence for message-valued fields, preserving a distinction between an empty message (&pb.M{}) and an absent one ((*pb.M)(nil)) across marshal/unmarshal. (Neither implements presence for repeated or map fields; there is no distinction between an empty list/map and one which is not present.)

I appear to have gotten confused, because rendering just a single embedded Point message in a PointEvent message gave me a single zero-length message. I didn’t recognize that this was the embedded Point not the surrounding PointEvent. 🤦 I would like to blame not having my coffee, but I think it was well into the afternoon by that point. 😆

dsnet commented 3 years ago

Closing as inactionable. Mismatching types when serializing and deserializing is going to run into strange effects sooner or later. Furthermore, the effects of this is more driven by the protobuf wire format than it is by a specific implementation. As such there's isn't anything specifically actionable on our end.

gudvinr commented 3 years ago

Mismatching types when serializing and deserializing is going to run into strange effects sooner or later.

I think there's some misunderstanding here. You mentioned earlier that

I recommend never relying on being able to marshal type T and unmarshal it into a type R.

This implies that conversion between T and R happens through wire format but it isn't the case. I am not unmarshaling into T from binary representation of R.
Conversion happens after unmarshaling from type T by copying fields to R using generated code and then object of type R marshaled into its binary representation.

Basically, this:

var T PointCommand
proto.Unmarshal(/* data */, &cmd)
R := pd.PointEvent{Src: cmd.Src}
proto.Marshal(&R)
neild commented 3 years ago

Is it guaranteed that proto.Marshal is goroutine-safe when you call it multiple time in parallel on the reference to same object?

Yes, so long as that object is not modified during marshaling.

Is there some trickery inside proto.Marshal that involves usage of unexported fields which might modify struct fields or somehow alter behaviour of marshaling process?

Marshaling can modify internal state for messages containing extensions, but it does so in a concurrency-safe fashoin.

If you think there's a problem here, we really need a test case that we can run that demonstrates the issue.