Recommend proto3 - Githubissues

MarcoPolo commented 1 year ago

There seems to have been a misunderstanding in the past around proto2 vs proto3. My attempt here is to clear up the confusion, recommend proto3 in general, and explain why proto3 should be preferred.

Our main confusion is about field presence. That is, if a field is omitted from the serialized wire format does the user of the decoded message know the difference between if the field was unset or set as the default value. This document has a lot of good information and is worth the read: https://github.com/protocolbuffers/protobuf/blob/main/docs/field_presence.md

Origins of the confusion

Proto2 would always serialize an explicitly set field, even if it was set to the default. This meant that you could know on the decoding side whether the field was set or not. This is called Explicit Presence. For example, in the Rust protobuf compiler, it would wrap these in Option<T>: https://github.com/tokio-rs/prost#field-modifiers.

The confusing thing is that the language guide for proto2 states:

A well-formed message may or may not contain an optional element. When a message is parsed, if it does not contain an optional element, accessing the corresponding field in the parsed object returns the default value for that field.

The subtlety here is that this doesn't say anything about "hasField" accessors. Which may be provided by the implementation to check if the field was set or not. This is essentially with prost is doing with Option<T> types.

Another confusing thing is that this language guide doesn't mention "presence" a single time. Which is what we're talking about here.

In proto3, if a field was set to its default value it would not be serialized. This meant that the decoding sided wouldn't know if the field was omitted because it was unset or because it was the default value. This is called No Presence.

Field Presence Proto2 vs Proto3

To clarify field presence in proto2 vs proto3:

From https://github.com/protocolbuffers/protobuf/blob/main/docs/field_presence.md#presence-in-proto2-apis

Proto2

Field type	Explicit Presence
Singular numeric (integer or floating point)	✔️
Singular enum	✔️
Singular string or bytes	✔️
Singular message	✔️
Repeated
Oneofs	✔️
Maps

Proto3

Field type	`optional`	Explicit Presence
Singular numeric (integer or floating point)	No
Singular enum	No
Singular string or bytes	No
Singular numeric (integer or floating point)	Yes	✔️
Singular enum	Yes	✔️
Singular string or bytes	Yes	✔️
Singular message	Yes	✔️
Singular message	No	✔️
Repeated	N/A
Oneofs	N/A	✔️
Maps	N/A

Advantages in Proto3 compared to Proto2

No required modifier
- This is generally considered an anti-pattern since all future versions of this message will need to contain this field. Generally users should prefer custom validation.
Opt-in explicit presence
- It's good to be able to get the space advantages of no-presence while still being able to opt-in to explicit presence. If we pass in an empty byte array most of the time this is semantically the same as passing no byte array, so it's nice to avoid paying the byte-cost for this.
- But if we do want explicit presence we can opt in to it. This is useful in case we do semantically care about knowing if there was nothing set.
Simple feature set: "The reason for removing these features is to make API designs simpler, more stable, and more performant. " from https://cloud.google.com/apis/design/proto3
Better ecosystem support. As libraries develop, it's likely they will support the latest protobuf spec rather than continue supporting proto2. This is already the case with protons, the compiler that JS-IP uses (see this bug).
User-defined default value for fields is no longer available. This was somewhat tricky to get right.

Next steps

Come to mutual understanding around proto2 vs proto3.
Come to consensus around recommending proto3.
Make the change to README.md

marten-seemann commented 1 year ago

In proto3, if a field was set to its default value it would not be serialized. This meant that the decoding sided wouldn't know if the field was omitted because it was unset or because it was the default value. This is called No Presence.

Isn't that potentially a problem? If I explicitly set a field, and the value just happens to coincide with the default value, I'd want it to be serialized, no?

EDIT: I might be misunderstanding, does this only apply to non-optional fields? That would make sense then, since if a field is non-optional (= required), there's no way it could have been left empty.

achingbrain commented 1 year ago

does this only apply to non-optional fields?

Yes - more correctly it applies to singular fields, the default type of field - see the Specifying Field Rules of the proto3 spec.

On deserialization to their object form both singular and optional fields are set to their default values if no value was present on the wire.

On serialization optional fields write any value that was set even if it was the default (and no value if one was not set), singular fields only write out a value if the value set was not the default.

Consequently optional fields let you know if the field was set, singular fields do not.

MarcoPolo commented 1 year ago

On serialization optional fields write any value that was set even if it was the default (and no value if one was not set), singular fields only write out a value if the value set was not the default.

yup: here are some example tests: https://github.com/MarcoPolo/proto2and3-playground/blob/main/src/main.rs#L122. proto3 optional bytes behave exactly like proto2 optional bytes.

MarcoPolo commented 1 year ago

In proto3, if a field was set to its default value it would not be serialized. This meant that the decoding sided wouldn't know if the field was omitted because it was unset or because it was the default value. This is called No Presence.

Isn't that potentially a problem? If I explicitly set a field, and the value just happens to coincide with the default value, I'd want it to be serialized, no?

EDIT: I might be misunderstanding, does this only apply to non-optional fields? That would make sense then, since if a field is non-optional (= required), there's no way it could have been left empty.

A couple of things that may help clarify (I'm talking about proto3 here):

The default value is type defined, not user defined. For example, an empty array of bytes is the same as not setting the bytes. I believe in most cases this is semantically the same. But in the cases where you do need to differentiate this you can use the optional modifer.
Fields without the optional modifer are not required. This is a little confusing, so I'll elaborate. Proto2 had a notion of "required" fields. This meant that if this field wasn't present, decoding would fail. This is not the case with Proto3. Proto3 doesn't have a notion of "required" fields that will cause decoding to fail. This is on purpose and a good thing. If a non optional field is missing in the message, proto3 assumes it's the default value for that type. So for a byte array the default value would be an empty byte array or a string would be an empty string. For a nested message we do "explicit presence" regardless if the field has an optional modifier (see the table above).

if a field is non-optional (= required), there's no way it could have been left empty.

With the above, hopefully it's clear there is no "required" notion in proto3. A field that is not marked optional could have been left empty. This is usually fine, but if your program depends on knowing if the field was set vs unset it should mark the field as optional in the protobuf.

vyzo commented 1 year ago

Not failing decode on missing required fields is not a good thing, now you need extra logic in the client for this and a giant footgun.

MarcoPolo commented 1 year ago

Not failing decode on missing required fields is not a good thing, now you need extra logic in the client for this and a giant footgun.

You probably already have some logic in the client side checking if the value is even appropriate. Having required fields makes backwards/forwards compatibility hard because a required field is forever. This isn't just my opinion, there's a lot written about this:

Required Is Forever You should be very careful about marking fields as required. If at some point you wish to stop writing or sending a required field, it will be problematic to change the field to an optional field – old readers will consider messages without this field to be incomplete and may reject or drop them unintentionally. You should consider writing application-specific custom validation routines for your buffers instead.

A second issue with required fields appears when someone adds a value to an enum. In this case, the unrecognized enum value is treated as if it were missing, which also causes the required value check to fail.

from: https://developers.google.com/protocol-buffers/docs/proto#specifying-rules

And more:

https://stackoverflow.com/questions/31801257/why-required-and-optional-is-removed-in-protocol-buffers-3
https://github.com/protocolbuffers/protobuf/issues/2497#issuecomment-267422550 (note this is from 2016, so it's a bit out of date since proto3 now has optional).

mxinden commented 1 year ago

Thanks @MarcoPolo for the research and the elaborate description.

One thing to note is that proto3 only supports optional since protoc v3.15.0. Users may use an older protoc version. E.g. Debian bullseye ships with v3.12.4. I don't think we should give this much weight. In other words, I don't think this is an argument against proto3. I am still raising it here so we can help users that run into it. (The error message with protoc <v3.15.0 does not make this obvious.)

I am fine with libp2p moving to proto3.

I am sorry for being the source of the confusion on presence in proto2 and proto3.

marten-seemann commented 1 year ago

I am sorry for being the source of the confusion on presence in proto2 and proto3.

No need to be sorry. This is not easy to reason about.

Debian bullseye ships with v3.12.4.

I don't think we should pay a lot of attention to what Debian does. I don't see any good justification for their focus on "stability" (aka outdated software). For example, they ship with Go 1.15, which was released in Aug 2020 and has been unmaintained for more than a year now.

I'm wondering if we should move all of our existing protobufs to proto3 as well. It would be nice to be consistent across our entire stack, and we could get rid of proto2 compiler dependencies. We'd have to check that this can be done in a backwards-compatible way in all our protocols though.

achingbrain commented 1 year ago

This is not easy to reason about.

It certainly is not. I discovered the other day that the official protobuf.js module doesn't handle default values properly when deserializing "singular" fields so even Google don't get it right sometimes and it's their spec.

I'm wondering if we should move all of our existing protobufs to proto3 as well

I've been doing this with the js stack as we took the decision to only support proto3 in protons and it's mostly been ok.

One oddity is when a proto2 field has been marked as required, and you are sending a message to a peer that will use a proto2 decoder it needs a value on the wire. The only way to ensure this happens in proto3 is to mark the field optional as if it's singular the value will be omitted if it's the default value.

🤪

thomaseizinger commented 1 year ago

Is it correct to think about it in the following way:

Every type in proto3 has a default value. Unset fields in message will show up as their default value upon deserialization.
optional fields have "None" as a default value (like Rust's Option or Haskell's Maybe).

achingbrain commented 1 year ago

Every type in proto3 has a default value. Unset fields in message will show up as their default value upon deserialization.

Yes, but with one gotcha that the default value for Message fields is for them to be unset, the exact value of which is language-dependent.

optional fields have "None" as a default value (like Rust's Option or Haskell's Maybe).

According to the spec all fields should be set to their default value upon deserialization (even optional fields).

If the field is marked optional you should be able to check if it was explicitly set - how you do that varies by language and even by protobuf implementation within the language.

If the field is singular (the default) you cannot check if it was explicitly set.

See Specifying Field Rules.

libp2p / specs

Recommend proto3 #465

Origins of the confusion

Field Presence Proto2 vs Proto3

Proto2

Proto3

Advantages in Proto3 compared to Proto2

Next steps