Add support for unknown fields

tokio-rs / prost

PROST! a Protocol Buffers implementation for the Rust Language

Apache License 2.0

3.95k stars 508 forks source link

Add support for unknown fields #2

Open danburkert opened 7 years ago

danburkert commented 7 years ago

See https://docs.google.com/document/d/1KMRX-G91Aa-Y2FkEaHeeviLRRNblgIahbsk4wA14gRk/view and https://github.com/google/protobuf/issues/272.

per-gron commented 6 years ago

I'm interested in this.

In order to make this work, unknown fields have to be stored in the structs somehow, something along the lines of adding an UnknownFieldSet to every generated message struct. It could be made to be just a null pointer and nothing more unless there are unknown fields.

This would be an API break, since existing struct instantiations would break because of not setting this field. A seemingly sensible way to make this sane is to recommend everyone to instantiate structs like this:

syntax = "proto3";

message Person {
  string name = 1;
}

Person { name: "Per".to_string(), ..Default::default() };

This would work if all generated structs would #[derive(Default)] (or equivalent), which seems like a sensible thing to do in general because Protobufs in general assume that no code breaks from just adding a field. (Not sure how this would interact with proto2 default values?) This also seems compatible with https://github.com/danburkert/prost/issues/43 (#[non_exhaustive]).

For completeness, the parsing code would probably have to be updated to at least not discard groups, even if it's not made capable of actually parsing them.

I don't know if known fields with unexpected type should go into the unknown fields set or not.

Does this make sense?

per-gron commented 6 years ago

I have started hacking on this.

There are some ugly corner cases that I'm not entirely sure how to deal with. For example the well-known types such as BoolValue when represented as their Rust native types obviously won't be able to keep unknown values. The same applies to maps.

danburkert commented 6 years ago

@per-gron thanks for the PR. Could you explain the motivation behind the unknown fields feature? I've read through the upstream docs linked in the first comment, but I don't really understand why it's become such a priority for Google to support this feature. My take is that unknown fields are full of subtle footguns1, and the described usecases are better either not a good fit for protobuf (RMW) or better done through preserving the original encoded message (intermediary servers).

per-gron commented 6 years ago

@danburkert I honestly don't know exactly why either, but I can speculate a bit: It seems like the fact that proto3 is different from proto2 in this regard has made it difficult for a lot of internal projects to adopt proto3. It's an API break that simply makes it scary to upgrade when you have a huge code base, and the difference is not important enough to justify a change of behavior.

I agree that this feature is footgun-prone; yet I think it is better to behave as similarly as possible to the other protobuf libraries than to not have it at all. (It is of course possible to provide some kind of Prost setting to disable the feature for projects that don't care about having the same behavior across languages.)

For me personally, I don't really care much about this particular feature or lack of it, the reason I wanted to work on this is that it seems like Prost is a very nice Rust Protobuf library, and I think it would be good for Rust to have a more "officially endorsed" version that people can trust will work as expected, including across languages. When working with large projects and across different languages, uniformity really helps a lot, even with really small details.

(Note: Even though I work at Google I don't have any special power to "officialize" any particular Rust Protobuf library, but I hope that fixing details like this one so that its behavior is very close to the main Protobuf libraries along with having people from the Rust community agree that Prost's struct-based API is nice will go a long way.)

danburkert commented 6 years ago

Just to set expectations, it's never been a goal of mine to have prost officially endorsed by Google or the upstream Protobuf authors or community. Nor has it been a goal to have feature compatibility with other Protobuf libraries, since there are so many Protobuf features (at least in the upstream C++ implementation). My goals with prost are roughly (and probably omitting some important ones):

Idiomatic generated code
performance
straightforward build system integration
Compatibility with other Protobuf libraries (wouldn't really be protobuf otherwise)
Support the core proto2/proto3 features

The upstream project has done a great job publishing conformance tests which prost uses to ensure compatibility.

As far as adding features for the sake of parity, there's no way I can feasibly do that as a single maintainer, nor do I think I could even maintain such a library if the features were contributed by others. prost is currently 'feature-complete' for the purposes of the application for which I wrote it, so isn't a burning need for me to keep adding features. I'm especially disinclined to add features which I think are harmful to creating flexible applications, and while I haven't completely made up my mind, I tend to think unknown fields fall in that category. I've yet to come across a solid usecase for them that isn't error prone or better solved in a different way.

per-gron commented 6 years ago

Thanks for the information about your priorities with Prost. (I've been on vacation hence my slow reply.)

The reasons that I wrote this PR are 1) there was an open issue about it written by you, 2) unlike JSON/groups/reflection this is something that is or will soon be required by the spec so this is not quite the same as those other features.

Given help with maintenance, would you be interested in expanding the scope of this library to include features that makes it possible to get broader adoption? If not, would you support a fork with that goal?

scottlamb commented 6 years ago

I don't really understand why it's become such a priority for Google to support this feature. My take is that unknown fields are full of subtle footguns1, and the described usecases are better either not a good fit for protobuf (RMW) or better done through preserving the original encoded message (intermediary servers).

I happen to work at Google (not on protobufs!) and know folks who really pushed for this (reversing the proto 2->3 decision to drop unknown field handling). Those two use cases are the reason why.

I'm not sure why you say protobuf is not a good fit for RMW; it's overwhelmingly common for databases to be full of protobuf values which servers do RMW on. It generally works well...but most teams are still using proto 2 (with no plans to adopt proto 3). Some that started using proto 3 got nasty surprises; thus the push.

I see your point intermediary servers could preserve the original encoded message, but there are also reasons it's easier to work with a message field embedded in a message field rather than a bytes field embedded in a message field. Think ease of writing an ASCII message (to check in as test data or to send an RPC by hand on the commandline while debugging), ease of understanding debug representations (inverse of the above), and not having to explicitly have code do additional serialization/deserialization steps / possibly mess up the type of the contained proto.

The oneof footgun doesn't concern me too much, fwiw. It's wrong to have a message that has more than one of the oneof present; inconsistent behavior in which is used should probably be expected. There are bigger footguns involved in proto compatibility (reusing field numbers / changing types of fields is a huge common way to screw up; there are a lot of variations of how to make this mistake and how it can bite), but dealing with backward and forward compatibility can't be avoided in many applications.

Given help with maintenance, would you be interested in expanding the scope of this library to include features that makes it possible to get broader adoption? If not, would you support a fork with that goal?

I'm also interested in the answer here (though I can't commit to significant help myself). I appreciate the hard work you've done on this project. Nonetheless, it's a bit frustrating that there are at least three apparently-commonly-used Rust proto implementations, with none being a superset of the others (much less matching Google's official C++ protobuf implementation). I'd love a path to resolve that, and that probably starts with knowing if the base should be prost or one of the others.

dg-builder commented 1 year ago

Hi, what's the current status for supporting unknown fields? I see multiple attempts over the past couple of years but no progress.

It would be great to rm this file :) https://github.com/tokio-rs/prost/blob/master/conformance/failing_tests.txt

mzabaluev commented 1 week ago

Unknown fields still need to be supported in edition 2023 onwards.

I can think of two reasonable ways to support this:

In-message state

Add a private field to every generated message to be exposed via an .unknown_fields() accessor in Message:

#[derive(prost::Message)]
pub struct MyMessage {
    __prost_unknown_fields: Vec<prost::UknownField>,

    #[prost(string, tag = "1")]
    pub a_field: String,
}

This will effectively make every message non-exhaustive for matching and impossible to construct with struct expressions, so it will need builders like in #901. These can be considered good features with regard to protobuf versioning semantics. This change can be introduced as a build option to preserve backward compatibility.

One disadvantage is an increase in struct size for every message, regardless of need.

Out-of-band data for encode/decode

Add methods to Message to enable round-tripping with unknown fields for users who need them:

pub trait Message {
    // ...

    fn decode_with_unknown(
        mut buf: impl Buf,
    ) -> Result<(Self, UnknownFields), DecodeError>
    where
        Self: Default,
    {
        // ...
    }

    fn encode_with_unknown(
        &self,
        unknown_fields: &UnknownFields,
        buf: &mut impl BufMut
    ) -> Result<(), EncodeError>
    where
        Self: Sized,
    {
         // ...
    }
}

UnknownFields needs to be a recursive dynamic data structure to account for nested messages. It will be hard to use for any purposes other than round-tripping and informational dumping.

I think this could be added in a backward-compatible way by providing a kind of DecodeError that transports unknown fields, and handling it in the in-crate and generated implementations.

caspermeijn commented 1 week ago

I can think of two reasonable ways to support this:

In-message state

I like this approach. It guarantees that the unknown fields are part of the message and it doesn't have to break existing APIs.

Add a private field to every generated message to be exposed via an .unknown_fields() accessor in Message:

It is not necessary to hide this field. We can make it a publicly accessible field as long as it doesn't have a conflicting name. The public field also prevents the struct from becoming non-exhaustive.

One disadvantage is an increase in struct size for every message, regardless of need.

I think it should be an optional feature to save unknown fields. When people are interested, they can enable it in the code generator with a self chosen name.

https://github.com/tokio-rs/prost/pull/574 takes a similar approach as I mention here.

mzabaluev commented 1 week ago

In-message state

I like this approach. It guarantees that the unknown fields are part of the message and it doesn't have to break existing APIs.

Only if it is opted in by matcher settings in Config, otherwise it will break generated APIs, which for prost-build is tantamount to breaking its own API.

Add a private field to every generated message to be exposed via an .unknown_fields() accessor in Message:

It is not necessary to hide this field. We can make it a publicly accessible field as long as it doesn't have a conflicting name.

OK, with a configurable name they can avoid conflicts.

The public field also prevents the struct from becoming non-exhaustive.

A developer who wants their generated Rust bindings to adhere to Protobuf semver semantics would actually want the message structs to be non-exhaustive, so that future field additions do not become breaking changes. But this can be done by other means, such as the #[non_exhaustive] attribute.

I expect the added field to be a nuisance to any user who doesn't care about unknown fields, so they'll match it with .. anyway.

One disadvantage is an increase in struct size for every message, regardless of need.

I think it should be an optional feature to save unknown fields. When people are interested, they can enable it in the code generator with a self chosen name.

One problem with this is, it's yet another setting affecting the generated API that users might disagree on. In an ideal world, a set of proto packages is represented by a single Rust library crate that everybody else can depend on, rather than generating sets of incompatible types for the same thing (a-la prost-types vs. pbjson-types vs...). If the in-struct baggage for unknown fields will prove as unpopular as I expect, developers of such "designated binding" crates would be discouraged from enabling it, meaning that it won't be generally available for binding consumers. A feature gate guarding the struct members is not usually a good solution, because API should not be breakable by enabling feature gates.

mzabaluev commented 1 week ago

I'm leaning towards the out-of-band approach. The general philosophy of Prost, as I see it, has been to make simple things easy and hard things sometimes possible. Most protobuf applications do not care about unknown fields and would not know what to do with them because, well, if they'd know the format of a field, they'd use a proto revision that describes it!

Support for unknown fields is necessary for components that inspect messages, possibly set or modify some known fields, then re-encode the messages to pass them along with any fields they don't recognize. With the parallel, recursive UnknownFields data structure as I sketched it, it would be easy to handle the unknown fields of the top-level message, but modifications in nested message fields would need some painstaking effort to keep the parallel data structure up to date.

mzabaluev commented 1 week ago

API should not be breakable by enabling feature gates.

Instead, prost-build could generate separate struct variants augmented with unknown fields, in an added-on module, and provide conversions between the two:

#[derive(prost::Message)]
pub struct Foo {
    #[prost(message, optional, tag = "1")]
    pub bar: Option<Bar>,
}

#[derive(prost::Message)]
pub struct Bar {}

#[cfg(feature = "unknown_fields")]
pub mod with_unknown_fields {
    #[derive(prost::Message, prost::UnknownFields)]
    pub struct Foo {
        #[prost(unknown_fields)]
        __prost_unknown_fields: Vec<prost::UnknownField>,
        #[prost(message, optional, tag = "1")]
        pub bar: Option<Bar>,
    }

    #[derive(prost::Message, prost::UnknownFields)]
    pub struct Bar {
        #[prost(unknown_fields)]
        __prost_unknown_fields: Vec<prost::UnknownField>,
    }

    // Convert to a `Foo` message struct not augmented with information on unknown fields.
    impl From<Foo> for super::Foo {
        fn from(full_message: Foo) -> Self {
            todo!()
        }
    }

    // Convert from a `Foo` message struct carrying no unknown fields.
    impl From<super::Foo> for Foo {
        fn from(foo: super::Foo) -> Self {
            todo!()
        }
    }

    impl prost::WithUnknownFields for super::Foo {
        type FullMessage = Foo;

        fn with_unknown_fields(
            self,
            unknown_fields: into Vec<prost::UnknownField>,
        ) -> Result<Foo, prost::UnknownFieldError> {
            todo!("check if the unknown field tags are not found among the known fields")
        }
    }
}

So the simple users get their simple structs, and proxy use cases and other sophisticates can do something like use my_proto::with_unknown_fields as my_proto; and deal with the unknown fields and the non-exhaustiveness.