proto3 and unknown fields

joshuarubin commented 9 years ago

I know that unknown fields have been removed from proto3, but I am trying to get an explanation about why this change was made and if there is any way to replicate that behavior in proto3.

Thanks so much.

referred from golang/protobuf#25

dhendry commented 9 years ago

I too am wondering about this. I am looking into migrating what is essentially a messaging system to gRPC (where proto3 seems to be recommended). In my case, clients send messages (text plus rendering information) to each other via a server where the server needs to understand the text and certain parts of the rendering info. I want to allow client developers to experiment with new features (pre-release) without having to deploy server code for every change.

Essentially, it's a case where I want a shared proto definition between the client(s) and server, but don't want to require the server proto definition to be the latest to process requests.

solicomo commented 9 years ago

I'd like to hear about the explanation, too.

The behavior of proto2 makes sense to me.

jeremyong commented 8 years ago

I have a lot of concerns about silently deleting data upon deserialization, to the point that even though we have internally been using proto3 for several months, I am considering changing things back to proto2. This change would be a lot easier to stomach if there was a message option to allow serialization and deserialization of unknown fields instead of discarding them.

jeremyong commented 8 years ago

Being unable to add unknown fields that persist is also unacceptable for us. Reading the code, it's pretty clear the decision to omit unknown fields happens at compile time rather than at runtime (based on the generated code), so it seems proto3 is a no-go. Personally, I very much liked most of the changes to the new version except this one. Changing the default behavior alone might have been ok, especially given that the new behavior is well-documented, but doing so without a way to restore old behavior seems like a misstep. Supporting a plugin that reverts that behavior seems too expensive relative to the cost of just using proto2 with restrictions (optional only, etc).

dhendry commented 8 years ago

Still no answers to this? This is a fundamental issue which is seriously hindering our adoption of protobuf in many areas.

jeremyong commented 8 years ago

+1 proto2 is a permanent fixture for us. Changing default behavior is one thing but changing it in a way that doesn't let the user even control it is a strict loss in my opinion. What I foresee moving forward is a huge fragmentation in the client ecosystem. Maintaining support for both proto2 and proto3 semantics is too much to chew for most developers, and I'm already seeing some client libraries do this awkward dance where they have some proto2 properties and some proto3 properties. The easiest example of this causing a problem in history is the move from Python2 to Python3. One possible solution might be a file level option that informs the protobuf compiler not to strip unknown fields.

liujisi commented 8 years ago

The proto3 spec doesn't forbid preserving unknown fields. Instead, it allows implementations to choose whether to preserve unknowns. The current C++/Java implementations chose to drop the unknowns, though. We are currently looking into the issue and will keep this thread posted.

jeremyong commented 8 years ago

Thanks @pherl for providing the update. FWIW, I think it is worth considering how the behavior might be standardized, for the same reason people argue against undefined behavior in C or C++. Undefined behavior, where it exists, should really be due to a lack of foresight, but for something like this, we might as well come up with an actual solution since we're already aware of the problem.

joshuarubin commented 8 years ago

Thanks for keeping this issue alive. I'd just like to add that we are interested in support for Go, but that might need to be addressed in golang/protobuf.

jeremyong commented 8 years ago

@pherl Any progress on this front?

gfecher commented 8 years ago

+1 for preserving unknown fields.

I accept that you can not trivially maintain compatibility with the JSON format (at least as long as you want to marshal fields with their names), but I think a lot of shops would be happy to pay this price for not having to release their low-level infrastructure in lock step with their newest clients.

In fact Kenton seems to wonder himself (https://capnproto.org/news/2014-06-17-capnproto-flatbuffers-sbe.html): "Apparently, version 3 of Protocol Buffers, aka “proto3”, removes this feature. I honestly don’t know what they’re thinking. This feature has been absolutely essential in many of Google’s internal systems."

In my opinion the right approach would be to make this an option of the proto compiler on compiling the proto: this way everybody can decide for themselves whether the benefits outweigh the downsides.

For now I have overridden the PreserveUnknownFields function in both cpp_helpers.h and java_helpers.h in the compiler code to always return true and this seems to work, but I would appreciate it if someone from Google could confirm.

xfxyjwf commented 8 years ago

Some updates: we tried to gather data to prove "unknown fields are essential for Google systems", but the result is not so convincing (the experiment is done in a Google sub-system, not the whole of Google).

For those of you who are interested in adding back unknown fields in proto3, could you describe your use case in more detail and explain why unknown fields are required (e.g., can the same use case be supported using some other proto3 features)? We need to prove unknown fields are needed in some common use cases in order to add them back.

jeremyong commented 8 years ago

Here is a use case I developed internally that makes heavy usage of unknown fields:

In addition to the message itself, we often annotate the message before sending it over the wire with metadata indicating if a field was deleted or not, if it was set to a default field, etc. Internally, we use a diff-ing scheme to create a protobuf message "diff" which handles maps, fields, and messages (recursively applied). The application of the diff itself is associative, so many diffs can accumulate into one, and this makes for a fairly elegant scheme for updating state for a particular message across many clients that may or may not be online.

Generalizing this use case, any protobuf message that is derived from the reflection API must necessarily leverage the unknown field set, since by definition, we cannot know the shape of the message a priori. Think of this as a "higher order message" whereas messages that are schema defined are first order messages.

gfecher commented 8 years ago

Hi,

We have a use case with a mixture of data validation/data transformation and storage. Our infrastructure component understands certain bits of the schema that it validates/changes, but it is oblivious to the rest of the payload. It does store it, however, and clients running on the new schema expect the newly introduced fields to be returned intact.

In general, any component could benefit from preserving unknown fields where only a partial understanding of the message is needed, especially where the bits the component does care about do not change often, but the rest of the schema does. I can think of routing, storage, certain types of data transformation, etc.

I would be interested in knowing how you managed to solve these use cases (which I'm sure you have internally at Google) without preserving unknown fields.

InfinitiesLoop commented 8 years ago

We need unknown fields, because it's one of the ways we know on the server-side that our proto definition is out of date, and needs to be re-synchronized. Without unknown fields, we would have to resort to polling or some other less authoritative way of detecting when the client has added fields.

Also, while I understand trying to reduce feature surface area, unknown fields don't exactly cause a problem, do they? Dropping them has more negatives than positives; please add them back to proto3.

JesseChisholm commented 8 years ago

If the proto3 way was to set some option, like option (ProtoOptions).preserveUnknownFields = True; that would allow those of us who need it to keep it and those of you who don't need it to do without it.

Best of both worlds. :)

dhendry commented 8 years ago

I would absolutely want the ability to preserve or strip unknown fields at runtime. There are levels of our system which get deployed regularly, are kept up to date, and should be validating the well-known schema (and stripping unknown fields), but there are other internal layers which get deployed far less frequently and are not directly exposed to clients or potentially malicious actors, where preserving unknown fields is highly desirable so we don't have to do full and extensive deploys for every little change.

rohitsaboo commented 8 years ago

Hey guys,

We would love to have this feature, too :) During my relatively long time at Google, I was aware of many services that relied on this behavior from proto2.

Essentially, think of any set of three or more services where A talks to C via B, and we don't want to redeploy B when a proto that is being passed between A and C gets a new field added to it. (I also posted this as a question on stackoverflow.)

Would be great to have an update for supporting this feature and/or an alternative mechanism that you believe can solve this problem for us.

Thanks, Rohit

jeremyong commented 8 years ago

Still no word on what the original justification was too.

Kaiserchen commented 8 years ago

The use-case we have is the following:

We use stream processors, namely kafka-streams, that rearrange protobuf messages. For example, we have 2 streams of protobuf messages that we join with each other. The join will just output a joined message having the two others as fields. Sometimes we also aggregate streams into lists of messages from previous streams. The stream processors only know about the fields relevant for them (join fields, group-by fields ...); all the other fields are carried along as unknown fields.

This allows the stream processor to continue working even when upstream schema changes happen, we do not need to redeploy our stream processing application, and the new fields end up in the output for free.

To add some drama: I think losing the unknown fields will force us to move to Avro.

matthewrj commented 8 years ago

This is a bit of a deal breaker for us too. We have the same use case where A sends data to B which reads some fields and forwards the message to C. We don't want to have to constantly update B when the schema changes even though it doesn't read any of the new fields. The current behaviour is quite dangerous since C can't tell if one of the new fields was set to the default value or if B is just out of date and lost data.
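
A minimal sketch of that failure mode, assuming a hand-rolled reader for just the varint and length-delimited wire types rather than any real protobuf runtime. The helper names (`encode_field`, `parse`, `relay`) and the field numbers are hypothetical, chosen only to illustrate what an out-of-date B does when it re-serializes a message:

```python
# Hand-rolled subset of the protobuf wire format (wire types 0 and 2 only).
# Not a real protobuf runtime; all names here are illustrative.

def encode_varint(value):
    """Encode a non-negative int as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf, pos):
    """Return (value, next_position) for the varint starting at pos."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def encode_field(number, wire_type, value):
    """Encode one field: tag, then a varint (type 0) or length + bytes (type 2)."""
    tag = encode_varint((number << 3) | wire_type)
    if wire_type == 0:
        return tag + encode_varint(value)
    return tag + encode_varint(len(value)) + value

def parse(buf):
    """Split a serialized message into (number, wire_type, value) triples."""
    fields, pos = [], 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 7
        if wire_type == 0:
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:
            length, pos = decode_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError("wire type not handled in this sketch")
        fields.append((number, wire_type, value))
    return fields

def relay(buf, known, preserve_unknown):
    """B re-serializes the message, keeping unknown fields only if asked to."""
    return b"".join(
        encode_field(number, wire_type, value)
        for number, wire_type, value in parse(buf)
        if preserve_unknown or number in known
    )

# A sends field 1 (known to B) and field 2 (added after B was deployed).
from_a = encode_field(1, 2, b"hello") + encode_field(2, 0, 7)
dropped = relay(from_a, known={1}, preserve_unknown=False)
kept = relay(from_a, known={1}, preserve_unknown=True)
```

In the dropping mode, C receives a message in which field 2 is simply absent, which on the wire looks the same as A never having set it.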

InfinitiesLoop commented 8 years ago

Would really appreciate an update on the feedback here. Whether Proto3 is going to ever support unknown fields can impact decisions being made even for folks still on Proto2, because if it isn't, we may need to invent other ways of solving our problems in order to avoid rearchitecting things when/if we move to proto3.

chmod007 commented 7 years ago

I have two use cases, both of which have sub-optimal workarounds:

1) Include a signature in the same protobuf as the payload to be signed. To verify the signature, I deserialize, extract and remove the signature, reserialize and verify the signature. This breaks if the signed message contains any new fields unknown to the process verifying the signature. The workaround is to serialize in two levels, with the inner (signed) message serialized as bytes in the outer message.

2) A server is the ultimate source of small update packets that are then routed peer-to-peer. Unserializing and reserializing before passing the message on to other peers strips out unknown fields. The workaround is for peers to share the original bytes instead of deserializing and reserializing.
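
The two-level workaround from the first use case can be sketched as follows. This is only an illustration under stated assumptions: HMAC-SHA256 stands in for whatever signature scheme is actually used, the envelope layout (field 1 for the inner bytes, field 2 for the signature) is hypothetical, and the encoding helpers are hand-rolled rather than taken from a protobuf runtime:

```python
import hashlib
import hmac

def encode_varint(value):
    """Encode a non-negative int as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf, pos):
    """Return (value, next_position) for the varint starting at pos."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def bytes_field(number, payload):
    """Encode one length-delimited (wire type 2) field."""
    return encode_varint((number << 3) | 2) + encode_varint(len(payload)) + payload

def read_bytes_fields(buf):
    """Parse an envelope made only of length-delimited fields into {number: bytes}."""
    fields, pos = {}, 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        if tag & 7 != 2:
            raise ValueError("this sketch only handles length-delimited fields")
        length, pos = decode_varint(buf, pos)
        fields[tag >> 3] = buf[pos:pos + length]
        pos += length
    return fields

def sign(inner, key):
    """Wrap already-serialized inner bytes (field 1) with their signature (field 2)."""
    signature = hmac.new(key, inner, hashlib.sha256).digest()
    return bytes_field(1, inner) + bytes_field(2, signature)

def verify(envelope, key):
    """Check the signature over the inner bytes exactly as received."""
    fields = read_bytes_fields(envelope)
    expected = hmac.new(key, fields[1], hashlib.sha256).digest()
    return hmac.compare_digest(expected, fields[2])
```

Because the verifier never parses and re-serializes the inner message, fields it does not know about cannot change the bytes being checked.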

acozzette commented 7 years ago

One thing to keep in mind is that proto2 is not going away. We are still actively improving it and plan to keep doing so indefinitely, so proto2 is still a good choice if you have a use case that depends on unknown fields. The one main drawback is that a few languages (such as C# and Ruby) are currently proto3-only, but if you're not using those languages then that's not a problem.

@chmod007 , have you thought about using proto2 for your two use cases? Is that possible or do your schemas have to be proto3 for another reason?

Xorlev commented 7 years ago

I'll add a few use cases.

We have a gRPC service proxying RPC traffic. It would be awfully nice to not have a hard requirement to deploy the proxy first upon schema changes in any of the services it proxies.
We also maintain stream processing services which are processing protos from other parts of the organization. If they add a field, I'd prefer that field doesn't disappear unexpectedly just by flowing through our stream processor. There are some pretty awful documentation / tooling / coupling implications of needing to redeploy stream jobs any time upstream producers evolve their schema. Depending on any cycles in data flows, there may be no topological order that produces valid schema updates without doing a 2-step deploy: 1) upgrade proto schema, redeploy all the (many) things that might rely on it 2) update producer to fill in field, deploy producer. Pray all the systems were updated.

re: proto2 vs. proto3, it's kind of annoying to mix and match. It's pretty counterintuitive to only use proto2 to maintain unknown fields, but have proto3 definitions for gRPC servers. I agree with most of the design choices in proto3 (e.g. removing optional/required fields, map types), but not this.

I'd actually been unaware proto3 removed unknown field support until I expected it to maintain an unknown field and it didn't (and came to report it as an issue). I'd touted unknown field support as a huge selling point for protobufs when we'd first implemented them.

The protobuf website originally recommended that new projects use proto3, which is why we'd adopted it, but this is a pretty huge issue for us. We'll likely be forking the compiler similarly to @gfecher as the proto3 ship has long since sailed and this behavior is very important to helping us produce robust infrastructure.

stevvooe commented 7 years ago

@pherl @xfxyjwf Do you have suggestions for how to work around this with proto3? If this was removed, what techniques were used to avoid requiring this pattern within Google?

As far as I see it, this was the chief benefit of protobuf:

+----------+                        +----------+
|          |   +----------------+   |          |
|          |   |                |   |          |
| Producer +--->  Intermediate  +---> Consumer |
|          |   |                |   |          |
|          |   +----------------+   |          |
+----------+                        +----------+

Producer and Consumer could be updated with new fields, while intermediate can remain on the same version. If intermediate is a proxy of sorts, then this is important.

jeremyong commented 7 years ago

@stevvooe We've been continuing to use proto2 for the intermediate proxy type thing since they are binary compatible. Throughout our codebase, we've been propagating proto2 everywhere since it's really annoying to maintain two different semantics for the proto definitions themselves but if you wanted, producer and consumer could use proto3.

I do have some plans eventually to do a separate C++ compiler entirely that consumes proto3 syntax but retains the API of the unknown fields unless someone else gets to it first. I want to do other changes like using more STL containers (vectors and maps) as the backing in-memory storage and fix the oddities with the arenas we've been seeing.

liujisi commented 7 years ago

@stevvooe one possible solution for the intermediate is to preserve the raw payload (if it doesn't need to update the fields). We could also introduce language specific parsing APIs to preserve the unknown fields for such cases.

matthewrj commented 7 years ago

I have exactly the same situation as @stevvooe. In my case the intermediate does update some fields. Is there any workaround for when the intermediate does update fields?

stevvooe commented 7 years ago

@pherl Thank you for the response!

"@stevvooe one possible solution for the intermediate is to preserve the raw payload"

This is what proto2 did, automatically, and allowed updates.

It seems like I could create a gogo plugin (or a patch for gogo) to preserve the unrecognized data.

Xorlev commented 7 years ago

@pherl Thanks for the response!

Do you have any insight into why unknown fields were removed in proto3? Was it to put a nail in the coffin of extensions? I'll admit, unknown fields make it harder to have deterministic serialization, but the introduction of map<> types has similar faults. That said, if your message has a map it's known to be potentially non-deterministic whereas unknown fields made it a message instance by message instance question.

Even still, an option along the lines of option java_allow_unknown_fields or option cpp_allow_unknown_fields (or a per-message option) would be my ideal resolution here, as it makes it a language-specific problem to support unknown fields and makes it quite explicit in your proto whether it's used or not. A linter can help prevent use of these protos in situations such as a proto being used as a join key.

The presence of those options also serves as documentation that that behavior is not handled by default.

I don't want to maintain my own fork of protoc going forward, so it's certainly in my selfish interests to add a user-accessible switch to the mainline compiler. I also realize there may have been good reasons for removing them and I'd be interested in hearing those. I recognize that adding any additional switches to such a prolific project has definite implications going forward as well.

liujisi commented 7 years ago

Thanks for the feedback. @acozzette is looking into this issue and will keep this thread posted. Potentially we would introduce some APIs to optionally preserve the unknowns in proto3.

jeremyong commented 7 years ago

Yes please! I had already begun forking the compiler and this will save me a ton of time.

stevvooe commented 7 years ago

@pherl For the most part, I think we have a way out for the docker use cases. It would just be good to get a clear understanding of the design decision. I am sure there is a good reason, but I am having trouble inferring. Even more so, does this reasoning apply to our use case, as in, are we doing something "bad"?

liujisi commented 7 years ago

The original motivation is to let the language implementation decide whether to preserve unknown fields, i.e. the spec does not require that implementations preserve unknowns. This simplifies implementations and enables a struct-like API. There's nothing wrong with preserving unknowns.

jeremyong commented 7 years ago

If I'm not mistaken, that's simply not consistent with what the documentation has said which explicitly states "removal of unknown fields" as a "feature" of the proto 3 spec. Either way, glad it's being looked at.

fducat commented 7 years ago

Once again, we are gonna recommend internally in my company proto2 due to this lack of unknown field propagation in proto3. We have a major use case in which a dozen Backends are communicating together via the same message, each intermediate computing a little part of the content. The interest of using unknown fields is simply development efficiency by removing team dependencies. Usually one or two BE in the row are interested in the change. Forcing all 12 to update the version in coordination is what we cannot afford.

Having an option would just be the most flexible solution by far, and this for all languages please since we need it at least for C++ Java and Python...

mark-e-hoffman commented 7 years ago

@pherl Hi, is the option to re-introduce the preservation of unknown fields actively being considered? Our company has recently adopted proto3 (with no proto2 legacy) under the false assumption that unknown fields were retained. We may have to fall back to proto2 if there will not be a path to optionally support unknown fields in the near future. Any feedback would be appreciated.

JemDay commented 7 years ago

Just to nudge this issue again.

@pherl - You mentioned in late November that you were looking at the possibility of exposing APIs to allow unknown fields to be preserved. Have there been any decisions regarding this?

Much like other contributors on this thread we're on the verge of moving back to proto2 but would prefer not to go through that exercise if at all possible.

liujisi commented 7 years ago

Hi Jem,

The plan is:
1) Prepare a doc listing the rationale for dropping unknown fields.
2) Collect use cases where unknown fields are needed; brainstorm, go through the use cases, and figure out workarounds/alternatives that do not require adding unknown fields back.
3) If the alternatives in (2) do not work, or if there's no workaround, we will then preserve the unknowns.

Currently we are on (1) and (2). Will share the docs when they are ready.

JemDay commented 7 years ago

@pherl - Thanks for the response, glad to hear you guys are taking a look at this.

One of our use-cases is very similar to the one described by @stevvooe where intermediate processes in an invocation chain are decorating messages as they pass through.

I hate to ask (because I hate it when people ask me!), but do you have a sense of when you might have further information to share?

kd8azz commented 7 years ago

I also arrived at this page while debugging a missing unknown field. My use case is a proto that looks like this:

message PluginData {
  SomeData field_a = 1;
  SomeOtherData field_b = 2;
  // Remaining fields are available for plugin-specific implementations.
}

The design, here, was explicitly to allow the users of the API to pass through their own fields. The API is the storage layer, and the user provides both a client and a backend plugin. I considered using Any for this, but my experiences with Any on other projects have led me to consider it harmful. Given that no one but the client needs to understand the other fields, and the client has their proto definition, passed-through unknown fields seemed like the ideal solution.

Any update on an ETA for rationale?

EDIT: I decided it may be useful to clarify two more points:

The experiences I had with Any that led me to consider it harmful centered around receiving data from clients. Some of our tooling wanted to validate the Any on receipt, which made the semantic "Any of the protos you had at compile time" rather than "Any possible proto". Given that that system also wanted dynamically-defined data, this was a nightmare.
In my use case above, the client already knows what type they want, so the string type_url in the Any is a waste of I/O. As a result, I considered adding a plain bytes field to my PluginData proto, which the client would then operate on, assuming it was their type. However, the only difference between that and using unknown fields, on the wire, is that the byte field approach has an extra tag and length stanza, which again, is a waste of I/O, the only benefit being that it would allow me to add more fields to my parent proto in the future. I decided to solve that by reserving a field or two for later.
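
A rough sketch of the size comparison in the second point, assuming the standard wire encoding (varint tags and lengths) and the usual layout of Any (type_url as field 1, value as field 2). The function names and the field number 3 used below are hypothetical; passed-through unknown fields add nothing beyond the plugin's own encoded fields:

```python
def varint_size(value):
    """Number of bytes a non-negative int occupies as a base-128 varint."""
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size

def bytes_wrapper_overhead(field_number, payload_len):
    """Extra bytes from wrapping a payload in a plain bytes field: tag + length."""
    tag = (field_number << 3) | 2  # wire type 2 = length-delimited
    return varint_size(tag) + varint_size(payload_len)

def any_wrapper_overhead(field_number, type_url, payload_len):
    """Extra bytes from wrapping the same payload in an Any-typed field."""
    url_len = len(type_url.encode("utf-8"))
    # Inside Any: type_url (field 1) and value (field 2), each with tag + length.
    any_size = (1 + varint_size(url_len) + url_len
                + 1 + varint_size(payload_len) + payload_len)
    # Outside: the parent's own tag and length for the embedded Any message.
    tag = (field_number << 3) | 2
    return varint_size(tag) + varint_size(any_size) + any_size - payload_len

# Hypothetical numbers: a 100-byte plugin payload stored in field 3.
plain = bytes_wrapper_overhead(3, 100)
with_any = any_wrapper_overhead(3, "type.googleapis.com/example.PluginExtras", 100)
```

For small payloads the bytes wrapper costs only a couple of bytes, while the Any wrapper also carries the full type_url string on every message.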

liujisi commented 7 years ago

We are planning to bring unknown fields back in proto3. Please take a look at the doc about the general plan: https://docs.google.com/document/d/1KMRX-G91Aa-Y2FkEaHeeviLRRNblgIahbsk4wA14gRk/edit#heading=h.w8dtggryroj4

Xorlev commented 7 years ago

Yes! Excellent news @pherl. Thank you for keeping us up to date. :)

fducat commented 7 years ago

Awesome news @pherl. Thanks a lot for the feature and for the clear upcoming implementation plan.

stevvooe commented 7 years ago

@pherl Thanks for the great response!

The provided document addresses all the major concerns. I hope we can also work with unofficial generators, like gogo/protobuf, to coordinate the rollout.

JemDay commented 7 years ago

@pherl - Thanks for the follow-up, much appreciated.

dopuskh3 commented 7 years ago

Hi,

We are planning to implement this at least for Java and C#. @pherl Could you clarify in the design document what kind of flag will be used to activate this option?

Should this be:

Defined when compiling protoc
Defined as a protoc command line flag
Defined as a Builder option

Regards, F.

danburkert commented 7 years ago

Will conforming implementations be required to preserve the original ordering of unknown fields when serializing messages?

Xorlev commented 7 years ago

Looking at UnknownFieldSet.java it looks like the order of unknown fields is entirely dependent on the backing Map. My guess would be "no", but it might be worth asking the question as to whether deterministic serialization mode should be extended to interleave unknown fields by ascending tag id in the output.
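
To make the interleaving idea concrete, here is a small sketch, not a description of what any existing implementation does. It assumes each field has already been encoded to bytes and is tracked as a hypothetical (field_number, encoded_bytes) pair:

```python
def serialize_deterministic(known, unknown):
    """Emit known and unknown fields together in ascending field-number order.

    known, unknown: lists of (field_number, encoded_bytes) pairs.
    sorted() is stable, so repeated occurrences of the same field number
    keep their relative order.
    """
    merged = sorted(known + unknown, key=lambda field: field[0])
    return b"".join(encoded for _, encoded in merged)

# Fields 1 and 3 are in the schema; fields 4 and 2 arrived as unknowns.
known = [(1, b"\x08\x01"), (3, b"\x18\x03")]
unknown = [(4, b"\x20\x04"), (2, b"\x10\x02")]
output = serialize_deterministic(known, unknown)
```

With this ordering the output no longer depends on the iteration order of whatever container backs the unknown field set.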

protocolbuffers / protobuf

proto3 and unknown fields #272