protocolbuffers / protobuf

Protocol Buffers - Google's data interchange format
http://protobuf.dev
Other
65.23k stars 15.44k forks source link

proto3 and unknown fields #272

Closed joshuarubin closed 6 years ago

joshuarubin commented 9 years ago

I know that unknown fields have been removed from proto3, but I am trying to get an explanation about why this change was made and if there is any way to replicate that behavior in proto3.

Thanks so much.

referred from golang/protobuf#25

danburkert commented 7 years ago

@Xorlev I bring it up because the two usecases mentioned in the doc ('intermediary servers', and 'read-modify-write') have subtle edge-cases when field re-ordering is combined with oneof fields. Serializing unknown fields by ascending tag id (or naively serializing unknown fields after known fields) can change the meaning of messages.

For instance consider the following schemas known to the intermediate server and end-user, respectively:

// Intermediate server schema.
message Schema {
  oneof test_oneof {
    string s = 1;
  }
}

// End-user schema.
message Schema {
  oneof test_oneof {
    string s = 1;
    int32 i = 2;
  }
}

Now consider a non-canonical serialized message: Schema { i = 42, s = "foo" } (note: when there are duplicate values for a tag in a serialized message, the last value wins). When re-serializing this message the intermediate server will output Schema { s = "foo", i = 42 } if ordering by ascending tag or unknown fields last. An end-user would interpret the original message as Schema { s = "foo" }, and interpret the message from the intermediate server as Schema { i = 42 }.

Edit: I should note, this issue is already discussed in the proto3 language guide, but the ramifications are a little more serious if/when people implement intermediate servers which should maintain message contents.

acozzette commented 7 years ago

@danburkert That's a good point about the oneof edge case. However, I think it would be fairly rare for anyone to actually hit that edge case, because you would have to both create a non-canonical serialized message and also add or remove an item from a oneof definition, which we already warn against doing in the docs you linked to. Trying to update all implementations to preserve unknown field ordering would probably also be too time-consuming and difficult to be practical.

This issue has actually come up in the past, because we are thinking about a C++ optimization that would involve stripping out unused fields at link time by effectively treating them as unknown fields. (Actually we already do something like this for Java lite). But when we do this we need to be careful not to mess with oneofs because naively treating them like unknown fields could cause the problem you described.

liujisi commented 7 years ago

The upcoming 3.4 release will provide APIs to explicitly drop unknowns for Java. There are already APIs in C++ and Python (DiscardUnknownFields). Make sure you are using those APIs explicitly if you rely on the current behavior.

jeremyherbert commented 7 years ago

I have a kind of related issue which I am not sure fits here, but I'm asking anyway :) Using proto3 with python (protobuf version is 3.3.0).

I am intending to use the intermediary pattern as mentioned above. However, I am using the oneof keyword. If I try to deserialize and reserialize a message with an unknown oneof, it discards the unknown oneof data. For example:

message.proto

message Message1 {
  bool flag = 1;
}
message Message2 {
  bool other_flag = 1;
}
message MasterMessageV1 {
  oneof payload {
    Message1 m1 = 1;
  }
}
message MasterMessageV2 {
  oneof payload {
    Message1 m1 = 1;
    Message2 m2 = 2;
  }
}

python test code:

>>> from message_pb2 import Message1, Message2, MasterMessageV1, MasterMessageV2
>>> test_m2 = Message2(other_flag=True)
>>> master = MasterMessageV2(m2=test_m2)
>>> encoded = master.SerializeToString()
>>> print(encoded)
b'\x12\x02\x08\x01'
>>> decoded = MasterMessageV1.FromString(encoded)
>>> print(decoded.SerializeToString())
b''

Essentially, if I serialize a MasterMessageV2 message with the m2 field set, when deserializing as a MasterMessageV1 it discards payload. If I then reserialize the decoded object and then deserialize as a MasterMessageV2 the m2 data is missing.

I realise that I can just change the type of payload to be bytes and decode them separately with some sort of payload_type enum, but then that breaks a lot of the niceness that comes with protobuf. Is this a bug? If not, is there a way that everyone else is handling this?

liujisi commented 7 years ago

@jeremyherbert This is the same issue as proto3 currently doesn't preserve unknown fields. It should be addressed in the next couple releases.

Note that adding new fields into oneofs is risky. Even with unknown fields preserved, the new field will not be visible in the oneof of the old binary. Instead of seeing an unrecognized type, the old message will treat the oneof as not set. You would have to dig into the unknown fields to distinguish between an unset oneof vs an unrecognized oneof.

vozbu commented 7 years ago

@pherl, the pattern "save unknown fields and then discard it" seems excessive for me. Isn't it better just to pass a flag to parsing function telling it to save or not to save unknown fields while parsing? It will save you memory and CPU in case you don't need these fields while will retain all desired benefits. In our workflows we sometimes have most of fields in message as unknown, and I'm afraid that parsing it will degrade our performance.

Actually, I would like to have such flag in proto2 too.

liujisi commented 7 years ago

@vozbu what language are you using? We do have API to skip unknowns fields in Java. Other languages chose to have a discard unknown fields API after parsing is finished mostly to reduce the complexity in implementation.

vozbu commented 7 years ago

@pherl, I'm talking about C++. I haven't seen the implementation to judge about it. I speak my thoughts as a user.

jbolla commented 7 years ago

@pherl, the doc you shared states "3.4 release (ETA: Q3 2017): Google protobuf implementation for each language will provide APIs to explicitly drop or preserve unknowns for proto3. A temporary flag will be introduced for the default parsing behavior - default to drop unknowns."

3.4 is released. Did that actually make it in? I'm using Java and I see the flag for retaining unknowns, explicitDiscardUnknownFields in CodedInputStream, but the parsing code I see is using: final boolean shouldDiscardUnknownFieldsProto3() { return explicitDiscardUnknownFields ? true : proto3DiscardUnknownFieldsDefault; } So even if you don't set that flag you get proto3DiscardUnknownFieldsDefault, which defaults to false and appears not to have any way for external users to change.

liujisi commented 7 years ago

The plan would be only to provide APIs for explicitly drop unknowns, for those who depend on the behavior. The default is only for testing only. In 3.5 we will flip the default.

On Wed, Sep 13, 2017 at 4:32 PM jbolla notifications@github.com wrote:

@pherl https://github.com/pherl, the doc you shared states "3.4 release (ETA: Q3 2017): Google protobuf implementation for each language will provide APIs to explicitly drop or preserve unknowns for proto3. A temporary flag will be introduced for the default parsing behavior - default to drop unknowns."

3.4 is released. Did that actually make it in? I'm using Java and I see the flag for retaining unknowns, explicitDiscardUnknownFields in CodedInputStream, but the parsing code I see is using: final boolean shouldDiscardUnknownFieldsProto3() { return explicitDiscardUnknownFields ? true : proto3DiscardUnknownFieldsDefault; } So even if you don't set that flag you get proto3DiscardUnknownFieldsDefault, which defaults to false and appears not to have any way for external users to change.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/google/protobuf/issues/272#issuecomment-329325957, or mute the thread https://github.com/notifications/unsubscribe-auth/AATQyUtZy8f6n6c-aVRnPPPSXV0oKyGuks5siGYYgaJpZM4D8C3u .

liujisi commented 6 years ago

All languages will be fixed in 3.5.x releases.

leighmcculloch commented 6 years ago

@liujisi Now that direction has changed and support added for preserving field to some implementations, will this recommendation in the official proto3 documentation be changing?

Proto3 implementations can parse messages with unknown fields successfully, however, implementations may or may not support preserving those unknown fields. You should not rely on unknown fields being preserved or dropped. For most Google protocol buffers implementations, unknown fields are not accessible in proto3 via the corresponding proto runtimes, and are dropped and forgotten at deserialization time.

Ref: https://developers.google.com/protocol-buffers/docs/proto3#unknowns

acozzette commented 6 years ago

@leighmcculloch Good catch, I'll update that documentation to say that unknown fields are now preserved for proto3 messages as of version 3.5.

MalteJ commented 6 years ago

Is there a public method to detect if a deserialized message has unknown fields?

This would be useful to check a message which is coming from an untrusted source. I do not want to relay the message to other services if I am not sure it complies to my proto format. Also in my case I cannot reserialize it because the serialized messages bytes are cryptographically signed (the serializer is not deterministic across different protobuf implementations).

I'm about to replace protobuf with JWT for this :(

MalteJ commented 6 years ago

There are methods to get a list of unknown fields. But:

In Go the parameter name suggests it should not be used ("XXX_unrecognized"). And the C++ docs say:

Get the UnknownFieldSet for the message.

This contains fields which were seen when the Message was parsed but were not recognized according to the Message's definition. For proto3 protos, this method will always return an empty UnknownFieldSet.

https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.message#Reflection.GetUnknownFields.details

dsnet commented 6 years ago

In Go, there is not currently a reliable way to programmatically interact with unknown fields. At best, you can use proto.DiscardUnknown to recursively discard all unknown fields. However, there is no stable API to iterate and/or modify the current set of unknown fields.

Furthermore, not all unknown fields are stored in XXX_unrecognized, unknown fields in the extension ranges are stored in proto.XXX_InternalExtensions. The current state of affairs is unfortunate, and we're working on v2 of the API, which will provide a stable way to read, modify, and write unknown fields.

kditrj2d commented 5 years ago

I'm coming to this party rather late... I've just upgraded a C# application that uses protobuffers from version 3.4.0 to 3.6.1. The application relies on unknown fields not being preserved. Now by default they ARE preserved and I've seen a significant and unacceptable increase in memory consumption. (The ratio of known to unknown fields is about 1:5.) There is mention here of APIs being available to explicitly discard the unknown fields but its not clear to me whether these were temporary and have now been removed or still exist. What is the current situation? Do these APIs still exist in the version 3.6.1 C# distribution? If so where can I find details?

Xorlev commented 5 years ago

From my understanding (though I don't work on protobufs, I've just been a part of this thread for a long time), these APIs are here to stay -- you will be able to keep or discard unknown fields depending on your use case.

https://github.com/protocolbuffers/protobuf/blob/e479410564727d8954e0704254f4345f97e3d844/csharp/src/Google.Protobuf/MessageParser.cs#L333-L340 Appears to be what you want -- applied to a MessageParser, it returns a new MessageParser which discards/doesn't discard unknown fields.

kditrj2d commented 5 years ago

Thanks for the reply. Found it, tried it, code now works again.