tafia / quick-protobuf

A rust implementation of protobuf parser
MIT License
452 stars 87 forks source link

Fields are boxed unpredictably #121

Closed jimblandy closed 6 years ago

jimblandy commented 6 years ago

The pb-rs code generator lacks a stable rule for deciding whether a field of a Rust structure representing a protobuf message should be boxed or not. Reordering the declarations in a .proto file, or upgrading the version of pb-rs, could change the generated Rust types in a way that would break the code that uses them.

There is an simple and stable rule that the generator could follow, instead of its present algorithm, that would let users easily anticipate where the language will box messages and where it will not.

In the current code, the FileDescriptor::break_cycles function traverses all the message types in the specification, looking for points where a message type might contain instances of itself, and introducing Box types to avoid defining types that are infinite in size. However, this algorithm is pretty subtle, and introduces boxes based on where it happens to detect cycles. If the algorithm were to change, the placement of boxes might change as well.

For example, on the following input:

syntax = "proto2";

message A {
  optional B f = 1;
}

message B {
  required A g = 2;
}

pb-rs generates the following struct types:

pub struct A {
    pub f: Option<B>,
}

pub struct B {
    pub g: Option<Box<A>>,
}

But changing the order of the two message declarations in the .proto file:

syntax = "proto2";

message B {
  required A g = 2;
}

message A {
  optional B f = 1;
}

changes the generated Rust types to:

pub struct B {
    pub g: A,
}

pub struct A {
    pub f: Option<Box<B>>,
}

The type of B::g changes from Option<Box<A>> to A.

However, any cycle of message inclusion in a .proto file must include at least one field that is repeated or optional. Otherwise, well-formed messages would need to be infinite in length. If pb-rs were to always box optional fields, and simply report an error whenever a cycle exists that was not broken by a repeated or optional field, then users would be better able to predict Rust types, and the code generator could be simplified.

Boxing optional fields seems desirable in general, simply because doing so lets the Rust values avoid spending memory on sub-messages that are absent. The size of a Rust Option<T> is at least that of T.

tafia commented 6 years ago

Thanks for the issue you are right. I'll try to answer to most of the points raised.

  1. optionals. Indeed, box should be done on optional fields only which MUST exist else the proto definition is wrong. I'll update it.
  2. repeated. If a message is repeated, it is currently not boxed at all (already in a Vec).
  3. lack of specification for which field will be boxed. This is an issue in general. I propose to implement these rules, if there is a cycle: a. only optional fields are boxed b. in all available optional fields, the box is put on the field with the biggest (sizeof) message. This is not mandatory but I suppose it is better this way, plus it at least give an ordering rule independent of the message order c. from all maximum size and optional fields, take the first one which appears.
jimblandy commented 6 years ago

Using the maximum size to decide where to break cycles would be an improvement, but it still seems a little obscure.

Google's protoc generates C++ that boxes every message that is contained in another message, whether optional, repeated or required. If pb-rs were to inline all required messages, and box all optional messages, that would be a step up from protoc, and still provide a simple relationship between protobuf types and Rust types.

Yes, repeated and Vec go naturally together - especially since a zero-capacity Rust Vec doesn't actually allocate any heap storage.

tafia commented 6 years ago

Fine. I am not a big fan of losing performance, in particular in proto3 files where fields are optional.

On the other hand guaranteeing a stable generated code is probably more important (questionable here too as you probably don't want to expose the internal deserialization details).

Anyway, in absence of real proof I guess following what other libraries are doing is probably wise and in general one should try to avoid cyclic messages if possible.

I have revamped (again) the break cycle algorithm, to identify all strongly connected components first then box all their optional fields.

tafia commented 6 years ago

close #119