support zero-copy string deserialization via ByteString.

VladimirBramstedt commented 2 years ago

Hiya. there have been a few PRs which allowed prost to use the bytes::Bytes type for the protobuf bytes type, allowing them to just be slices of a bigger Bytes buffer. Strings however dont quite work. https://docs.rs/bytestring/latest/bytestring/ is just a wrapper around bytes::Bytes with added UTF-8 validity checks. it feels like it would be a good addition (as a features, same as enabling using bytes::Bytes instead of Vec<u8> for byte types. we should be able to reuse all the code from that feature. hopefully we can go from something like

message Test {
  optional string s = 1;
  repeated  string rs = 2;
}

to

struct Test {
     #[prost(string = "bytestring", tag = "1")]
     pub s: Option<ByteString>,
    #[prost(string = "bytestring", tag = "2")]
    pub rs: Vec<ByteString>,
}

to being able zero-copy this when deserializing from Bytes.

Would that be something we are interested in?

edit: did a little benching with proto that contains a lot of strings. when decoding from bytes::Bytes, much faster:

decode 1000000 items stdstring:6181ms bytestring: 2922ms

when decoding from `&[u8], somewhat slower: decode 1000000 items stdstring:6349ms bytestring: 6974ms

because in the &[u8] case we need to copy anyway, and then i guess all the extra indirection of using Bytes shows.

seems to me like quite a big performance boost for string-heavy scenarios.

LucioFranco commented 2 years ago

I don't think adding arbitrary support for things like bytestring make sense, bytes yes since its in the tokio ecosystem and its used very heavily. That said, I think we need better support for being able to swap out generated types and that means updating the code gen in such a way that allows a user to add support like this without needing to make a change to prost. Though I have not figured out that design yet and its likely a substantial amount of work.

VladimirBramstedt commented 2 years ago

i agree with you in general, regarding swapping out generated types However. ByteString is just a newtype around bytes::Bytes. we can roll our own type and export if, if you dont want to depend on other repos. strings are very heavily used, and it feels a shame to give up this much performance. its not exactly niche use case. do you have other suggestions that we can do, in light of just how much this leaves on the table performance wise? i didnt quite expect these numbers, but it makes sense, essentially turning deserializaion of strings into a utf8 validity check and a few ref counts...

LucioFranco commented 2 years ago

I totally get that, in the short term if you have control over your proto files you can just swap it for Bytes instead of a string then convert the bytes to a string type. I do agree we are leaving perf on the table here but I also don't think adding new types right now is the right way when investing into a proper way to let users select what ever type they want would be much more effective and cover many more cases.

slinkydeveloper commented 2 years ago

Hi, just stumbled on this and I think it makes sense to include such a feature. As @VladimirBramstedt pointed out, this is definitely not a niche use case and providing a reasonable default implementation for no copy strings looks like an important feature. I have no specific opinions about bytestring, but it looks active from the numbers i see on crates.io :shrug:

In my use case it would be a bit problematic to swap the type definition in proto files, as I share my proto with other teams which develop the "client side" of the project in various languages, so letting people manually deal with string/bytes conversion might be problematic and error prone. I can still go and manually replace the string with bytes in the protos I import on the server side, but it's not a great solution.

I would make it optional though, as done with bytes, where at proto generation time one can pick whether to use std String or ByteString, and then you must include a feature like bytestring when declaring the prost and prost-derive dependency. I can help with this feature if you're interested.

LucioFranco commented 2 years ago

Right I am not disagreeing with wanting to add this but I don't want to add another ad-hoc type to the config. I want to solve this generically and spend effort/review/design time on that instead. Even though the initial cost is more than just adding bytestring support, it will enable many other users to do what they want and reduce complexity at the same time.

VladimirBramstedt commented 2 years ago

Right I am not disagreeing with wanting to add this but I don't want to add another ad-hoc type to the config. I want to solve this generically and spend effort/review/design time on that instead. Even though the initial cost is more than just adding bytestring support, it will enable many other users to do what they want and reduce complexity at the same time.

i dont disagree either. i think your approach is the correct long term solution, i just wonder how long. for me this specifically affects a production workload, and these kind of saving translate to money. if long term is years, i guess i have no choice but to maintain a fork.

Now, as for the long-term solution, maybe this isnt the correct ticket to discuss this, but my thought was to actually use the capabilities build into protobufs to do type mapping. protobufs support custom options, see https://developers.google.com/protocol-buffers/docs/proto#customoptions

message Test {
  optional string s = 1 [(prost.gen_types) = "bytestring::ByteString"]; // use bytestring
  repeated  string rs = 2; [(prost.gen_types) { type = "std::collections::HashSet<String>", decode: "::my::custom::decoder::func"]; //or maybe something even more free
 optional OtherMessage foo = 3 [(prost.wrap) = "Box"];  // maybe even use it to allow wrappers here.
}

where normal rust types can be implemented in prost, and "custom types" can be decoded using either their Message impl if possible (for your own type custom types), or a user provided custom function to allow users to decode to types that are not defined in their own crate (similar to how serde does with the #[serde(deserialize_with = "func")] attribute. custom func may be something like with the same signature as prosts merge fn, ie fn<B:Buf>merge(&mut self, buf: &mut B, /* wire_type, etc*/) -> Result<Whatevertype,DecodeError>

bonus point is that since you can use custom options on mesage or file level too, it can make the information that is currently split between the build.rs file (for what prost will do) and the actual protobuf, reside in the same file for a better overview.

thats my quick take, and if you think thats a good idea id love to drive this forward.

same idea used for validation, for an example https://github.com/bufbuild/protoc-gen-validate/blob/main/README.md

LucioFranco commented 2 years ago

i dont disagree either. i think your approach is the correct long term solution, i just wonder how long. for me this specifically affects a production workload, and these kind of saving translate to money. if long term is years, i guess i have no choice but to maintain a fork.

If maintaining a fork is what is needed then that is fair, I think if it saves you money then it makes sense to contribute upstream to help the project out.

I think going the extensions/custom options method is probably the best way, but that will be a while since the extensions support is still in early design phase.

I think the quickest way to have generic support is to abstract how we have bytes/btree support by having a map that maps path -> string type name. Then the user can implement Message for custom types and use them in their generated code by just changing the build config. I think this solution would take less than a few weeks to get shipped as I don't think its actually that complicated and we can then remove bytes/btree support specifically.

VladimirBramstedt commented 2 years ago

I think the quickest way to have generic support is to abstract how we have bytes/btree support by having a map that maps path -> string type name. Then the user can implement Message for custom types and use them in their generated code by just changing the build config. I think this solution would take less than a few weeks to get shipped as I don't think its actually that complicated and we can then remove bytes/btree support specifically.

so instead of having like config.bytes(&[".foo.Bar"]) like we have now, we'd remove the specific method in favour of like

let typemap = BtreeMap::from([(".package.msg.field", "std::string::String"), 
    (".package.msg.field2","bytestring::BytesString")]);
config.type_map(typemap)

i think thats doable, as a medium term solution.

LucioFranco commented 2 years ago

I don't think we even need BtreeMap? I think we could just have a config option that is just config.add_type_override(".package.msg.field", "foo:bar::Baz"); and its up to the user to ensure that type implements Message. I think that would support every use case? We could also provide helper methods for btree/bytes out of the box but I am not totally convinced we need to do that.

LucioFranco commented 2 years ago

As for the types that implement Message in the prost crate those are the ones we use which are bytes and things in std. Anything outside of that will need to implement it themselves or have a new type wrapper. This is due to not wanting prost to depend on many external projects that may have different versioning schemes.

VladimirBramstedt commented 2 years ago

thats a start. i really wish we could provide an interface for providing a free function for decoding types rather than impl message, for other types. my usecase was bytestrings, there was talk in some other github issue for the Heapless types (which are de-facto standard in no_std and no alloc environments, and imho can be treated the same as Bytes for that matter. the choice to include Bytes "because its oftenly used" but not Heapless is arbitary and desktop-centric). anyway, newtyping everything makes your own wrappers not easily compatible with other libs easily.

i could play with the type override approach and see how far i get, if you wish.

tokio-rs / prost

support zero-copy string deserialization via ByteString. #752