liftbridge-io / liftbridge

Lightweight, fault-tolerant message streams.
https://liftbridge.io
Apache License 2.0

Grpc and proto #87

Open feribg opened 5 years ago

feribg commented 5 years ago

Great project and idea.

I wanted to share some feedback, not really an issue per se. The main reason I've moved away from NATS Streaming and passed on Liftbridge has been the heavy use of protobufs.

Most implementations of protobufs allocate, some heavily so, making it pretty terrible for low-latency messaging. I ended up building something very similar to Liftbridge (albeit much, much more barebones) on top of pure NATS for that reason alone.

I understand it's a trade-off between simplicity/adoption/speed, but figured it might be useful feedback: sooner or later, even if you optimize the storage pipeline, protos will become a limiting factor, or at least they did in my case.

Would be great to see a streaming messaging system supporting pluggable ser/des formats; that would really bring something great to the table. Or at least one that uses an encoding that doesn't allocate on the hot path: flatbuf, capnproto, SBE (the latter is only good for numeric data, so unlikely to be useful for the general case).
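A pluggable format could be as small as a two-method codec interface. A minimal Go sketch of the idea (the names are hypothetical, not an existing Liftbridge API):

```go
package codec

// Codec abstracts the wire format so a broker or client could swap
// protobuf for flatbuffers, Cap'n Proto, SBE, etc. (hypothetical API).
type Codec interface {
	// Marshal appends the encoded message to buf and returns the
	// resulting slice, letting callers reuse buffers and avoid
	// allocating on the hot path.
	Marshal(msg interface{}, buf []byte) ([]byte, error)
	// Unmarshal decodes data into msg; a zero-copy codec may retain
	// references into data rather than copying.
	Unmarshal(data []byte, msg interface{}) error
}
```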

👍 👍

tylertreat commented 5 years ago

Totally fair criticism. Using gRPC/protobuf was something I debated for a while, and you're right about it being a trade-off between simplicity, adoption, and performance. Ultimately, I went with gRPC because it makes implementing client libraries significantly simpler (and thus, at least in theory, helps adoption).

It does look like you can use Flatbuffers with gRPC, so I may take a look at that. Maybe that would be an improvement over protobuf?
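For reference, a minimal sketch of what that wiring could look like in Go, assuming the `FlatbuffersCodec` that ships with the flatbuffers Go runtime (service registration and listening are omitted):

```go
package main

import (
	flatbuffers "github.com/google/flatbuffers/go"
	"google.golang.org/grpc"
)

func main() {
	// The flatbuffers Go runtime ships a gRPC codec; registering it
	// lets flatbuffers-generated messages travel over gRPC in place
	// of protobuf.
	srv := grpc.NewServer(grpc.CustomCodec(flatbuffers.FlatbuffersCodec{}))
	_ = srv // ... register services and call srv.Serve(lis) as usual
}
```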

I have had some thoughts about a separate zero-copy read API but admittedly have not put a ton of thought into it at this point.

tsingson commented 5 years ago

Protobuf is easier to use and encode than flatbuffers.

Still, in my personal project I prefer flatbuffers for its better decoding performance.

feribg commented 5 years ago

@tylertreat Never used flatbuf with gRPC, but it could be an interesting approach; worth benchmarking at least.

I think the second point you brought up is probably better long term; it's basically what I was alluding to in my comment: keep the zero-copy use case in mind when designing the core architecture and data structures, and avoid strings as much as possible (or plan to offer a way out of them as an opt-in for client drivers). Then later, once the project has enough adoption, offering a lower-level zero-copy API would be easier because things are not completely tied to gRPC.

https://github.com/real-logic/aeron has some pretty good examples of data structures used to keep messaging latency as low as possible. It is just a transport, but I think many of the ideas can be ported.

tsingson commented 5 years ago

For most message structures passing through NATS, whether simple or complex, protobuf over gRPC is suitable, balanced, and easy to encode and decode, so gRPC with protobuf is a reasonable default.

However, I think it is feasible to provide a swappable interface so that flatbuffers over gRPC can be used with simple data structures.

I'm already researching a fork of Liftbridge that supports flatbuffers over gRPC, for use only in my own project. Because the data structures in our project are relatively simple and rarely change, flatbuffers over gRPC should help reduce the decoding overhead after message distribution (in the Liftbridge client).

tylertreat commented 5 years ago

@tsingson I will be curious to hear the results of your experimentation with flatbuffers and gRPC and if it's worth pursuing in upstream Liftbridge, what the trade-offs are, etc.

tsingson commented 5 years ago

Working on it.

I will report results from my trial project.

riaan53 commented 5 years ago

Just compile the proto with https://github.com/gogo/protobuf; no code changes are necessary. It is fast: https://github.com/alecthomas/go_serialization_benchmarks
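For illustration, this is a code-generation swap rather than a source change. A minimal sketch, assuming `protoc-gen-gofast` is installed and the schema lives in a hypothetical `api.proto`:

```go
// Hedged sketch: regenerate the same .proto with gogo's faster
// generator instead of the stock protoc-gen-go. Install the plugin:
//
//	go get github.com/gogo/protobuf/protoc-gen-gofast
//
//go:generate protoc --gofast_out=. api.proto
package proto
```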

llchan commented 4 years ago

I helped out with the Flatbuffers-in-gRPC C++ code, so I figured I'd chime in here. From a C++ client's perspective, the decoding step would look pretty similar to the protobuf version. And from what I understand, the Go experience for both server and client is fairly similar: message building/reading is a touch more verbose, but it's quite similar structurally.

Imho, as a low(ish)-level messaging library, we should consider every allocation, and given that protobuf doesn't have zero-copy support in most languages, I think it's worth at least an experimental branch to play around with it.

p.s. I might start a cpp-liftbridge repo, mostly as a testbed for myself but maybe it will be a thing.

edit: poked around a bit; I think I can set up a proof of concept on a branch. Will look at it more this weekend.

llchan commented 4 years ago

Quick status update: there's an in-flight flatbuffers PR that adds an object API (i.e. allocating and copying into structs), and I made some additional enhancements on top of that to add a drop-in protobuf mode. Using this, I currently have Liftbridge working with flatbuffers with no major code changes :). I'll try to get this vetted before making more invasive changes that take advantage of the lower-level direct flatbuffers API.
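For context, the flatbuffers object API (flatc's `--gen-object-api` flag) generates mutable `...T` structs that pack into a builder, which is what makes a drop-in replacement possible at the cost of one copy. A rough sketch, where `fbgen` is a hypothetical generated package for a `Message` table:

```go
package main

import (
	"fmt"

	flatbuffers "github.com/google/flatbuffers/go"

	fbgen "example.com/liftbridge-fb/api" // hypothetical generated package
)

func main() {
	// The object API emits a protobuf-like MessageT struct with a Pack
	// method, so code can keep working with plain structs and pay one
	// copy at encode time.
	b := flatbuffers.NewBuilder(0)
	msg := &fbgen.MessageT{Key: []byte("key"), Value: []byte("value")}
	b.Finish(msg.Pack(b))
	fmt.Println(len(b.FinishedBytes())) // finished flatbuffer, ready to send
}
```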

feribg commented 4 years ago

@llchan Amazing news! Did you get a chance to collect any preliminary end-to-end latency numbers?

llchan commented 4 years ago

This is really quite preliminary. I haven't even spun up a Linux host for testing yet; I'm just getting it functional in WSL at the moment (this is a personal side project, so I'm not using work hardware). I'll probably need some guidance to set up realistic benchmarks. That said, I wouldn't expect miracles from this first pass, as it's doing a ton of unnecessary copies at the moment.

Not urgent, but @tylertreat whenever you have a second, I have some questions since I am very new to this codebase:

I'll probably have more questions as I get deeper into this, so if there's a better chat medium let me know (though maybe GH is good, to keep interested parties in the loop).

annismckenzie commented 4 years ago

Just chiming in here. I'm working on a docker-compose dev setup over in #115 so if you need a working dev cluster you may want to check that out.

tylertreat commented 4 years ago

@llchan

> What's PacketEncoder and related? How come some things are encoded "by hand", and others are protos?

Legacy cruft, mostly. I removed the decoder code here since it actually was no longer being used. The PacketEncoder stuff is now just used for serialization to the log. I have no qualms revisiting this piece; in fact, it probably needs to be revisited but just hasn't been a priority.

> If you're not opposed to it, I may restructure some of the messages once we get further along in playing around with flatbuffers

No problem with this as long as we're not breaking existing clients. For breaking changes, we should have an opt-in flag or something, I think? I have not actually committed to a stable 1.0 release yet, so we're technically OK to make breaking changes; I'd just like to keep them to a minimum. That said, if breaking API changes need to be made, now's the time to do it.

llchan commented 4 years ago

This branch changes the protocol, so it's already API-breaking. While it's possible to build different variants under different build tags, or to structure the encoder in a runtime-swappable way, I'm not sure we actually want to support multiple formats down the road; if we decide to switch to flatbuffers, I would suggest we make it a hard cut.

That said, once some of my flatbuffers changes are merged in, I think we can make this a minimal code change for existing clients by generating protobuf-like shims. Hopefully it's a smooth transition even if the on-wire format is totally different.
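A shim like that might look something like the following; `fbgen`, the `Message` table, and its accessor names are hypothetical:

```go
package shim

import fbgen "example.com/liftbridge-fb/api" // hypothetical generated package

// Message wraps a flatbuffers-generated table behind protobuf-style
// getters so existing client code keeps compiling unchanged, even
// though the on-wire format underneath is flatbuffers.
type Message struct{ fb *fbgen.Message }

func (m *Message) GetKey() []byte   { return m.fb.KeyBytes() }
func (m *Message) GetValue() []byte { return m.fb.ValueBytes() }
func (m *Message) GetOffset() int64 { return m.fb.Offset() }
```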

llchan commented 4 years ago

Actually, I take that back. The nested flatbuffers changes I was alluding to are all internal to the Liftbridge server, and the client API remains structurally the same, so I think it should be reasonably safe.

tylertreat commented 4 years ago

@llchan Any preliminary benchmarks indicating what the performance difference is? I'm open to a hard cut over to Flatbuffers if it makes a big enough dent performance-wise.

llchan commented 4 years ago

Okay, I got a nats-server set up for testing. A preliminary test with the straight, unoptimized conversion (i.e. copying everything into structs) shows it equivalent to protobuf using lift-bench (about 1% faster for the best run of each out of 5, but that's within the measurement noise). This is reasonable, since it's doing roughly the same amount of work that protobuf is doing.

Next step is for me to convert some of the code to build the messages directly rather than initializing a struct and encoding it.
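Building directly with the builder might look like this sketch (`fbgen` and the `Message` schema are hypothetical):

```go
package fb

import (
	flatbuffers "github.com/google/flatbuffers/go"

	fbgen "example.com/liftbridge-fb/api" // hypothetical generated package
)

// buildMessage writes the flatbuffer directly with the builder, skipping
// the intermediate struct-then-encode step (and its allocations).
func buildMessage(b *flatbuffers.Builder, key, value []byte) []byte {
	b.Reset() // reuse the builder's backing buffer across messages
	k := b.CreateByteVector(key)
	v := b.CreateByteVector(value)
	fbgen.MessageStart(b)
	fbgen.MessageAddKey(b, k)
	fbgen.MessageAddValue(b, v)
	b.Finish(fbgen.MessageEnd(b))
	return b.FinishedBytes()
}
```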

llchan commented 4 years ago

This is extremely unscientific and rough right now, but just to get a ballpark indication that this is a worthwhile experiment:

protobuf

Elapsed: 6.028546539s, Msgs: 4096, Msgs/sec: 679.434085
Elapsed: 4.873512933s, Msgs: 4096, Msgs/sec: 840.461502
Elapsed: 5.328956409s, Msgs: 4096, Msgs/sec: 768.630795
Elapsed: 6.348228391s, Msgs: 4096, Msgs/sec: 645.219382
Elapsed: 5.007464874s, Msgs: 4096, Msgs/sec: 817.978778

flatbuffers

Elapsed: 4.625727358s, Msgs: 4096, Msgs/sec: 885.482365
Elapsed: 4.386690292s, Msgs: 4096, Msgs/sec: 933.733573
Elapsed: 4.46344128s, Msgs: 4096, Msgs/sec: 917.677582
Elapsed: 4.457075176s, Msgs: 4096, Msgs/sec: 918.988314
Elapsed: 4.407425056s, Msgs: 4096, Msgs/sec: 929.340816

These are with 512 KiB values sent to Liftbridge + NATS, both on localhost. Larger messages should show a more noticeable benefit, since the main advantage of flatbuffers is avoiding copies. There's much more to be done, but I figured I'd share some progress. The changes will get increasingly invasive, so I'll try to structure the commits so they're easy to follow.

llchan commented 4 years ago

Another update: now I'm testing just the NewMessage + NATS publish step, ignoring all the acks and such (i.e. pushing a writer to its limits): protobuf at ~2.5k msgs/s, flatbuffers at ~4.0k msgs/s, and for a reference point, raw nats-bench gets somewhere around 4.1k msgs/s with 1 publisher and 0 subscribers. This is a slightly unfair comparison because I spent no time optimizing the protobuf side, but given the way protobufs work, I think it would be difficult to get the same level of control over encoding/allocations.

feribg commented 4 years ago

@llchan Throughput will be better, but tail latency should be way better. Flatbuf doesn't do zero-packing/compression, last time I checked, right? So how do you deal with allocating buffers and avoiding resizes: do you just allocate big buffers, or do you take a resize hit?

llchan commented 4 years ago

Yep, tail latencies should be better, because we can design for better buffer reuse to minimize GC pauses, page faults, etc. What I did in my test program was reuse the buffer, since we're sending serially and the message sizes are the same. I think that's a common use case (messages of roughly the same size), and in those cases the message building should not bother freeing the buffer at all. Down the road we could probably have a buffer pool, possibly keyed by size, that facilitates reuse without the user managing buffers manually.
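A sketch of such a pool, keyed by power-of-two size class (an assumption for illustration, not Liftbridge code):

```go
package bufpool

import "sync"

// One sync.Pool per power-of-two size class, so messages of similar
// size reuse each other's buffers without callers managing them.
var pools [32]sync.Pool

// sizeClass returns the smallest c such that 1<<c >= n.
func sizeClass(n int) int {
	c := 0
	for s := 1; s < n; s <<= 1 {
		c++
	}
	return c
}

// Get returns an empty buffer with capacity of at least n bytes.
func Get(n int) []byte {
	c := sizeClass(n)
	if b, ok := pools[c].Get().([]byte); ok && cap(b) >= n {
		return b[:0]
	}
	return make([]byte, 0, 1<<uint(c))
}

// Put makes a buffer available for reuse by later Get calls.
func Put(b []byte) {
	pools[sizeClass(cap(b))].Put(b[:0])
}
```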

I'm not 100% sure I answered your question, since it didn't really involve zero-packing, but you are correct that flatbuffers does not zero-pack. If you're asking about flatbuffers mutation, I am not currently using that, though an optimized publisher could. For example, a metrics publisher can build the message once and, on each publish, just mutate the relevant bytes with the generated methods, without rebuilding the whole message. Typically those messages have a constant part, like a string identifier, and a changing part, and there's no need to re-copy the static identifier.
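A rough sketch of that pattern, assuming a hypothetical generated `Metric` table with scalar `value`/`ts` fields (flatc generates in-place `Mutate*` accessors for scalars):

```go
package metrics

import fbgen "example.com/liftbridge-fb/api" // hypothetical generated package

// publishSamples builds nothing per message: the Metric flatbuffer in
// buf was built once (constant name string included), and only its two
// scalar fields are rewritten in place before each publish.
func publishSamples(buf []byte, values, timestamps []int64, publish func([]byte)) {
	m := fbgen.GetRootAsMetric(buf, 0)
	for i := range values {
		m.MutateValue(values[i])  // patch 8 bytes in place
		m.MutateTs(timestamps[i]) // no re-encode, no allocation
		publish(buf)              // same backing buffer every time
	}
}
```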

In any case, the takeaway is really that flatbuffers allows us more flexibility in the encoding step so we can do stuff like this.

tylertreat commented 4 years ago

I've been exploring flatbuffers a bit for Liftbridge, but it appears gRPC support in flatbuffers is still quite limited depending on the language; e.g. it appears not to be supported for Python or JavaScript. If this is indeed the case, it seems we should stick with protobuf, at least for a 1.0 release, to support a wider range of client libraries?

feribg commented 4 years ago

@tylertreat Have you looked at RSocket at all? It seems to have better semantics for building a message-queue type of application, especially around flow control, and maybe multicast down the road; it might allow the Liftbridge broker to do UDP multicast, for example. It would still require an API definition (which I guess is the main out-of-the-box gain from gRPC). https://medium.com/netifi/differences-between-grpc-and-rsocket-e736c954e60

adsharma commented 3 years ago

@llchan this is awesome work. Thanks for sharing.

> flatbuffers
>
> Elapsed: 4.625727358s, Msgs: 4096, Msgs/sec: 885.482365
> Elapsed: 4.386690292s, Msgs: 4096, Msgs/sec: 933.733573
> Elapsed: 4.46344128s, Msgs: 4096, Msgs/sec: 917.677582
> Elapsed: 4.457075176s, Msgs: 4096, Msgs/sec: 918.988314
> Elapsed: 4.407425056s, Msgs: 4096, Msgs/sec: 929.340816

I have dabbled with flatbuffers over the years, mainly for zero-copy reasons. One of my observations is that a flatbuffers-encoded message is larger than the corresponding protobuf-encoded gRPC message due to the stored offsets. Do you have a comparison of bytes/sec on the wire?

These days there are excellent serde implementations integrated into various language ecosystems: Rust has serde, Python has pyserde, and I'm sure Go and C++ have something nice as well. I wish we could invest in one zero-copy, sendfile-style network implementation and have the various encoders compete on their own merits, as opposed to on how rich the surrounding tooling and ecosystem are.

These implementations are declarative, so you can switch encoders by changing a decorator on a message type. You can also use different encoders for different message types: perhaps gRPC on the control plane, but something more optimized on the data plane.

Also, does NATS do anything with PUB/SUB such as sending the bytes on the wire once (fire-and-forget) and then processing acks from each client at a higher level?

Related blog post: https://adsharma.github.io/flattools/