
posts/dont-use-protobuf-for-telemetry #22

utterances-bot commented 3 years ago

Don’t use Protobuf for Telemetry | Richard Startin’s Blog

Protobuf needs no introduction, but this post argues that you shouldn’t use it for telemetry. The basic premise is that a good telemetry library needs to be lightweight to avoid perturbing the application; inefficient diagnostic tools are self-defeating. Unlike other formats, nested Protobuf messages cannot be written contiguously into a stream without significant buffering. The post doesn’t argue that you should never use Protobuf, but that the trade-offs made by the wire format itself, as opposed to any particular implementation, are unlikely to work for lightweight message senders.

https://richardstartin.github.io/posts/dont-use-protobuf-for-telemetry

chauhraj commented 3 years ago

Very informative article, and thanks for the intrinsics insight...

danielweis commented 3 years ago

Hi! Your introduction is a very nice write-up :)

Re: efficient varint encoding

I actually prototyped the clz-oriented varint size estimation back in December of 2015 (I still have the pending change in my workspace!). Unfortunately, it performed worse for small varints and better for larger ones in our microbenchmarks. Given that the distribution of varints in the wild is overwhelmingly skewed towards small values, we decided to stick with what we had.
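For reference, a minimal sketch of the clz-based length computation being discussed (illustrative only, not the prototype mentioned above):

  // Compute a 64-bit varint's encoded length from the leading-zero count;
  // `| 1` keeps value == 0 well defined (length 1).
  static int varintLength(long value) {
    int significantBits = 64 - Long.numberOfLeadingZeros(value | 1);
    return (significantBits + 6) / 7; // ceil(significantBits / 7)
  }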

Re: heaviness

We've expended little effort on optimizing code size for protobuf-java so there's probably a lot of low hanging fruit. We have put a lot of effort into optimizing protobuf-javalite for lower footprint, though with less of a focus on the runtime and more of a focus on the average size of generated messages. Generally, it is larger applications that care about footprint, and larger applications have a lot of generated messages but only one runtime, so this tradeoff ends up being a good deal.

opinali commented 3 years ago

I think varint encoding doesn't force you to use the shortest possible encoding, so if you want you can encode the number 3 in two bytes; the second byte will just have all 7 of its payload bits set to zero. This means you can have fixed-size length fields as long as you choose a hardwired maximum size, which is also what binary formats without varints do when they need length fields, at the extra cost of 1 bit in every 8 for the continuation bits. And you can choose a different hardwired size for the lengths of different things in the same message.
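As an illustrative sketch of that idea (not from any Protobuf implementation), a varint can be padded to a hardwired width by emitting redundant continuation bytes:

  // Encode `value` as a varint padded to exactly `width` bytes; assumes
  // value fits in 7 * width bits. writePaddedVarint(buf, 0, 3, 2) emits
  // 0x83 0x00, which still decodes to 3.
  static void writePaddedVarint(byte[] buffer, int offset, long value, int width) {
    for (int i = 0; i < width - 1; ++i) {
      buffer[offset + i] = (byte) ((value & 0x7F) | 0x80); // continuation bit set
      value >>>= 7;
    }
    buffer[offset + width - 1] = (byte) (value & 0x7F); // last byte: no continuation
  }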

brentru commented 3 years ago

Great analysis and write-up. You mentioned msgpack; what else would you suggest for low-latency use cases?

kevincox commented 3 years ago

I think the more fundamental concern is that you shouldn't use protobuf for large messages. For encoding, as you said, you can't stream it, and for decoding there is only minimal laziness available (you can do streaming, event-based decoding, though). This is well documented, too. If your messages are larger than a few megabytes or contain over 1k fields (recursively), you should probably consider something else, quite possibly a stream of smaller messages, as sketched below.
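One way to do the stream of smaller messages, sketched with protobuf-java's delimited helpers (`Event` is a hypothetical generated message type and `process` a hypothetical handler):

  // Write a sequence of small, length-delimited messages instead of one
  // large message; each record is a varint length prefix + message bytes.
  try (OutputStream out = new FileOutputStream("events.bin")) {
    for (Event event : events) { // `events` is some Iterable<Event>
      event.writeDelimitedTo(out);
    }
  }
  // The reading side consumes them one at a time; parseDelimitedFrom
  // returns null at end of stream.
  try (InputStream in = new FileInputStream("events.bin")) {
    Event event;
    while ((event = Event.parseDelimitedFrom(in)) != null) {
      process(event);
    }
  }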

richardstartin commented 3 years ago

@brentru I have a preference for formats which allow you to write monotonically to the output. The benefit of doing this is that you can choose arbitrarily small buffers to write to, because there is no risk of needing to go backwards to fix something like a length up and finding it has already been flushed. Formats like Protobuf and BSON with length prefixes don’t satisfy this property. Msgpack and its derivative CBOR do, because they only require element length prefixes; the downside is that they are self-describing, so they end up larger on the wire than Protobuf.

It’s a big trade-off, but if your data is mostly textual, JSON is perfect for streaming because you just need to maintain a stack indicating at which levels arrays or objects were started. Parsing JSON is traditionally slow, and, like other self-describing formats, it’s a lot larger on the wire, but it’s dirt cheap to write: it needs no look-ahead into an incoming stream of tokens, which means the serialised model never needs to be materialised, and no look-back into the output, which allows freedom in buffer sizing. That said, needing to convert any numbers to text will wipe out the benefits of JSON’s structural symmetry pretty quickly.
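To illustrate the monotonic property (a sketch, not code from the post): a JSON writer needs nothing but a stack of open containers and only ever appends, so its buffer can be flushed at any moment.

  import java.util.ArrayDeque;
  import java.util.Deque;

  // Append-only JSON writer: a stack remembers which closing bracket each
  // open container needs; nothing written earlier is ever patched.
  final class StreamingJsonWriter {
    private final StringBuilder out = new StringBuilder(); // stand-in for a flushable buffer
    private final Deque<Character> stack = new ArrayDeque<>();
    private boolean needComma;

    void startObject()  { separate(); out.append('{'); stack.push('}'); needComma = false; }
    void startArray()   { separate(); out.append('['); stack.push(']'); needComma = false; }
    void end()          { out.append(stack.pop()); needComma = true; }
    void name(String n) { separate(); out.append('"').append(n).append("\":"); needComma = false; }
    void value(long v)  { separate(); out.append(v); needComma = true; }

    private void separate() { if (needComma) out.append(','); }

    @Override public String toString() { return out.toString(); }
  }

For example, startObject(); name("a"); value(1); name("b"); startArray(); value(2); value(3); end(); end(); produces {"a":1,"b":[2,3]}.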

tkowalcz commented 3 years ago

Hi, thanks for the interesting article. It helped me understand protobuf before implementing my own no-allocation client for a specific service.

I have a question about your way of writing varints. I assume you benchmarked it? How did it fare? I tried four algorithms for writing varints and the length+loop version was the slowest one. Code - if anybody cares to check.

richardstartin commented 3 years ago

@tkowalcz Yes, I benchmarked the encoding and found a 20% improvement for longer varints, with a very small penalty for one byte varints. This is similar to experiences reported by the developers of Kotlin’s protobuf library where this approach was implemented.

opinali commented 3 years ago

@richardstartin I'm curious about this point of "going back". The generated code for Protobuf keeps track of the serialized size of all objects, which you can get with methods like ByteSizeLong() (C++) or getSerializedSize() (Java). This is one of the solutions you mention in your article; this size, once computed, is cached in the proto objects, and it includes the effect of varint encodings as well as any other variable-length fields such as strings. I'm not intimate with the implementation, but I guess the serializer uses the cached fields so it can append to the output without ever having to go back to patch size fields. Of course that means the size fields take some memory, and you might need a first pass through all the data to compute the sizes if they aren't populated yet, but in practice I believe both should be minor concerns outside of worst-case microbenchmarks.
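That two-pass flow looks roughly like this in protobuf-java (a sketch; `Span` and its setters are a hypothetical generated message class):

  // Pass 1 computes and caches all sizes; pass 2 is then a pure
  // append-only write with no back-patching of length fields.
  Span span = Span.newBuilder().setName("db.query").setDurationNanos(42_000L).build();
  int size = span.getSerializedSize();
  byte[] bytes = new byte[size];
  CodedOutputStream out = CodedOutputStream.newInstance(bytes);
  span.writeTo(out);
  out.checkNoSpaceLeft(); // the buffer is exactly filled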

You should also consider the full systemic effect of a design. Size prefixes create some extra work for writers, but they pay off for readers. Code that deserializes the stream knows the length of upcoming fields and objects, which makes it possible to optimize memory allocation in various ways (this is of course only relevant for languages that give control over such things, like C++). Most data is read more times than it's written, even for one-off use in places like RPC, if you have middleware that needs to sniff the data for security, monitoring, etc. Size prefixes also optimize scanning a subset of the data, since you can quickly skip most uninteresting fields (and this is even more effective in deeply-nested protos; skipping one big message can move you kilobytes ahead). This makes serialized protos a good choice for persistent data formats that support fine-grained queries. BTW, most trade-offs of proto are great for persistence: varints save significant space, the streams overall are close to optimally compact without any compression, and any noticeable savings in I/O (even from SSDs) are orders of magnitude more important than any other factor in 99% of systems.
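The field-skipping point looks roughly like this with protobuf-java's CodedInputStream (a sketch; WANTED_FIELD is a hypothetical field number):

  // Hop over uninteresting fields via their size prefixes; for a
  // length-delimited field, skipField reads the length and jumps the
  // whole payload, however deeply nested it is.
  CodedInputStream in = CodedInputStream.newInstance(bytes);
  int tag;
  while ((tag = in.readTag()) != 0) {
    if (WireFormat.getTagFieldNumber(tag) == WANTED_FIELD) {
      // ... read just this field ...
    } else {
      in.skipField(tag);
    }
  }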

Completely agree on the heaviness of Java protobuf, though it's hard to blame only proto for that. Hopefully we can someday have a new implementation that's optimized to use Java's new value types / records, with a much more compact representation and sequential layout. Even low-level memory optimizations like arena allocation / placement-new may be possible as automatic JIT optimizations (at least I can see how this might be implemented, but there are no plans yet from the JDK team AFAIK).

richardstartin commented 3 years ago

@opinali you've missed the point of the article: in the case of telemetry, the writing is done at the edge, likely on someone else's infrastructure, on their infrastructure budget, without any control over resources. Protobuf's reader/writer balance only makes sense when you control both sides and can trade one off against the other. The article's title wasn't "Don't use Protobuf"; it makes sense in other contexts.

opinali commented 3 years ago

@richardstartin good point, I had read the article 3 months ago when I wrote my first reply here and forgot this context. I was also very skeptical of the notion that these traits of proto can be significant enough to introduce measurable costs or perturb the monitored app, but perhaps that's due to my bias as a Googler (for one thing, most of our telemetry is produced by C++ code, even for Java apps; it's also a closed system, so for any possible performance trade-off we see both the producer and consumer consequences and can optimize for the net total).

franz1981 commented 3 years ago

I haven't verified how/if unrolling is performed on the counted loop code, and whether the data dependency prevents it from happening or hurts performance; maybe manually unrolling the loop by different factors, precomputing batches of values, would be interesting :)

tkowalcz commented 3 years ago

@franz1981 I removed the loop altogether and used Richard's Long.numberOfLeadingZeros(value) method to advance the "write pointer". In JMH tests that approach was the winner, even more so if I know the largest value and limit the number of bytes to use.

Example here.

richardstartin commented 3 years ago

@franz1981 I've been called out on suspect claims about performance again here, which triggered me somewhat, and I spent some time digging into it, because it just doesn't tally with my experience from writing the custom protobuf serialisation for DDSketch mentioned in this post. The best I could come up with was this, which is faster than my implementation or the ones in the other blog post after converting them to handle longs:

  // Maps Long.numberOfLeadingZeros(value) to the index of the varint's
  // last byte, i.e. its encoded length minus one; 64 leading zeros
  // (value == 0) maps to 0, a single byte.
  private static final int[] VAR_INT_LENGTHS = new int[65];
  static {
    for (int i = 0; i <= 64; ++i) {
      VAR_INT_LENGTHS[i] = ((63 - i) / 7);
    }
  }

  @Override
  public void write(long value) {
    int length = VAR_INT_LENGTHS[Long.numberOfLeadingZeros(value)];
    // the last byte carries no continuation bit
    buffer[length] = (byte)(value >>> (length * 7));
    // deliberate fall-through: every case from the first match downwards
    // executes, writing the lower bytes with the continuation bit (0x80) set
    switch (length - 1) {
      case 8:
        buffer[8] = (byte)((value >>> 56) | 0x80);
      case 7:
        buffer[7] = (byte)((value >>> 49) | 0x80);
      case 6:
        buffer[6] = (byte)((value >>> 42) | 0x80);
      case 5:
        buffer[5] = (byte)((value >>> 35) | 0x80);
      case 4:
        buffer[4] = (byte)((value >>> 28) | 0x80);
      case 3:
        buffer[3] = (byte)((value >>> 21) | 0x80);
      case 2:
        buffer[2] = (byte)((value >>> 14) | 0x80);
      case 1:
        buffer[1] = (byte)((value >>> 7) | 0x80);
      case 0:
        buffer[0] = (byte)(value | 0x80);
    }
  }

@tkowalcz I haven't got round to running your benchmarks, but if you write a very sarcastic blog post you may trigger me into doing so 🤣

tkowalcz commented 3 years ago

If I ever write a blog I'm sure as hell not enabling a comment section 😂