stripe / veneur

A distributed, fault-tolerant pipeline for observability data
MIT License
1.73k stars 174 forks source link

Add support for extended clientside aggregation #1048

Closed praboud-stripe closed 1 year ago

praboud-stripe commented 1 year ago

Summary

This adds Veneur serverside support for statsd clients using extended clientside aggregation. This allows clients to pack multiple values into the same metrics packet - see the example from the docs:

Without extended clientside aggregation:

my_distribution_metric:21|d|#all,my,tags
my_distribution_metric:43.2|d|#all,my,tags
my_distribution_metric:1657|d|#all,my,tags

With extended clientside aggregation:

my_distribution_metric:21:43.2:1657|d|#all,my,tags

NB: clients still need to enable this manually; the Veneur server does not eg: negotiate a protocol version its clients to enable this, or anything fancy like that. This PR just adds support for parsing the packets produced by clients using option.

There are three main changes made by this PR:

  1. We're changing the interface of ParseMetric to take a func(*UDPMetric) callback called multiple times, instead of returning a single *UDPMetric value. Why do it this way? The other alternatives I considered were to either return a slice, or to publish the results back to a channel. Returning a slice requires allocating a slice - I'd rather avoid the extra GC churn if we can, since the caller would just immediately discard the slice after iterating over it. Publishing to a channel is tempting, but also has some problems. The caller is just publishing to a channel anyway, so it seems like this should work - but we determine which channel to publish to based on the Digest of the metric. Internalizing that routing logic into the parsing code seems like a layering violation. We could publish to an intermediate channel, but this requires an extra goroutine to shunt the metrics around. Overall, a callback is definitely a bit less ergonomic than the current pattern of directly returning a *UDPMetric, but it seems like the best option available (and we can add some test helpers to improve the ergonomics in our tests).
  2. The ParseMetric code needs to be re-ordered a bit - instead of parsing the value in essentially the order it which it arrives in the packet, now that there are multiple values, it's more convenient to parse them last, immediately before the metric is published via the callback function. This has changed the exact error message produced by some of the tests checking for invalid packets. This doesn't really matter, though, for two reasons: a) the error is just logged internally - there's no API contract around which error is returned and b) since this is a case where there's more than one thing wrong with the same packet, and it's just a question of which error we hit first, both errors are equally correct.
  3. I've also chosen to convert the ParseMetric code to call bytes.IndexByte directly, rather than using the SplitBytes helper. This isn't strictly required for the rest of this change, but it's faster. Without removing SplitBytes, there's a slight performance regression in the parsing code from adding the new multi-value logic. Net of the the SplitBytes change, this PR improves the benchmark performance of that code. I'm not that attached to this part of the PR, and can unwind it if others are opposed to this part, but I wanted to "make room" from a performance standpoint for the rest of these changes. I did do some profiling of SplitBytes itself to see if there's anything obvious we can do to improve the performance of the helper, but didn't have much luck here.

Motivation

Extended clientside aggregation avoids retransmitting the name & tag of the same metric reported multiple times; this can significantly increase the serialization performance on both the client and server, and reduces network traffic. I'd like to experiment with rolling out extended clientside aggregation to some services within Stripe, and this is the first step in doing so.

Test plan

I've added automated tests covering the new multi-value code. I've also stood up a local copy of the Veneur server & client, and verified that metrics published using extended clientside aggregation are correctly ingested by the server.

praboud-stripe commented 1 year ago

cc @bobby-stripe