Closed achille-roussel closed 7 years ago
First off, the idea of using structs with struct tags to define metrics just makes sense to me.
I don't think we should lock ourselves to the current API for a change this big (granted I haven't looked at the code in depth yet, so can't say for sure).
For batching, are we doing synchronous writes to the stats package? Or does it write to a buffered channel and return immediately? If we're not doing async writes already, that would drastically help in the short term without an API change. We could then batch inside the stats package itself (unless I'm misunderstanding some bit of the current implementation).
I do have worries about using a struct for metrics though, such as the points quoted below. Otherwise, it looks like the logical next step for metrics in Go. Let's do it!
> 0 values (obviously, can possibly be solved with tags)
This is a good point; something like the `omitempty` option of `encoding/json` seems like it would be the easiest way to address it.
> Setting the same metric twice
Do you mean tracking multiple values of a histogram within a single struct, for example? There are different ways we can support this, such as using separate fields with distinct metric names (`header.bytes` and `body.bytes`, for example).

> Dynamic tags (sometimes you just can't avoid them)
Yes, here are a couple of ideas I have: setting tags at the engine level (`stats.Engine.WithTags`), and variadic `[]Tag` arguments that are used to pass dynamic lists of tags.

> Do you mean tracking multiple values of a histogram within a single struct for example?
Effectively, yeah. How do you measure the same value (metric name + tags all equal) twice before submitting it to the client library? Slices seem plausible, but perhaps cumbersome.
Similarly, for the dynamic tags problem, we might have an issue submitting metrics using this pattern:

```go
func (mc *MetricsClient) Submit(metrics interface{}, tags ...Tag) {
	// which fields to apply the tags to?
}
```
On your suggestion to use a channel to do async writes: the stats package used to be designed this way, but publishing to a channel is actually more expensive than serializing a metric (~1-10us to publish to a channel vs ~300ns for serialization), so we moved away from it to the current design.
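The gap can be reproduced with a rough micro-benchmark along these lines. This is only a sketch, not the package's actual benchmark suite, and the absolute numbers depend heavily on the machine:

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

// benchChannel measures the cost of publishing a value to a buffered channel
// with a goroutine draining it, roughly modeling the old async design.
func benchChannel(b *testing.B) {
	ch := make(chan int64, 1024)
	go func() {
		for range ch {
		}
	}()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		ch <- int64(i)
	}
	close(ch)
}

// benchSerialize measures the cost of serializing a value directly into a
// reused buffer, roughly modeling the current synchronous design.
func benchSerialize(b *testing.B) {
	buf := make([]byte, 0, 64)
	for i := 0; i < b.N; i++ {
		buf = strconv.AppendInt(buf[:0], int64(i), 10)
	}
	_ = buf
}

func main() {
	// testing.Benchmark lets us run benchmark functions outside `go test`.
	fmt.Println("channel:  ", testing.Benchmark(benchChannel))
	fmt.Println("serialize:", testing.Benchmark(benchSerialize))
}
```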
> Effectively, yeah. How do you measure the same value (metric name + tags all equal) twice before submitting it to the client library? Slices seem plausible, but perhaps cumbersome.
The API has to accept an `interface{}` to support all metric struct types, so it's pretty simple to accept an array or a slice of structs. Embedding the array/slice within the metric struct is also possible.
> Similarly, for the dynamic tags problem, we might have an issue submitting metrics using this pattern:
>
> ```go
> func (mc *MetricsClient) Submit(metrics interface{}, tags ...Tag) { /* which fields to apply the tags to? */ }
> ```
In this case the tags would apply to everything. You make a good point that we'll probably need finer-grained control, so embedding tag slices in the struct seems like the way to go, at the expense of having to sort the tags. Sorting would be costly, but it should be fine as long as it doesn't happen for every metric (in my experience you know the tag names you want to set most of the time).
Adding to the list of issues that I noticed: the cost of serializing floating point values (`strconv.AppendFloat`). Having an API that supports both integer and floating point types will help address this issue.

I like the idea of a subpackage, so we can experiment with this first before making the shift.
Alright here are some numbers about the experiment embedding batches of metrics in struct types:
- GOMAXPROCS=1

```
BenchmarkEngine/discard/Engine.ReportAt(struct)         10000000    122 ns/op
BenchmarkEngine/discard/Engine.ReportAt(struct:large)   10000000    217 ns/op
BenchmarkEngine/datadog/Engine.ReportAt(struct)         10000000    226 ns/op
BenchmarkEngine/datadog/Engine.ReportAt(struct:large)    2000000    599 ns/op
```

- GOMAXPROCS=10

```
BenchmarkEngine/discard/Engine.ReportAt(struct)-10         20000000    66.0 ns/op
BenchmarkEngine/discard/Engine.ReportAt(struct:large)-10   20000000    113 ns/op
BenchmarkEngine/datadog/Engine.ReportAt(struct)-10         10000000    144 ns/op
BenchmarkEngine/datadog/Engine.ReportAt(struct:large)-10    5000000    313 ns/op
```
The _struct_ benchmark reports a struct with one metric; the _struct:large_ benchmark reports a struct with 10 fields. The cost of handling extra metrics increases by ~10ns per metric and the cost of serializing by ~20ns per metric, so in total we can account for ~30ns of CPU time spent on each metric that gets serialized.
This is pretty satisfying to me when we look at the numbers from the current stats package:
- GOMAXPROCS=1
```
BenchmarkEngine/discard/Engine.Add.1x    20000000     64.3 ns/op    0 B/op    0 allocs/op
BenchmarkEngine/discard/Engine.Add.10x    2000000    640 ns/op      0 B/op    0 allocs/op
BenchmarkEngine/datadog/Engine.Add.1x    10000000    195 ns/op      0 B/op    0 allocs/op
BenchmarkEngine/datadog/Engine.Add.10x    1000000   1888 ns/op      0 B/op    0 allocs/op
```
- GOMAXPROCS=10
```
BenchmarkEngine/discard/Engine.Add.1x-10    20000000     68.5 ns/op    0 B/op    0 allocs/op
BenchmarkEngine/discard/Engine.Add.10x-10    2000000    682 ns/op      0 B/op    0 allocs/op
BenchmarkEngine/datadog/Engine.Add.1x-10    10000000    200 ns/op      0 B/op    0 allocs/op
BenchmarkEngine/datadog/Engine.Add.10x-10    1000000   1843 ns/op      0 B/op    0 allocs/op
```
We can clearly see that the cost increases linearly here.
I'm pretty excited about getting this code into production. I'd like your feedback on how to proceed; here are a couple of options I thought about:
1. Publish it as a different repository. The upside is that there's no problem with backward compatibility, but I'm worried about splitting the effort and the increased maintenance cost of having two repositories that do pretty much the same thing.
2. Break the stats package API to support the new design. Honestly, there isn't much that needs to change for the commonly used types and methods; it would mostly be the various handler types and a cleanup of some abstractions that don't make sense anymore.
3. Extend the current stats package with the new APIs. In this case I'm mostly worried about making the package really messy, hard to use, document, and test (there would be 3+ ways of doing one thing).
I tend to like option (2) better for a few reasons:
- `stats` is a short package name which communicates well the purpose of the code
- it keeps the API to its minimal form (one design, one repo)
- with proper guidelines on how to do the migration it'll benefit all services we have that depend on it
Please let me know what you think!
We use govendor in most places, so I don't mind you going with (2) :)
It'd be nice to see a comparison of the new proposed API vs the old one!
Here are more details about what needs to be changed:
The `stats.Handler` interface needs to be changed to support the batched API, which means adapting the datadog/influxdb/prometheus clients as well (replacing or removing things that don't make sense anymore). I don't expect this to be a problem, because all the code I know about uses the stats package and doesn't deal directly with the clients for each system. The new handler interface looks like this:
```go
type Handler interface {
	HandleMeasures(time time.Time, measures ...Measure)
}
```
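For illustration, a trivial handler implementing this interface could look like the following; the `Measure` type here is a simplified stand-in for whatever the real measure type ends up carrying (it would also include tags and field metadata):

```go
package main

import (
	"fmt"
	"time"
)

// Measure is a simplified stand-in for the measure type the new interface
// would carry.
type Measure struct {
	Name  string
	Value float64
}

// Handler is the proposed batched interface: one call per batch of measures
// instead of one call per metric.
type Handler interface {
	HandleMeasures(time time.Time, measures ...Measure)
}

// formatMeasure renders a single measure as a name=value string.
func formatMeasure(m Measure) string {
	return fmt.Sprintf("%s=%v", m.Name, m.Value)
}

// printHandler is a trivial Handler that writes each measure to stdout,
// illustrating how a backend client would consume batches.
type printHandler struct{}

func (printHandler) HandleMeasures(t time.Time, measures ...Measure) {
	for _, m := range measures {
		fmt.Println(formatMeasure(m))
	}
}

func main() {
	var h Handler = printHandler{}
	h.HandleMeasures(time.Now(),
		Measure{Name: "requests.count", Value: 1},
		Measure{Name: "requests.rtt", Value: 0.25},
	)
}
```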
Getting rid of the `stats.Counter`, `stats.Gauge`, `stats.Histogram`, `stats.Timer`, and `stats.Clock` types. I introduced them hoping that caching some information in those structs would enable performance improvements, but the gains turned out to be small or nonexistent, so I'd rather clean up the API to keep the focus on the preferred way of reporting metrics.
I plan on keeping the `Incr`/`Add`/`Set`/`Observe`/`ObserveDuration` functions; all the code I'm aware of depends on them. They are less efficient than going through the batch API, but they are very convenient, and there are cases where performance isn't a concern. Adoption of the new version would be pretty painful if we removed them.
The Engine will get new methods:

```go
func (eng *Engine) Report(metrics interface{}, tags ...Tag) { ... }
func (eng *Engine) ReportAt(time time.Time, metrics interface{}, tags ...Tag) { ... }
```
The metrics argument can be a struct carrying a batch of metrics, or a slice or array of those.
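A sketch of what a call could look like; `Engine`, `Tag`, and `queryMetrics` are minimal stand-ins invented here, just enough to show the shape of the calls:

```go
package main

import (
	"fmt"
	"time"
)

// Tag and Engine are minimal stand-ins for the real stats types.
type Tag struct{ Name, Value string }

type Engine struct{}

// describe records how ReportAt was called; the real implementation would
// walk the metrics value with reflection and serialize each field.
func describe(metrics interface{}, tags []Tag) string {
	return fmt.Sprintf("reporting %T with %d tag(s)", metrics, len(tags))
}

func (eng *Engine) ReportAt(t time.Time, metrics interface{}, tags ...Tag) {
	fmt.Println(describe(metrics, tags))
}

func (eng *Engine) Report(metrics interface{}, tags ...Tag) {
	eng.ReportAt(time.Now(), metrics, tags...)
}

// queryMetrics is a hypothetical batch: one counter and one latency
// histogram reported in a single call.
type queryMetrics struct {
	Count int64         `metric:"query.count" type:"counter"`
	Time  time.Duration `metric:"query.time" type:"histogram"`
}

func main() {
	m := queryMetrics{Count: 1, Time: 100 * time.Millisecond}
	(&Engine{}).Report(m, Tag{"table", "users"})
}
```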
- The sub-packages providing convenience wrappers for collecting metrics in various situations (httpstats/netstats/procstats) should be left unchanged as well. The implementations will be revisited to support the new batching API though, that's kind of the point of all of this ;)
I'm open to other suggestions, now is a good time to think of any other improvements we want to make.
I'm finding it harder and harder to have metric collection not be the bottleneck in some high-performance applications that I'm working on. I want to open the discussion on a new design to make metric collection 10 to 100x faster while keeping high-level abstractions to easily instrument applications.
Here are the most common issues:
Here are a couple of ideas I have on what we can do:
Batches of metrics can be efficiently represented by struct types like:
There are a couple of reasons why I think this approach can be more powerful:
(`encoding/json` and `github.com/segmentio/objconv`, for example). This would make it possible to generate highly efficient ways to render the sorted list of tags. It also means that we can do more optimizations, like generating efficient hash tables to store and aggregate the metrics before flushing them to the network.

Now this is a pretty big shift from the way we've been doing instrumentation, so I have a couple of questions:
Let me know!