segmentio / stats

Go package for abstracting stats collection
https://godoc.org/github.com/segmentio/stats
MIT License

Proposal for a new design #51

Closed: achille-roussel closed this issue 7 years ago

achille-roussel commented 7 years ago

I'm finding it harder and harder to have metric collection not be the bottleneck in some high-performance applications that I'm working on. I want to open the discussion on a new design to make metric collection 10 to 100x faster while keeping high-level abstractions to easily instrument applications.

Here are the most common issues:

Here are a couple of ideas I have on what we can do:

Now this is a pretty big shift from the way we've been doing instrumentation, so I have a couple of questions:

Let me know!

abraithwaite commented 7 years ago

First off, the idea of using structs with struct tags to define metrics just makes sense to me.
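To make the idea concrete, here is a minimal sketch of what a struct-tag-based metric batch could look like and how an engine could discover its fields via reflection. The `metric` and `type` tag names, the `FuncStats` type, and the `metricNames` helper are all assumptions for illustration, not the package's actual API.

```go
package main

import (
	"fmt"
	"reflect"
	"time"
)

// FuncStats is a hypothetical batch of metrics describing one code path.
type FuncStats struct {
	Calls  int           `metric:"func.calls"  type:"counter"`
	Errors int           `metric:"func.errors" type:"counter"`
	Time   time.Duration `metric:"func.time"   type:"histogram"`
}

// metricNames walks the struct tags with reflection, the way a stats
// engine could discover which fields to report.
func metricNames(v interface{}) []string {
	t := reflect.TypeOf(v)
	names := make([]string, 0, t.NumField())
	for i := 0; i < t.NumField(); i++ {
		if name := t.Field(i).Tag.Get("metric"); name != "" {
			names = append(names, name)
		}
	}
	return names
}

func main() {
	fmt.Println(metricNames(FuncStats{})) // [func.calls func.errors func.time]
}
```

The reflection walk only has to happen once per type; the result can be cached, which is part of what makes the batched design cheap.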

I don't think we should lock ourselves to the current API for a change this big (granted I haven't looked at the code in depth yet, so can't say for sure).

For batching, are we doing synchronous writes to the stats package? Or does it write to a buffered channel and return immediately? If we're not doing async writes already, that would drastically help in the short term without an API change. We could then batch inside the stats package itself (unless I'm misunderstanding some bit of the current implementation).

I do have some worries about using a struct for metrics though, such as:

Otherwise, it looks like the logical next step for metrics in golang. Let's do it!

achille-roussel commented 7 years ago

0 values (obviously, can possibly be solved with tags)

This is a good point; something like the omitempty of encoding/json seems like it would be the easiest way to address it.
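A sketch of how an omitempty-style tag option could skip zero-valued fields, mirroring encoding/json's behavior. The `metric:"...,omitempty"` syntax, the `HTTPStats` type, and the `reportable` helper are illustrative assumptions:

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// HTTPStats is a hypothetical metric batch with omitempty-tagged fields.
type HTTPStats struct {
	Requests int `metric:"http.requests,omitempty"`
	Errors   int `metric:"http.errors,omitempty"`
}

// reportable returns the metric names that would actually be reported,
// skipping zero values for fields tagged ",omitempty" the way
// encoding/json does when marshaling.
func reportable(v interface{}) []string {
	rv := reflect.ValueOf(v)
	rt := rv.Type()
	var names []string
	for i := 0; i < rt.NumField(); i++ {
		tag := rt.Field(i).Tag.Get("metric")
		name, opts, _ := strings.Cut(tag, ",")
		if strings.Contains(opts, "omitempty") && rv.Field(i).IsZero() {
			continue
		}
		names = append(names, name)
	}
	return names
}

func main() {
	// Errors is zero, so only http.requests is reported.
	fmt.Println(reportable(HTTPStats{Requests: 10})) // [http.requests]
}
```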

Setting the same metric twice

Do you mean tracking multiple values of a histogram within a single struct, for example? There are different ways we can support this:

Dynamic tags (sometimes you just can't avoid them)

Yes, here are a couple of ideas I have:

abraithwaite commented 7 years ago

Do you mean tracking multiple values of a histogram within a single struct, for example?

Effectively, yeah. How do you measure the same value (metric name + tags all equal) twice before submitting it to the client library? Slices seem plausible, but perhaps cumbersome.

Similarly for the dynamic tags problem, we might have issues submitting metrics using this pattern.

func (mc *MetricsClient) Submit(metrics interface{}, tags ...Tag) {
    // which fields to apply the tags to?
}
achille-roussel commented 7 years ago

On your suggestion to use a channel to do async writes: the stats package used to be designed this way, but publishing to a channel is actually more expensive than serializing a metric (~1-10us to publish to a channel vs ~300ns for serialization), so we moved away from it to the current design.
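The comparison is easy to reproduce with Go's built-in benchmarking. Below is a minimal, self-contained sketch (not the stats package's actual benchmark) that pits a direct synchronous call, standing in for inline serialization, against handing the value to a background goroutine over a buffered channel; exact numbers vary by machine, but the channel path is consistently the more expensive one:

```go
package main

import (
	"fmt"
	"testing"
)

var sink int

// serialize stands in for direct, synchronous serialization of a metric.
func serialize(v int) int { return v*2 + 1 }

// benchDirect measures the cost of calling the serializer inline.
func benchDirect() testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink += serialize(i)
		}
	})
}

// benchChannel measures the cost of handing each value to a background
// goroutine through a buffered channel.
func benchChannel() testing.BenchmarkResult {
	ch := make(chan int, 1024)
	done := make(chan struct{})
	go func() {
		for range ch {
		}
		close(done)
	}()
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			ch <- i
		}
	})
	close(ch)
	<-done
	return res
}

func main() {
	fmt.Printf("direct call:  %v\n", benchDirect())
	fmt.Printf("channel send: %v\n", benchChannel())
}
```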

achille-roussel commented 7 years ago

Effectively, yeah. How do you measure the same value (metric name + tags all equal) twice before submitting it to the client library? Slices seem plausible, but perhaps cumbersome.

The API has to accept an interface{} to support all metric struct types, so it's pretty simple to accept an array or a slice of structs. Embedding the array/slice within the metric struct is also possible.
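A quick sketch of what accepting either a single struct or a slice/array through an interface{} parameter looks like; the `CounterBatch` type and `batchLen` helper are hypothetical names for illustration:

```go
package main

import (
	"fmt"
	"reflect"
)

// CounterBatch is a hypothetical metric struct.
type CounterBatch struct {
	Hits int `metric:"hits"`
}

// batchLen reports how many metric structs a Report-style call received,
// accepting a single struct, a slice, or an array through interface{}.
func batchLen(v interface{}) int {
	switch rv := reflect.ValueOf(v); rv.Kind() {
	case reflect.Slice, reflect.Array:
		return rv.Len()
	default:
		return 1
	}
}

func main() {
	fmt.Println(batchLen(CounterBatch{}))                // 1
	fmt.Println(batchLen([]CounterBatch{{1}, {2}, {3}})) // 3
}
```

This is how the same metric (same name and tags) can be measured several times before submission: collect the values into a slice and hand the whole slice to one call.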

Similarly for the dynamic tags problem, we might have issues submitting metrics using this pattern.

func (mc *MetricsClient) Submit(metrics interface{}, tags ...Tag) {
    // which fields to apply the tags to?
}

In this case the tags would apply to everything. You make a good point that we'll probably need finer-grained control, so embedding tag slices in the struct seems like the way to go. The expense is having to sort the tags, which would be costly, but that should be fine as long as it doesn't happen on every metric (in my experience you know the tag names you want to set most of the time).
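An embedded tag slice with the sort step could look like the following; the `Tag` and `RequestMetrics` shapes are assumptions about how per-batch dynamic tags might be modeled, not the package's real types:

```go
package main

import (
	"fmt"
	"sort"
)

// Tag is a name/value pair, sketched here to mirror the package's tags.
type Tag struct{ Name, Value string }

// RequestMetrics embeds a dynamic tag slice that applies to every
// metric field in the batch.
type RequestMetrics struct {
	Tags  []Tag // dynamic tags, applied to the whole batch
	Count int
}

// sortTags orders tags by name so that equal tag sets always serialize
// identically; this is the sort cost mentioned above.
func sortTags(tags []Tag) {
	sort.Slice(tags, func(i, j int) bool { return tags[i].Name < tags[j].Name })
}

func main() {
	m := RequestMetrics{
		Tags:  []Tag{{"path", "/users"}, {"method", "GET"}},
		Count: 1,
	}
	sortTags(m.Tags)
	fmt.Println(m.Tags) // [{method GET} {path /users}]
}
```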

achille-roussel commented 7 years ago

Adding to the list of issues that I noticed:

yields commented 7 years ago

I like the idea of a subpackage, so we can experiment with this first before making the shift.

achille-roussel commented 7 years ago

Alright here are some numbers about the experiment embedding batches of metrics in struct types:

The _struct_ benchmark reports a struct with one metric, the _struct:large_ benchmark reports a struct with 10 fields. The cost of handling extra metrics increases by ~10ns per metric and the cost of serializing increases by ~20ns per metric, so in total we can account for ~30ns of CPU time spent on each metric getting serialized.
This is pretty satisfying to me when we look at the numbers from the current stats package:
- GOMAXPROCS=1

BenchmarkEngine/discard/Engine.Add.1x     20000000    64.3 ns/op    0 B/op    0 allocs/op
BenchmarkEngine/discard/Engine.Add.10x     2000000     640 ns/op    0 B/op    0 allocs/op
BenchmarkEngine/datadog/Engine.Add.1x     10000000     195 ns/op    0 B/op    0 allocs/op
BenchmarkEngine/datadog/Engine.Add.10x     1000000    1888 ns/op    0 B/op    0 allocs/op

- GOMAXPROCS=10

BenchmarkEngine/discard/Engine.Add.1x-10     20000000    68.5 ns/op    0 B/op    0 allocs/op
BenchmarkEngine/discard/Engine.Add.10x-10     2000000     682 ns/op    0 B/op    0 allocs/op
BenchmarkEngine/datadog/Engine.Add.1x-10     10000000     200 ns/op    0 B/op    0 allocs/op
BenchmarkEngine/datadog/Engine.Add.10x-10     1000000    1843 ns/op    0 B/op    0 allocs/op


We can clearly see that the cost increases linearly here.

I'm pretty excited about getting this code into production, I'd like your feedback on how to proceed, here are a couple of options I thought about:
1. Publish it as a different repository. The upside is there's no problem with backward compatibility, but I'm worried about splitting the effort and the increased maintenance cost of having two repositories that do pretty much the same thing.
2. Break the stats package API to support the new design. Honestly, there isn't much that needs to change for the commonly used types and methods; it would mostly be the various handler types and a cleanup of some abstractions that don't make sense anymore.
3. Extend the current stats package with the new APIs, in this case I'm mostly worried about making the package really messy, hard to use, document, and test (there would be 3+ ways of doing one thing).

I tend to like option (2) better for a few reasons:
- `stats` is a short package name that communicates the purpose of the code well
- it keeps the API to its minimal form (one design, one repo)
- with proper guidelines on how to do the migration it'll benefit all services we have that depend on it

Please let me know what you think!
yields commented 7 years ago

We use govendor in most places, so I don't mind you going with 2 :)

It'd be nice to see a comparison of the new proposed API vs the old one!

achille-roussel commented 7 years ago

Here are more details about what needs to be changed:

func (eng *Engine) ReportAt(t time.Time, metrics interface{}, tags ...Tag) { ... }


The metrics argument can be a struct carrying a batch of metrics, or a slice or array of those.

- The sub-packages providing convenience wrappers for collecting metrics in various situations (httpstats/netstats/procstats) should be left unchanged as well. The implementations will be revisited to support the new batching API though, that's kind of the point of all of this ;)

I'm open to other suggestions, now is a good time to think of any other improvements we want to make.