mozilla / ichnaea

Mozilla Ichnaea
http://location.services.mozilla.com
Apache License 2.0
570 stars 139 forks source link

Split metrics being used as counters and timers #960

Closed jwhitlock closed 4 years ago

jwhitlock commented 4 years ago

There are some metrics that are being used as both counters and timers. This is confusing, and possibly invalid.

locate.fallback.lookup was split into locate.fallback.lookup_timing and locate.fallback.lookup_count in PR #944.

data.export.upload is also used as both a timer and a counter.

I'll add more as I find them.

jwhitlock commented 4 years ago

The request metric is also used as a timer and counter.

I've gone through the current metrics code and haven't found any more dual-use metrics.

hannosch commented 4 years ago

The metrics code started out with graphite as a backend, where the metric type was an enforced top level namespace. So repeating the type inside the metric name was redundant, see for example https://github.com/lukevenediger/statsd.net/blob/master/statsd.net/Documentation/guidance/metric-anti-patterns.md#anti-pattern-redundant-endings-such-as-timing-count-or-total

IIRC datadog instead added the type as an automatic postfix, so you had things like metric.name.count, metric.name.avg, metric.name.min etc. where the min/avg/max postfix represented the timer. Repeating the metric type would have resulted in silly looking metric.name_count.count.

The statsd protocol itself doesn’t mention this topic at all.

jwhitlock commented 4 years ago

Thanks @hannosch, that helps.

We're using InfluxDB to store metrics, and Grafana to explore and display them. I looked at locate.fallback.lookup from a few months ago, when we were using fallback lookups. The only datapoints I was able to see had the tag metric_type=counter, and there was no data with the tag metric_type=timing, despite the code recording both a counter and a timer under that name. I was able to see metric_type=timing data in datamaps, which only is being sent timer data. It appears to store it as milliseconds, as an integer.

I can't determine from the InfluxDB docs why there aren't records for both the timer and counter for locate.fallback.lookup or the other dual-use metrics. This could be a bug in InfluxDB, or the metrics client in our code, or even a limitation of Grafana. In any case, it would be easy to avoid by sending only one metric type per name.

I see two ways to proceed. One way is to add .timing to timer metric names:

A second way is to stop sending counters, and just send timers. Grafana could aggregate timer counts if a count is needed. This would require deeper changes to the code, and there may continue to be issues writing timer data to names that were traditionally counter names.

jwhitlock commented 4 years ago

Fixed in PR #962