Closed jwhitlock closed 4 years ago
The request
metric is also used as a timer and counter.
I've gone through the current metrics code and haven't found any more dual-use metrics.
The metrics code started out with graphite as a backend, where the metric type was an enforced top level namespace. So repeating the type inside the metric name was redundant, see for example https://github.com/lukevenediger/statsd.net/blob/master/statsd.net/Documentation/guidance/metric-anti-patterns.md#anti-pattern-redundant-endings-such-as-timing-count-or-total
IIRC datadog instead added the type as an automatic postfix, so you had things like metric.name.count
, metric.name.avg
, metric.name.min
etc. where the min/avg/max postfix represented the timer. Repeating the metric type would have resulted in silly looking metric.name_count.count
.
The statsd protocol itself doesn’t mention this topic at all.
Thanks @hannosch, that helps.
We're using InfluxDB to store metrics, and Grafana to explore and display them. I looked at locate.fallback.lookup
from a few months ago, when we were using fallback lookups. The only datapoints I was able to see had the tag metric_type=counter
, and there was no data with the tag metric_type=timing
, despite the code recording both a counter and a timer under that name. I was able to see metric_type=timing
data in datamaps
, which only is being sent timer data. It appears to store it as milliseconds, as an integer.
I can't determine from the InfluxDB docs why there aren't records for both the timer and counter for locate.fallback.lookup
or the other dual-use metrics. This could be a bug in InfluxDB, or the metrics client in our code, or even a limitation of Grafana. In any case, it would be easy to avoid by sending only one metric type per name.
I see two ways to proceed. One way is to add .timing
to timer metric names:
locate.fallback.lookup_count
back to locate.fallback.lookup
locate.fallback.lookup_timing
to locate.fallback.lookup.timing
data.export.upload
timer to data.export.upload.timing
request
timer to request.timing
task
and datamap
timers can keep their names, since there's no counter at the same name.A second way is to stop sending counters, and just send timers. Grafana could aggregate timer counts if a count is needed. This would require deeper changes to the code, and there may continue to be issues writing timer data to names that were traditionally counter names.
Fixed in PR #962
There are some metrics that are being used as both counters and timers. This is confusing, and possibly invalid.
locate.fallback.lookup
was split intolocate.fallback.lookup_timing
andlocate.fallback.lookup_count
in PR #944.data.export.upload
is also used as both a timer and a counter.I'll add more as I find them.