micrometer-metrics / micrometer

An application observability facade for the most popular observability tools. Think SLF4J, but for observability.
https://micrometer.io
Apache License 2.0
4.49k stars 993 forks source link

Consistently use baseTimeUnit in StatsD #5687

Open coreyoconnor opened 1 week ago

coreyoconnor commented 1 week ago

Draft: Opening for discussion with (incomplete) proposed changes.

There is a baseTimeUnit for Timers that appears to be inconsistently used. Given the definition "The base time unit of the timer to which all published metrics will be scaled" I'd expect that a value, eg 2000 sec, reported as the value 2000 with the unit seconds would be emitted to statsd as the value in the base time units, 2000 in this example. However, the actual value is 2,000,000. Which is the corresponding millisecond value for the seconds expected to be recorded.

As that explainer above is a challenge to read: I've included some (untested or even compiled) changes to illustrate.

pivotal-cla commented 1 week ago

@coreyoconnor Please sign the Contributor License Agreement!

Click here to manually synchronize the status of this Pull Request.

See the FAQ for frequently asked questions.

pivotal-cla commented 1 week ago

@coreyoconnor Thank you for signing the Contributor License Agreement!

jonatan-ivanov commented 1 week ago

I think there might be a misunderstanding since Timers does not really have a time unit from user's perspective. (They have a unit under the hood: always nanoseconds but that is not available for the users, they don't need to know about this internal detail).

The expectation is that a recorded value is converted to the unit of the registry before publication. When you record a value using a timer and a unit, the unit you used for the recording has nothing to do with the unit used for publishing. (Btw we don't recommend measuring and recording duration like that, but we recommend using Timer.Sample, see the docs.)

This means that if the unit of the registry is milliseconds, it does not matter if you record in seconds, millis or micros, everything will be converted and published in millis. StatsD happens to use milliseconds by default: https://github.com/micrometer-metrics/micrometer/blob/4d9efae6b7f3635be5e99fa089eb2851f9d9ca78/implementations/micrometer-registry-statsd/src/main/java/io/micrometer/statsd/StatsdMeterRegistry.java#L464-L467

So using your example, if you do this:

timer.record(2_000, TimeUnit.SECONDS);  // about 30 min

It will be (recorded as nanos under the hood and) published in the unit of the registry. Since StatsD uses milliseconds as the unit, you will see 2,000,000ms when published.

This is the expected behavior which is also guarded by tests and that's why that your change in AbstractTimer fails the tests. If you want to override the time unit of a registry, you can override the getBaseTimeUnit method I linked above but based on your diff it seems that millis is hardcoded in a few places in case of StatsD which we can fix but is there any backend that can use StatsD with a non-milliseconds unit?

coreyoconnor commented 3 days ago

Starting from the end as that will explain things:

If you want to override the time unit of a registry, you can override the getBaseTimeUnit method I linked above but based on your diff it seems that millis is hardcoded in a few places in case of StatsD which we can fix

indeed! I am using with getBaseTimeUnit overriden in the registery. Which I would expect to override (ha!) the time unit used. As you can tell: no such luck.

But what backend is this for? Datadog

There are several other issues where users have encountered the same unexpected behavior from datadog. I can pull these up later.

Specifically: Datadog does not support units for timers. Timers are unit-less gauges.

So, setting the units has no effect - datadog is going to accept them as unit-less and presumes you've pre-scaled them to whatever you want or manually changed the metric config in the UI.

Obviously, changing the metric config in datadog for every timer after they've been submitted is a non-starter.

So what to do? Seems to me like overriding the unit should do what is documented as: Overrides the unit used. "The base time unit of the timer to which all published metrics will be scaled". Which, if it was true, would do what is expected for datadog.

So even if there is no room for adjusting micrometer to scale the values nicely: The documentation is a bit misleading no?

jonatan-ivanov commented 11 hours ago

As I said above, it seems that "millis" is hardcoded in a few places in case of StatsD which we can fix. Please remove your change from AbstractTimer and replace hardcoded TimeUnit.MILLISECONDS to baseTimeUnit(). I think doing that should make you able to override the base unit. (I can help to write a tests for this use-case.)

Could you please also run ./gradlew format and ./gradlew build to see if your changes break the build or not? (Right now they do.)

The documentation is a bit misleading no?

If I understand this correctly, I don't think the documentation is misleading, I think the docs are ok but we have a bug: I agree with the intention of the docs but the current behavior does not follow that. :) After fixing the issue, and you still think the docs are misleading, please open an issue/PR.

coreyoconnor commented 5 hours ago

Yes, certainly: I will resolve the tests. As well as the other issues. Where would be good to document this aspect of datadog? At least enough to point somebody in the right direction. The statsd registry or config?

As you mentioned statsd works in milliseconds. So, Datadog is an exceptional case then? I'm not familiar with the standards but sounds like it! In that case, should this be a datadog specific config somehow? Or is the override of the base time unit sufficient to capture the exceptional aspect?

for reference, the related motivating tickets.

These are duplicates - just adding here for completeness

discussion about the the unitless of timers: