micrometer-metrics / micrometer

An application observability facade for the most popular observability tools. Think SLF4J, but for observability.
https://micrometer.io
Apache License 2.0
4.46k stars 989 forks source link

Wrong values with CUMULATIVE metrics in StackdriverMeterRegistry #3150

Open rafal-dudek opened 2 years ago

rafal-dudek commented 2 years ago

I spotted 2 errors in curent implementation of CUMULATIVE metrics in StackdriverMeterRegistry. It is more of a design flaw than a bug.

1 ) StartTime and EndTime are set individually for each Batch, as separate extents (with 1ms delay between start and end time). https://github.com/micrometer-metrics/micrometer/blob/main/implementations/micrometer-registry-stackdriver/src/main/java/io/micrometer/stackdriver/StackdriverMeterRegistry.java#L313-L314

But as you can read here: https://cloud.google.com/monitoring/api/ref_v3/rest/v3/projects.metricDescriptors#MetricKind https://cloud.google.com/monitoring/mql/reference#time-series https://cloud.google.com/monitoring/api/ref_v3/rest/v3/TimeSeries#Point

Cumulative measurements in a time series should have the same start time and increasing end times

Cumulative: This can be thought of as a Delta time series with overlapping extents. The constraint is that, if two points have overlapping extents, they must have the same start time.

For CUMULATIVE metrics, the start and end time should specify a non-zero interval, with subsequent points specifying the same start time and increasing end times until an event resets the cumulative value to zero and sets a new start time for the following points.

Sending metric as it is in current implementation causes Stackdriver to detect restart of metric at each new point and causes problems with e.g. rate functions. StartTime should be probably stored for each metric separately as first publish time minus 'step'.

2 ) The second problem is value of CUMULATIVE metric. As StackdriverMeterRegistry is a StepMeterRegistry it uses StepCounter for counters, StepFunctionCounter for function counters etc. After each publish, such meter is reset. And as we can read in the documentation mentioned earlier:

CUMULATIVE: A value accumulated over a time interval.

Cumulative: Thus each point represents a cumulative increase in some measured value over the last point with a common start time.

To sum up, when config.useSemanticMetricTypes() is true we should have:

Below the example how it should work: metric_1 is first send on 1:01, and mectric_2 is first send on 1:03. Both increase values by 1 on each minute. Metric 1:01 publish 1:02 publish 1:03 publish 1:04 publish 1:05 publish
metric_1 startTime 1:00 1:00 1:00 1:00 1:00
metric_1 endTime 1:01 1:02 1:03 1:04 1:05
metric_1 value 1.00 2.00 3.00 4.00 5.00
metric_2 startTime 1:02 1:02 1:02
metric_2 endTime 1:03 1:04 1:05
metric_2 value 1.00 2.00 3.00
norbertwnuk commented 2 years ago

Relates to #2232 / work originally done by @tmarszal, all the logic above can be derived from opentelemetry-operations-go written by Google itself (e.g. 1, 2)

rafal-dudek commented 1 year ago

Can we get some updates on this? Are there any plans to fix the Cumulative types for StackdriverMeterRegistry?

marcingrzejszczak commented 10 months ago

Hey @rafal-dudek ! I'm no Stackdriver expert, do you think you could write a test that would be failing with the current implementation? That would be easier to reason about I guess.

rafal-dudek commented 9 months ago

@marcingrzejszczak Here is a sample test: https://github.com/micrometer-metrics/micrometer/compare/main...rafal-dudek:micrometer:stackdriverCounterTest?expand=1 You may want to write it differently in your repo, but I hope it shows you the concept good enough.