micrometer-metrics / micrometer

An application observability facade for the most popular observability tools. Think SLF4J, but for observability.
https://micrometer.io
Apache License 2.0

Add support for sending all data points to providers #1736

Closed rsmogura closed 8 months ago

rsmogura commented 4 years ago

Hi Micrometer Team,

I worked with micrometer and Cloud Watch and would like to have a functionality to send all data points (measures) to cloud watch, instead of aggregated statistics.

There are a few points why this would be more beneficial for me:

I wonder if such an approach is aligned with Micrometer's approach.

Generally, I was thinking about the following changes:

  1. Add a general configuration option for whether all data points should be collected by a given Meter.
  2. For a given meter, add a configuration option to decide whether statistics (measures) should be sent through the registry.
  3. Add a configuration flag to collect data points (in a similar way to how it's done for histograms).
  4. Allow filtering of the above flags per registry.
  5. Add a global configuration option for when data points should be flushed. Justification: we can collect thousands of data points per second, and those data points can cause an OOM, so the flush should be triggered by the number of data points instead of by time.

Registries wanting to store data points instead of aggregated measures could update themselves to do this.
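To make the proposal concrete, here is a hypothetical sketch of points 1 and 5 as a config interface. None of these keys or defaults exist in Micrometer today; they only illustrate the shape of the suggestion.

```java
// Hypothetical sketch of the proposed configuration; the names and defaults
// are illustrative, not part of Micrometer.
interface RawDataPointConfig {
    // point 1: send every recorded value instead of aggregated statistics
    default boolean sendRawDataPoints() {
        return false;
    }

    // point 5: flush once this many data points have accumulated, so a burst
    // of recordings cannot cause an OOM before the time-based flush fires
    default int dataPointFlushThreshold() {
        return 10_000;
    }
}
```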

shakuzen commented 4 years ago

Sending each measurement does not seem aligned with the goal of reducing cost. As far as I understand[0], CloudWatch charges per API call to store metrics, and sending each individual measurement would drastically increase the API calls to CloudWatch.

Other CloudWatch users limit the metrics sent with MeterFilters or by not binding MeterBinders for metrics they are not interested in.
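That filtering approach can be expressed with Micrometer's MeterFilter API. In this sketch a SimpleMeterRegistry stands in for a CloudWatch registry, and the metric name prefixes are illustrative:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class LimitMetricsExample {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // Drop everything under "jvm" so it is never registered or published
        registry.config().meterFilter(MeterFilter.denyNameStartsWith("jvm"));

        registry.counter("jvm.gc.pause").increment();          // filtered out (no-op meter)
        registry.counter("http.server.requests").increment();  // kept

        System.out.println(registry.getMeters().size());
    }
}
```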

It also adds significant overhead to send each measurement and the benefit isn't clear. Even if CloudWatch provides functions to aggregate, why is it beneficial to use those on individual measurements instead of summary data?

We generally want to avoid allowing inefficient configurations unless required by the metrics system. Let me know if I'm missing something, though.


[0] Based on the custom metrics example in the CloudWatch pricing documentation

codemedian commented 3 years ago

Hey folks,

I also need this and have taken a stab at it recently, as I have a use-case where I need pXX stats across instances of a service, and the existing way of publishing histograms doesn't quite work for this (at least I can't think of a way to achieve it using CW).

The only way that would work is to publish histogram data in a single putMetricData request when high-resolution metrics are enabled, based on this link.

Some thoughts

Personally I'm leaning towards implementing both options, controlled with a config option and potentially with a suffix `.approx` on the histogram-based metric.

I've got a [working prototype in my fork](https://github.com/codemedian/micrometer/compare/master...codemedian:feature/cloudwatch-statistics-set?expand=1) but am not sure about the metric naming or how the config/defaults should behave.

Additionally (I can't find the ticket now, but I remember reading about it), I see this working as a whole as follows:

1. Create a low-res Micrometer registry for all metrics (publishes min/max/avg etc.) as statistics sets.
2. Create a second high-res registry with a metrics filter, and use that MeterFilter to add a DistributionStatisticConfig to only the desired metrics, which will then be published using the above-mentioned way.
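A sketch of that two-registry split, using SimpleMeterRegistry in place of the two CloudWatch registries (the `http.server` prefix is illustrative; overriding `MeterFilter.configure` is the standard way to attach a DistributionStatisticConfig to selected meters):

```java
import io.micrometer.core.instrument.Meter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.core.instrument.distribution.DistributionStatisticConfig;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class TwoRegistryExample {
    public static void main(String[] args) {
        // 1) low-res registry: all metrics, published as statistics sets
        MeterRegistry lowRes = new SimpleMeterRegistry();

        // 2) high-res registry: only selected metrics, with histogram data
        MeterRegistry highRes = new SimpleMeterRegistry();
        highRes.config()
                .meterFilter(MeterFilter.denyUnless(id -> id.getName().startsWith("http.server")))
                .meterFilter(new MeterFilter() {
                    @Override
                    public DistributionStatisticConfig configure(Meter.Id id,
                                                                 DistributionStatisticConfig config) {
                        // enable bucketed histograms for the meters that survive the filter
                        return DistributionStatisticConfig.builder()
                                .percentilesHistogram(true)
                                .build()
                                .merge(config);
                    }
                });

        highRes.timer("http.server.requests").record(java.time.Duration.ofMillis(42));
    }
}
```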

jonatan-ivanov commented 3 years ago

@codemedian Would you mind sharing your use-case with us? We might be able to give alternative ideas.

codemedian commented 3 years ago

Certainly.

I have a number of instances of one REST service running and am defining SLAs based on p99 on API call latency. I don't particularly care about which instance is having issues.

One way of approaching this would be to publish the p99 of each service as a client side percentile histogram and using a SEARCH and MATH function combine it into a single series. This however has the problem that you can't set CloudWatch alarms on Search results.
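For reference, the SEARCH-plus-math combination being described looks roughly like this as CloudWatch metric math (the namespace and metric name are illustrative, and as noted, alarms cannot be created on the SEARCH result):

```
# e1: one p99 series per instance, discovered by search
e1 = SEARCH('{MyApp,instance} MetricName="http.server.requests.p99"', 'Average', 60)
# e2: a single combined series across all instances
e2 = AVG(e1)
```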

So, in order to get min/max/avg aggregated across instances, I dropped the instance tag from the metrics, which requires the statistics set PR I raised to work properly for CW to understand how to aggregate the different metrics.
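Dropping the per-instance dimension can be done with a MeterFilter (`ignoreTags` is real Micrometer API; the tag and metric names here are illustrative):

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tags;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class DropInstanceTagExample {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        // Strip the instance dimension so the backend aggregates across instances
        registry.config().meterFilter(MeterFilter.ignoreTags("instance"));

        registry.counter("http.server.requests",
                Tags.of("instance", "i-0abc123", "uri", "/api")).increment();

        // The meter ends up registered as http.server.requests{uri="/api"}
        System.out.println(registry.getMeters().get(0).getId().getTags());
    }
}
```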

In the same fashion I now want to publish percentiles on all instances, yet without an instance label. The only way I can think of this working is to let CW do the percentile calc, i.e. publishing histogram data directly. Either every single value/count combination for the time window, or alternatively an aggregate bucket based histogram (which is what I've attempted in my branch)

Added benefit of letting Cloudwatch do the percentile calc is that the pXX options in the UI will work as expected

Hope that makes sense :)

rsmogura commented 3 years ago

Avg of avg from three instances is different than average of values from all three instances. Micrometer can’t handle this right now.

On the other hand, each pseudo-statistic metric costs money.

That's why we decided to store values in CloudWatch on our own.

codemedian commented 3 years ago

> Avg of avg from three instances is different than average of values from all three instances. Micrometer can't handle this right now.

For avg of avg I believe my PR is of interest to you. Statistics sets allow you to publish client-side aggregates and still preserve the CloudWatch functionality you're referring to, while reducing the number of metrics down to one. A small caveat is that as of now this only works for Timers and DistributionSummaries.

edit: took a quick look at CW pricing, and the cost of one metric for a month is roughly equivalent to one putMetricData call per minute (with the latter losing out narrowly). So a timer using the current MM approach (4 metrics) would cost you ~$0.50 per month. Alternatively you could have one metric and call put 4 times a minute for roughly the same price, meaning that if you have fewer than 150 unique measurements per 15 seconds, then your approach is cheaper (unless my maths is off :D)

rsmogura commented 3 years ago

What about percentiles?

codemedian commented 3 years ago

That's what I'm working on (see my first comment and the branch linked) but not yet sure how to proceed.

If you want to be CloudWatch-idiomatic, it boils down to either publishing all data points or publishing a histogram approximation, and marking the metric as a high-resolution one.

I'm currently leaning towards supporting both with a config switch on the registry but any input/feedback appreciated
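The histogram-approximation option maps onto the values/counts form of PutMetricData in the AWS SDK v2. This is only a sketch: the namespace, metric name, and bucket values are illustrative, and `storageResolution(1)` is what marks the datum as high-resolution.

```java
import java.util.List;
import software.amazon.awssdk.services.cloudwatch.model.MetricDatum;
import software.amazon.awssdk.services.cloudwatch.model.PutMetricDataRequest;
import software.amazon.awssdk.services.cloudwatch.model.StandardUnit;

public class HistogramDatumExample {
    public static void main(String[] args) {
        // One datum carrying the whole client-side histogram as value/count pairs
        MetricDatum datum = MetricDatum.builder()
                .metricName("http.server.requests")
                .unit(StandardUnit.MILLISECONDS)
                .values(List.of(5.0, 10.0, 50.0, 100.0)) // representative bucket values
                .counts(List.of(120.0, 80.0, 9.0, 1.0))  // observations per bucket
                .storageResolution(1)                    // 1 second = high-resolution
                .build();

        PutMetricDataRequest request = PutMetricDataRequest.builder()
                .namespace("MyApp")
                .metricData(datum)
                .build();
        // request would then be sent via a CloudWatchClient
    }
}
```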

rsmogura commented 3 years ago

I think it's impossible to provide percentiles with statistics sets, according to the CloudWatch documentation.

Generally, you push all data points to CloudWatch or you lose the functionality you pay for.

codemedian commented 3 years ago

I don't understand why you're referring to statistics sets here; I have made no such claim. That being said, technically pXX works with statistics sets IFF certain conditions hold true; see the documentation.

I do agree with you that every data point needs to be published for percentiles to work properly; however, I think there is value in (maybe additionally) having client-side histogram buckets and using those as an approximation of the data points.

rsmogura commented 3 years ago

Sir, I think you mentioned statistics sets.

My point is different: why pay for CloudWatch and then self-limit to only 10% of the functionality?

codemedian commented 3 years ago

😅 Misunderstanding then.

Re statistics sets, I was referring to min/max/avg for client-side aggregates, which gets you 100% functionality for normal, low-res metrics (which do not support percentiles in CW), from what I can tell.

For high-resolution metrics and percentiles I 100% agree with you. It's also my preferred approach, and I'll see if I can fit it into the current registry/behaviour in any way.

github-actions[bot] commented 8 months ago

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

github-actions[bot] commented 8 months ago

Closing due to lack of requested feedback. If you would like us to look at this issue, please provide the requested information and we will re-open.