micrometer-metrics / micrometer

An application observability facade for the most popular observability tools. Think SLF4J, but for observability.
https://micrometer.io
Apache License 2.0

Associate multiple values with the same tag key #1432

Open cuneytcalishkan opened 5 years ago

cuneytcalishkan commented 5 years ago

Hello,

We are currently using DataDog for reporting our metrics and wanted to replace our client with Micrometer and StatsD using the DATADOG flavor. However, we found out that a use case currently working with DataDog is not supported in Micrometer. DataDog allows tags to be either a single value or in key:value format in the reports; see the documentation.

This also allows us to report a metric with the same key repeated with different values, e.g. metric_key:metric_value_1, metric_key:metric_value_2.

I have come across issue #1170. In our case, we receive a list of values as input and we want to report these values individually with the same tag key, which in turn allows us to use graphs like sum/avg by metric_key.

I am well aware that not all reporting APIs support this, but perhaps the individual API integrations should take this use case into account where it is valid, as it is for DataDog.

Best regards, Cuneyt

checketts commented 5 years ago

You can get this benefit by using multiple counters.

for (String input : inputs) {
    registry.counter("my.metric", Tags.of("metric_key", input)).increment();
}

Is there any reason this wouldn't work? (Unless you are trying to tie this metric to some sort of requests total and this action would inflate that count, in which case I would recommend creating a separate counter to track that.)

cuneytcalishkan commented 5 years ago

For a single key that might be a solution. However, imagine a case where you add different tags over the lifetime of a request and the metric is reported only once at the end, e.g. service_1 is called for metric_value_1 and service_2 is called for metric_value_2. In that case, we end up with multiple tag keys that each have multiple values, which means an exponential number of metrics to report instead of only one.

Another example where we cannot use the suggested approach:

  1. We receive a request with metric_value_1 and measure the time as t1 (nominal case).
  2. We receive a request with metric_value_2 and measure the time as t2 (nominal case).
  3. We receive a request with both metric_value_1 and metric_value_2 and measure the time as t3. Here t3 doesn't correspond to the time spent serving metric_value_1 or metric_value_2 alone but to both of them combined, so the time reported with the suggested approach would be wrong.

shakuzen commented 5 years ago

we end up with multiple tag keys that have multiple values which means an exponential number of metrics to report instead of only one.

That is how tags work if you keep making unique combinations of key/values. There is one time series for each unique metric name/type and tag key/value. Is the difference that you're expecting previous tag key/values to not be reported if they haven't been recorded recently?

Otherwise I'm not understanding the issue here. You can make metrics with the same tag name and different tag values using the current API already. Could you illustrate the issue with a code sample to make it more clear? Note that the metric recording API is not coupled to the reporting mechanism, which changes depending on the backend.

cuneytcalishkan commented 5 years ago

Hello,

I think the use case we are facing is pretty simple, so no need for a code sample. We would like to be able to report a metric with the following tags: key:value_1, key:value_2. This is currently possible with the DataDog client, where you don't need to provide a key/value pair at all; but if the string you provide includes a : (colon), the left-hand side of the string is treated as the key and the right-hand side as the value.

Currently this is not supported because the dedup method of the Tags class prevents it.

I have tried sending metrics for this use case and only one of the tags is sent; I even captured network packets to make sure there was no problem on the DataDog monitoring side.
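
For reference, the deduplication can be reproduced with a small sketch (a minimal example; which of the duplicate values survives is an implementation detail of Tags):

import io.micrometer.core.instrument.Tags;

public class TagDedupDemo {
    public static void main(String[] args) {
        // Two values for the same key are collapsed by Tags' internal dedup,
        // so only one key/value pair remains.
        Tags tags = Tags.of("key", "value_1", "key", "value_2");
        tags.forEach(t -> System.out.println(t.getKey() + ":" + t.getValue()));
        // Prints a single line; only one of the two values is kept.
    }
}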

shakuzen commented 5 years ago

Thank you for helping me understand. I didn't realize we were deduplicating tags by key name always. I was able to reproduce the issue now.

mathurn commented 4 years ago

Hi @shakuzen, has a fix for this use case been implemented? We also have multiple tags with different values for the same key.

balazssandor commented 4 years ago

I am also still curious if we have a solution for this.

harspras commented 4 years ago

Any updates on this?

jayshao commented 3 years ago

I'd also like to upvote this. We have a number of use cases where our data-model is multi-valued, and it would be nice to represent each value as a tag.

If we understand the way tags are persisted correctly (e.g. split off into separate, independent time series), it seems like this should be feasible, except for the deduping on input.

shakuzen commented 3 years ago

Sorry for the lack of update on this. While I understand the technical limitation we have right now in Micrometer and that Datadog allows such a tagging model, I'm not sure we can easily justify supporting this for Datadog when it isn't portable to other backends and we would have to deduplicate for reporting to all other backends.

Just so everyone is clear in this issue, you can already record metrics that have different possible values for the same tag. For example, http.server.requests may tag a method which corresponds to the HTTP method. There are different HTTP methods, so it can have different values. The difference here is that this issue is to record a metric with multiple values at the same time, which doesn't happen with the HTTP server request metrics. One recording corresponds to one request, which will only have one value of HTTP method at a time, even though different requests may have different values. So we already support something like the following (pseudocode representing the state of metrics):

http.server.requests.count uri=/books method=get 5
http.server.requests.count uri=/books method=post 3
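
In code, recordings that produce a state like this might look as follows (a minimal sketch using counters for brevity; the real http.server.requests meter is a Timer):

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class RequestCounts {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        // One time series per unique combination of meter name and tag key/values;
        // the same tag key takes different values across separate recordings.
        registry.counter("http.server.requests", "uri", "/books", "method", "get").increment();
        registry.counter("http.server.requests", "uri", "/books", "method", "post").increment();
    }
}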

From this you can derive that the /books uri received 5 GET requests and 3 POST requests, for a total of 8 requests. What we do not support is recording a metric with a tag method having values post and get for the same metric value (e.g. count). I couldn't think of a good example with a valid use case, so I continue with this example.

http.server.requests.count uri=/books method=get 3
http.server.requests.count uri=/books method=get method=post 4
http.server.requests.count uri=/books method=post 2

While not a realistic example, I suppose this would be interpreted as 3 GET requests, 2 POST requests, and 4 requests where the method was POST and GET.

I am surprised there are this many people with use cases for this. Again, the lack of portability and the breaking change required for this are a significant hurdle to get over. We would have to allow this in the common API, which affects all users.

datibbaw commented 3 years ago

Our use case is that we use cumulative distribution bins for latency metrics, so e.g. a request that takes 34 msec would have the following tags: bin:gt_0, bin:gt_10, bin:gt_20, bin:gt_30 ... the bin sizes become bigger as you go along.

This allows for an easier setup of monitors in datadog, whereby unacceptable requests can be queried as e.g. gt_2000 (more than 2s), which is used for event-based SLO (as opposed to e.g. average of last 15 mins, which doesn't take traffic patterns into consideration).
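
For illustration, a minimal sketch of how such cumulative bin tags could be computed (the thresholds and the gt_ tag format are assumptions based on the description above):

import java.util.ArrayList;
import java.util.List;

public class CumulativeBins {
    public static void main(String[] args) {
        long[] thresholds = {0, 10, 20, 30, 50, 100, 500, 2000}; // assumed bin edges in ms
        long latencyMs = 34;

        // A request is tagged with every bin threshold it exceeds, so slower
        // requests accumulate more bin tags (a cumulative distribution).
        List<String> binTags = new ArrayList<>();
        for (long t : thresholds) {
            if (latencyMs > t) {
                binTags.add("bin:gt_" + t);
            }
        }
        System.out.println(binTags); // [bin:gt_0, bin:gt_10, bin:gt_20, bin:gt_30]
    }
}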

shakuzen commented 2 years ago

@datibbaw The use case you have described sounds similar to what we do with histogram buckets. In registries that support percentile histograms (Atlas, Prometheus, Wavefront) we ship those when the configuration is enabled. If using a registry that doesn't support that, you can still set serviceLevelObjectives for manual bucketing of important thresholds on Timer and DistributionSummary. This results in a gauge (with the name of your meter + .histogram) per SLO bucket with a tag named le and a tag value of the bucket threshold. The gauge value represents the bucket count. You can then query your backend for the .histogram gauges where le has a value of, say, 2000 for 2 seconds on a registry with millisecond base time unit. In Datadog, I think this looks the same in the backend as a single tag name with multiple values on the same metric. We are registering separate gauges for buckets rather than multiple tag values on one meter with the same tag name.
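
A rough sketch of that configuration (meter name and thresholds are illustrative, not from this thread):

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import java.time.Duration;

public class SloBuckets {
    static Timer requestTimer(MeterRegistry registry) {
        return Timer.builder("http.request.latency")
                // Explicit SLO thresholds; on backends without native histogram
                // support these are published as <name>.histogram gauges with an
                // "le" tag per bucket threshold, as described above.
                .serviceLevelObjectives(Duration.ofMillis(100), Duration.ofMillis(500), Duration.ofSeconds(2))
                // Full percentile histogram for backends that support it
                // (Atlas, Prometheus, Wavefront).
                .publishPercentileHistogram()
                .register(registry);
    }
}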

giteshjha commented 2 years ago

Hi, did you try this thread?

datibbaw commented 2 years ago

@shakuzen Sorry for the tardy reply on this. Would it be accurate to say that your suggestion is effectively to define the SLO within the app itself? I'm not sure how heavy it would be to create a gauge for each bucket, but it sounds rather impactful.

The good part of having range tags done in the way that I'm proposing (albeit working only for dd afaict) is that the metric production and SLO definition can be separate from each other, and we're not burning away at the number of metrics that we would have to keep around for each SLO.

Maybe I didn't understand you correctly, and I'm not intimately familiar with atlas, prometheus, etc. (but we're not likely to move away from dd either); if I'm mistaken in my interpretation, please correct me!

patrickbadley commented 1 year ago

Here is our use case (DataDog): We are monitoring metrics that track how many times we update records in a third party system. We are interested in the success/failure rate of these updates at both the object level and field level. An important caveat is that both full object updates and partial updates are supported by the third party system.

We want to include tags on our metrics for both the object name (ex. object_name: customer) and the fields on that object that were included in the update (examples: (fields:first_name, fields:date_of_birth) or (fields:date_of_birth, fields:ssn)), so that we can report on:

  1. which objects resulted in the most failures?
  2. which fields on those objects caused the most failures?

@shakuzen Would it be possible to have the deduplication be configurable or abstracted out to the backend implementation in use? The lack of support for this tagging model seems like a limitation of the other backends, not vice versa, therefore I would think the other backends implementation should be responsible for the deduplication if it's required there.

jonatan-ivanov commented 1 year ago

@patrickbadley Does this help to answer your question? https://github.com/micrometer-metrics/micrometer/issues/1432#issuecomment-716153899

The lack of support for this tagging model seems like a limitation of the other backends, not vice versa

I think I see this differently (by the way, is there another backend that does this?); to me, not being able to do this helps keep things simpler, I guess.

In your case, I think instead of listing the field names, you can track a flag for each of them to see which ones were updated and which were not, e.g. instead of this:

updates(type:partial, object_name:customer, fields:first_name, fields:date_of_birth) 1
updates(type:full, object_name:customer, fields:first_name, fields:date_of_birth, fields:ssn) 1

You can (should?) do this:

updates(type:partial, object_name:customer, first_name:true, date_of_birth:true, ssn:false) 1
updates(type:full, object_name:customer, first_name:true, date_of_birth:true, ssn:true) 1

To me the second one seems more explicit (you can see what was updated as well as what was not), and it seems easier to read, aggregate the values, and reason about what happened.
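
In Micrometer terms, the flag-per-field variant could be recorded roughly like this (a sketch; the names are taken from the example above and are purely illustrative):

import io.micrometer.core.instrument.MeterRegistry;

public class UpdateMetrics {
    static void recordPartialUpdate(MeterRegistry registry) {
        // One tag key per field with a boolean value indicating whether the
        // field was part of the update; each tag key appears exactly once.
        registry.counter("updates",
                "type", "partial",
                "object_name", "customer",
                "first_name", "true",
                "date_of_birth", "true",
                "ssn", "false")
            .increment();
    }
}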

patrickbadley commented 1 year ago

@jonatan-ivanov Thank you for the prompt response. I had read https://github.com/micrometer-metrics/micrometer/issues/1432#issuecomment-716153899 before posting my comment, but felt my post may identify a real world use case for this feature.

Before responding to your comments: there have been multiple comments and reactions to this request indicating that multiple users would like this functionality. Would it be difficult to add a simple configuration option where users can opt out of the dedup() method call for tags? I'm not familiar enough with the Micrometer code to propose a PR, but if we could set an env var or some other config value and then have the Tags class do something like this:

private Tags(Tag[] tags) {
    this.tags = tags;
    Arrays.sort(this.tags);
    if (Config.dedupTags) {
        dedup();
    }
}

that would be a low-cost solution to our problem. A more robust solution might be to use the strategy pattern, or to move the dedup call somewhere else in the logic so the backend in use can determine whether the tags need to be deduplicated.
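
Purely as a hypothetical sketch of that strategy-pattern idea (nothing here exists in Micrometer; the interface and names are invented for illustration):

import io.micrometer.core.instrument.Tag;

// Hypothetical: a policy, chosen per registry/backend, deciding how duplicate
// tag keys are handled, so a backend that can represent multi-valued tags
// (e.g. Datadog) could keep duplicates while others continue to deduplicate.
interface TagDeduplicationPolicy {

    Tag[] apply(Tag[] sortedTags);

    // A Datadog-style policy could simply return the sorted array unchanged.
    TagDeduplicationPolicy KEEP_DUPLICATES = sortedTags -> sortedTags;
}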

Regarding your other comments, I appreciate the potential workaround. We may end up implementing something along the lines of your suggestion, barring support from Micrometer for multiple values for the same tag, which we would still prefer. Without getting too in the weeds of our implementation, we wouldn't easily be able to know which fields are NOT getting synced, so we would only be publishing first_name: true, where the true value isn't really meaningful since we are simply interested in the existence of the key. We would also need to differentiate these field-name tags from other tags, both to allow us to report on them alone and to avoid potential conflicts of tag names; for example, it's possible that an object could have a field called object_name, which would then conflict with our other key. So we would do something like field_first_name: true. Finally, some objects have a lot of fields, so we would end up with hundreds of different tags for this one use case. While I think this solution could work, it seems messier to me than an isolated, meaningful tag like fields with multiple values.

I do not know if another backend allows multiple values for a single tag like datadog, I have only worked with DataDog.

I don't agree with your comment that other backends not supporting this is not a limitation but rather "makes things simpler". To take the argument to an extreme (at the risk of sounding like a jerk 😬), it would be "simpler" to not have a tag feature at all, but that doesn't make the service better.

Thanks again for your thoughtful response.

jonatan-ivanov commented 1 year ago

Would it be difficult to add a simple configuration option where users can opt-out of the dedup() method call for tags?

I think so, mainly for two reasons:

If you "just" remove dedup (please feel free to try it out), you will have two Counters if you do this:

registry.counter("test.counter", "a", "0", "a", "1").increment();
registry.counter("test.counter", "a", "1", "a", "0").increment();

But you really should have just one Counter since these are the same. (If you want to play with this, you also need to do last = tags.length; after you remove dedup();, otherwise you will end up with empty tags.)

Finally some objects have a lot of fields so we would end up with hundreds of different tags for this one use case. While I think this solution could work, it seems messier to me than an isolated, meaningful tag like fields with multiple values.

I think we have the same amount of cardinality with both approaches; we did not reduce or increase the amount of data (time series), we just presented it in a different format.

To take the argument to an extreme...

This is not as extreme as you might think, since backends did not have tags in the past (Graphite, Ganglia, JMX, and Etsy's StatsD still don't). Since you still need multiple time series, what these "hierarchical" backends did was add the "tags" (dimensions) to the name, e.g.:

updates|type=full|objectName=customer 1

Handling this is not simple: sometimes only the values were available (updates.full.customer 1) and the order does matter (as it can in the multi-value tags scenario). So I think not having tags (hierarchical backends) is more complicated than having them (dimensional backends), even though it does not seem like it at first sight.

Let me talk to @shakuzen about this on Monday. I still feel this is asking for trouble, but at least I can offer a hack: DatadogMeterRegistry has a Tag-to-String conversion in its writeMetric method. If you modify it to, say, split tag values at a separator (,), you can sort of simulate this by doing registry.counter("test.counter", "a", "0, 1"), although you would unfortunately also have to deal with ordering (sort the values) and with existing tags (if any).
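
On the application side, that hack would amount to a convention like the following (a sketch only; it assumes a locally modified DatadogMeterRegistry that splits tag values at the separator, which stock Micrometer does not do):

import io.micrometer.core.instrument.MeterRegistry;

import java.util.List;
import java.util.stream.Collectors;

public class MultiValueTagHack {
    static void record(MeterRegistry registry, List<String> values) {
        // Sort and join the values into one tag value so the same set of values
        // always maps to the same meter regardless of input order.
        String joined = values.stream().sorted().collect(Collectors.joining(","));
        // A (hypothetically) modified DatadogMeterRegistry would split this back
        // into multiple "a:<value>" tags when writing the metric.
        registry.counter("test.counter", "a", joined).increment();
    }
}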

datibbaw commented 1 year ago

While considering hacks, I would also like to point out that I have a suggested patch that would accomplish this idea very specifically for supplying multiple tags with the same name; you can find the implementation at #2764.

Note that my approach wouldn't avoid dedup at all; it just changes the rules by which tags are compared, and this could theoretically be done differently for just one backend... in fact, it would only work in one backend, so it already falls into the category of advanced usage.

jonatan-ivanov commented 1 year ago

@datibbaw I'm not considering the hack I mentioned above. I provided it for @patrickbadleyupstart, he can consider doing it if he wants to.

I guess you both read this comment from Tommy:

So allowing something like this in micrometer-core is likely to cause problems unless you're only publishing metrics to Datadog. So I'm not sure yet that this is the best approach. here

Because of this and the things I mentioned in my previous comment, I think I would consider that solution a "hack" but I think it's less risky than a dedup on/off feature. We will discuss this on Monday and will get back to you.

patrickbadley commented 1 year ago

I completely agree and really like @datibbaw's solution. I only proposed the config option in the absence of something more elegant like the proposed PR. Feel free to ignore the config suggestion in favor of @datibbaw's solution.

patrickbadley commented 1 year ago

@jonatan-ivanov I was wondering if you were able to discuss #2764 with @shakuzen and had any update on this issue? Thank you.

jonatan-ivanov commented 1 year ago

Sorry for the late reply.

Right now we are not planning to do this, since one of the main ideas behind Micrometer is the ability to switch between metrics backends with minimal pain. This would encourage users to add lots of Datadog-specific code to their codebase, which would make switching significantly harder and the code less portable. It could also open the door to misuse, which can lead to other issues.

patrickbadley commented 1 year ago

Hi @jonatan-ivanov, I understand that you are trying to stick to a core value of your project, but I don't think this change has all of the negative consequences you mention.

  1. Users would be knowingly opting in to this feature and accepting any of the implications of doing so which I believe is up to them to decide based on their priorities: portability or feature set.
  2. It does not require "Datadog-specific code". The code supports a feature that may only currently be supported by datadog, but could be implemented by any of the other backends.
  3. There is not "a lot" of Datadog-specific code. The code required would be a pretty simple overridden Tag class. If the user did end up switching to a different library they could simply remove the overridden compare method. It's a one line code change I believe (excluding any nice-to-have code clean up to remove references to the unneeded class).
  4. In regards to portability, I believe that the main value in having a library that supports multiple back ends is (or should be) less about having the ability for one user to change their backend, which presumably happens very rarely, but more so, to allow for many different users using different backends to leverage your library. In my experience, changing a third party "back end" of any kind does not happen often, and when it does, it is expected to involve some sort of migration. I don't think that limiting the functionality of one supported back end in the name of portability to back ends with different behavior is a benefit to the library or its users.

I hope you change your mind and add this simple change so that my team, and others, can leverage the full feature set provided by Datadog without negatively impacting any other users of your library.