open-telemetry / opentelemetry-java-contrib

https://opentelemetry.io
Apache License 2.0
158 stars 126 forks source link

Fix Tomcat metric definitions to aggregate multiple MBeans. #1366

Closed jefchien closed 2 weeks ago

jefchien commented 2 months ago

Description: Bug fix - The current Tomcat metric definitions use the otel.mbean constructor function, which expects the query to the MBean server to return a single MBean. If multiple mbeans are returned, it takes the first one and drops the rest, which results in inaccurate/incomplete metric values. The tomcat metrics are all defined with wildcards (*), which are meant to match multiple MBeans.

Changed the Tomcat metric definitions to use otel.mbeans instead, so metric values aren't being dropped. Added an aggregation for doubleValueCallback/longValueCallback since both of those observable instruments will only accept the first value per collection interval.

Existing Issue(s): https://github.com/open-telemetry/opentelemetry-java-contrib/issues/1360

Testing: Added a unit test in the InstrumenterHelperTest to account validate the aggregation.

image image

Documentation: N/A

Outstanding items: Some activemq and jetty metric definitions need to be adjusted to otel.mbeans and tested.

breedx-splk commented 2 months ago

Do we really think that aggregating is the correct approach here? For counter instruments (like the sessions originally reported in #1360) maybe this makes sense, but not necessarily for all instruments, right?

Isn't it possible that two different MBeans registered with the same name could be returning separate metrics with different units? Units aren't really a first-class thing in JMX, so it requires some out-of-band knowledge I think, and aggregating across different units seems problematic.

jefchien commented 2 months ago

Do we really think that aggregating is the correct approach here? For counter instruments (like the sessions originally reported in #1360) maybe this makes sense, but not necessarily for all instruments, right?

Yeah, this is currently only going to be done for gauge instruments (doubleValueCallback/longValueCallback). Will not impact the other types.

Isn't it possible that two different MBeans registered with the same name could be returning separate metrics with different units? Units aren't really a first-class thing in JMX, so it requires some out-of-band knowledge I think, and aggregating across different units seems problematic.

I think this comes down to the metric definition files. It's possible to have multiple MBeans grouped together that share the same attribute name that are unrelated and shouldn't be aggregated, but that would be a case of misconfiguration. Even without this change, the current behavior would still be wrong since it'd only report the value from one of the MBeans.

breedx-splk commented 2 months ago

Hey @jefchien . We talked this over in today's SIG meeting and it sounds like we'd prefer a slightly different approach.

In the case where the user has deployed multiple apps to one tomcat instance (or however they get multiple contexts, each with its own MBean manager), rather than aggregating in code like in this PR, we'd be better served by doing multiple recordings with different attributes, and the context name should be an attribute.

This allows the metric to be correctly aggregated later, and maintains the per-context data points.

Make sense?

jefchien commented 2 months ago

Wouldn't this result in different metrics altogether since that would mean tomcat.sessions without attributes would no longer exist? Doesn't that break current users of the JMX gatherer who collect the metric?

Does this mean that collecting multiple MBeans into a single aggregated metric is not supported? Should setting a metric definition with a wildcard without dimensions for that wildcard be disallowed? Like for example, the ActiveMQ disk metrics where the intention is to aggregate all brokers into a single metric. Should this aggregation always be handled at the collector?

https://github.com/open-telemetry/opentelemetry-java-contrib/blob/32c7e89e8e4549e844fe25bef1bb468f3addce4c/jmx-metrics/src/main/resources/target-systems/activemq.groovy#L101-L124

(setting aside that the definition is incorrect and otel.mbean should be otel.mbeans)

breedx-splk commented 2 months ago

(Sorry for hashing this out in your PR, should have done this originally in the issue) I want to try and clarify a few points, before trying to answer your questions.

First, I agree that the functionality today is not ideal and users are getting surprising and misleading data when there are multiple MBeans that match (like in the tomcat example above). I believe that the problem is no limited to tomcat, and many other targets use wildcard beans, so the problem can exist there as well. I also think that we should work toward making this better.

Today, in the cases where a jmx ObjectName search expression uses a wildcard and, at runtime , happens to match more than one MBean, we currently have no good good way to resolve this. The code, as originally written, expects to be able to record measurements for all of them (at least when passed as a list of bean names, as you discovered), but this fails in the SDK because it amounts to multiple recordings, which is not allowed and a warning is issued.

So maybe there are other options, but I see 3 ways the code could resolve this:

  1. Continue doing the wrong thing. 😿
  2. Aggregate within the groovy code
  3. Report the measurement from each bean, in a meaningful way that allows them to be filtered or aggregated later.

2 above is what this PR is proposing. It assumes that the user always wants the values aggregated when multiple MBeans match the object name query. I have two primary concerns with this:

  1. I have low confidence that this is what users always want. I have no idea if there are other multiple MBeans in other targets who report very different things, such that aggregation is problematic or confusing.
  2. I am concerned that there could be two MBeans that report the same attribute, but with different implied units. This is caused in no small part by the fact that jmx wasn't designed with units in mind. For example, imagine two MBeans that report disk size or memory use or something. Doesn't matter. If one MBean provides the attribute in bytes and the other provides the attribute in megabytes, this becomes problematic. These two should not be aggregated.

3 above is what the SIG basically concluded would be a better approach. For each measurement from each bean, we would include a set of attributes that are different, thereby working around the "it's just using the first one" problem, but also maintaining per-bean granularity that could be externally aggregated if later desired. The simplest way of doing this, IMO, is to add an attribute called jmx.mbean.name whose values is the name of the mbean. We'd probably want to add semconv for this eventually.

Ok, now to address the questions above:

Wouldn't this result in different metrics altogether since that would mean tomcat.sessions without attributes would no longer exist?

Attributes are technically not part of metric identity, so the short answer is no. The attributes do create different "metric streams" within a single metric. The name/value of each attribute is a metric dimension. This is a decent resource that describes this.

Doesn't that break current users of the JMX gatherer who collect the metric?

It could. We do not yet have stability guarantees for this component, it has alpha in the version, and so we are allowed to make breaking changes. Of course, it's never great to introduce breaking changes, but keep in mind that it's definitely reporting the WRONG thing now already, so fixing that should be considered a net improvement.

Does this mean that collecting multiple MBeans into a single aggregated metric is not supported?

The way I read this today before this PR, the answer is no. I didn't see anything that would aggregate across mbeans.

Should setting a metric definition with a wildcard without dimensions for that wildcard be disallowed?

I don't think so. I think there's value in being able to use multiples...like in the tomcat example above, you could filter just the contexts you care about (eg. remove /docs or whatever) and you could still aggregate, either in a collector or a vendor/backend tool. Similarly, the activemq example you cited would allow the user to see per-broker disk metrics, and aggregate them later if it's safe and/or desirable to do so.

breedx-splk commented 2 months ago

@jefchien I hate to be the solo reviewer and blocker on this, so I want to make a proposal to move this forward.

Additional background context: We may see this thing get rewritten without groovy (native java) in the coming months, so expect some additional instability.

To move this forward, I'd like to propose that this PR includes the following:

  1. adds a commandline flag called aggregateAcrossMBeans that defaults to false. If the user then opts into this, then the functionality in this PR should be active. Otherwise we keep the old behavior.
  2. document the new commandline flag and related behavior in the README for this module.

Thanks!

jefchien commented 2 months ago

What you're saying makes a lot of sense and #3 does seem like a better approach, but I think the details for what that looks like needs to be fleshed out a bit more. I appreciate the proposal to unblock this and will add the command line argument to make this opt-in.