open-telemetry / opentelemetry-java

OpenTelemetry Java SDK
https://opentelemetry.io
Apache License 2.0
2.01k stars 834 forks source link

[Metrics] Let me pass Attributes to Bound** Instruments #3561

Closed kvcache closed 3 years ago

kvcache commented 3 years ago

Is your feature request related to a problem? Please describe. Not all exporters support adding resource Attributes to outgoing Metrics. To get around that in my system I'm eagerly binding all of my Counters and whatnot to the Resource's Attributes. But the BoundLongCounter api (and friends) does not have the overload like void add(long value, Attributes attributes); to allow me to optionally attach some observation-specific Attributes.

Describe the solution you'd like I would like the BoundLongCounter interface to extend LongCounter. Or at a minimum add the Attributes overloads from LongCounter.

Perhaps obvious, but this issue is not limited to BoundLongCounter; it is shared by every Bound** Instrument and the ask extends to all of them. Binding Attributes should not preclude the possibility of having dynamic Attributes (which I do).

Describe alternatives you've considered

jkwatson commented 3 years ago

moving this over to the opentelemetry-java project

jkwatson commented 3 years ago

Hi @kvcache . The purpose of the Bound* instruments is precisely to not allow record-time attributes. That code path can be highly optimized and essentially "zero allocation" if we don't allow additional attributes to be passed in. If you need to pass in attributes on the hot path, you'll need to use the regular instruments, and pass the full attribute set in with each recording.

kvcache commented 3 years ago

I think maybe that's missing the point? When I implemented literally this a few years ago I expressed it as allowing a zero-allocation attribute set - rather than exclusively forcing it. It's an arbitrary requirement, not driven by technical necessity, and ought to be called out in docs but allowed. Sometimes you need to do both.

Maybe the right compromise here would be to have a means to tell an Instrument or a Meter to always add the Resource attributes on the attributes' way out for export. That was the zero-allocation compromise I made & it was never found to be costly or restrictive. If you are consuming Resource Attributes you know they won't differ and you don't need to consider them in aggregation uniqueness keying; just toss them in the output body on their way out the door.

Consider this silly scope timer, which cannot disambiguate by any dynamic tags, but at least a user doesn't have to pass in Resource attributes at every single report site:

class Timer(meter: Meter, name: String, description: String, labels: List<Label> = listOf()) {
    val latencyRecorder: BoundLongHistogram = bindResourceLabels(
        meter.histogramBuilder(name)
            .ofLongs()
            .setDescription(description)
            .setUnit(StandardUnit.Microseconds.toString())
            .build(),
        labels
    )

    private fun bindResourceLabels(histogram: LongHistogram, bonusLabels: List<Label>): BoundLongHistogram {
        val labelsBuilder = Attributes.builder().withResourceLabels()

        bonusLabels.forEach { (name, value) ->
            labelsBuilder.put(name, value)
        }
        return histogram.bind(labelsBuilder.build())
    }

    // Records the time spent in the block
    inline fun <T> record(block: () -> T): T {
        val start = System.nanoTime()

        try {
            return block()
        } finally {
            val end = System.nanoTime()
            val latency = kotlin.math.max(0, (end - start) / 1000)
            latencyRecorder.record(latency)
        }
    }
}

Now, if the Meter could be set to report its Resource Attributes, this would no longer need to be bound and I could pass in optional Timer Attributes to make this way more ergonomic. Also that would be a performance boost as Resource Attributes can be considered immutable and invariant, whereas the bound attribute set is not invariant across instances. (as a kludge I make multiple bound timers with the same name and literal same description with a different value for a single "operation" tag to time the same thing by different facets)

jkwatson commented 3 years ago

@jsuereth I wonder if we should remove bound instruments altogether at this point, since they're not specced that I could see in the metrics API spec.

jsuereth commented 3 years ago

We could just remove them. Bound instruments are designed (right now) to avoid a single lock within the metrics SDK and can improve performance of writing metrics by 2x. What you're suggesting is to remove that performance benefit.

Specifically bound instruments ONLY exist as a performance optimisation when attributes are known at allocation time. I can dig out the benchmarking code if you'd like to see the boost.

This feature would basically negate any benefit of bound instruments, instead you should just allocate Attributes and then make append a quick thing (if baseAttributes.toBuilder().put(...).build() isn't elegant enough).

jsuereth commented 3 years ago

Dropping Bound instruments and just building a whole new Attributes map for every single measurement. Terrible for performance and GC overhead (measured).

What code where you using here? How much of this is lack of efficient attribute-append operation?

kvcache commented 3 years ago

I can give you an example but not from my actual service:

// I have to make a timer per api because I can't both bind Resource attributes and provide dynamic attributes.
val listFriendsTimer = Timer(
    myEnvironmentMeterWithResourceAttributesThatNeedToBeInMyAttributeSet,
    "request",
    "an api request. could be any request",
    listOf(Label("api"), "list_friends")
)
val describeFriendTimer = Timer(
    myEnvironmentMeterWithResourceAttributesThatNeedToBeInMyAttributeSet,
    "request",
    "an api request. could be any request",
    listOf(Label("api"), "describe_friend")
)
val addFriendTimer = Timer(
    myEnvironmentMeterWithResourceAttributesThatNeedToBeInMyAttributeSet,
    "request",
    "an api request. could be any request",
    listOf(Label("api"), "add_friend")
)

// I have to make sure I use the separate, correct timer instance for each api instead of instantiating inline with the
// obviously correct attributes. (have to scroll up and down to see if I did it right)
fun listFriends(selfUserId: Integer) = listFriendsTimer.record {
  // I can't record the user ID as a timer attribute because I had to bind the instrument
  // to get Resource attributes.
}

fun describeFriend(selfUserId: Integer, friendUserId: Integer) = describeFriendTimer.record {
  // I can't record the user ID of the requester or the target user as a timer attribute because I
  // had to bind the instrument to get Resource attributes.
}

fun addFriend(selfUserId: Integer, friendUserId: Integer) = describeFriendTimer.record {
  // I pasted this declaration and fell victim to the trivial footgun of having the declaration and the use
  // of the timers separated.
}

An efficient append attribute operation would be helpful and would make recording userId attributes much more tractable. That's 1/2 of it. I also need to report those static, unchanging Resource Attributes.

jsuereth commented 3 years ago

static-unchanging-resource attributes should be done as part of Resource in otel, not passed in with measurements.

Your example here looks like it's not the OTEL API but some other API. Did you check the performance of the OTEL library?

Additionally, it looks like your instantiating a list of your labels in every call. If you pre-allocate the label list, would that be more efficient?

In otel our API takes Attributes not a list of key-value pairs, so I'm not sure we have the same performance issue.

kvcache commented 3 years ago

static-unchanging-resource attributes should be done as part of Resource in otel, not passed in with measurements.

Yeah, I agree - however, the static-unchanging-resource attributes do not get attached to metrics in prometheus export. That means with Otel I had no way to disambiguate metrics from one server versus another on Grafana until I made this kludge.

Your example here looks like it's not the OTEL API but some other API.

Scroll up a bit, it's a class I wrote using the OTEL API to make the OTEL API more ergonomic.

Did you check the performance of the OTEL library?

It has not presented a problem yet. The profiler showed its overhead was acceptable.

Additionally, it looks like your instantiating a list of your labels in every call.

Nope, it's preallocated! I would like to be able to provide both preallocated-static, and dynamic labels.

In otel our API takes Attributes not a list of key-value pairs

Oh I see what you're saying, yeah - I just omitted the AttributesBuilder dance for sake of brevity. There are a bunch of builders and unrelated stuff that distracts from the main issue here: I'm looking to have both static-unchanging-resource attributes and dynamic attributes on Instruments' exported values.

jsuereth commented 3 years ago

the static-unchanging-resource attributes do not get attached to metrics in prometheus export

You can do this if you leverage the Otel collector (e.g. this configuration parameter available on many exporters) to export to prometheus and I think that's a great feature request for the prometheus exporter.

SPECIFICALLY, it'd be great if you raised a feature request around this use case (and I can link to that in an OTEL Specification feature request). I'd love to improve the interaction of OTEL Resource + Prometheus Export as I think we can do far better than now, and your use case exactly matches what resource is intended for. E.g. see this related issue: https://github.com/open-telemetry/opentelemetry-specification/issues/1782

Scroll up a bit, it's a class I wrote using the OTEL API to make the OTEL API more ergonomic.

I see. I think we're likely to provide a timer instrument in the future. Two main points:

  1. Would be happy to accept a Timer instrument in an "extension" library so we can include it in our performance benchmarks and optimise its API.
  2. If you have ergonomic suggestions please let us know. While I fully expect OTEL to become more of an API-standard that provides stability over ergonomics in the long run (like, e.g. JDBC), I still want for this API to be as ergonomic as possible. Given the Kotlin I see above, there may not be a ton we can do, but happy to hear the ideas and things you've tried.

Oh I see what you're saying, yeah - I just omitted the AttributesBuilder dance for sake of brevity. There are a bunch of builders and unrelated stuff that distracts from the main issue here: I'm looking to have both static-unchanging-resource attributes and dynamic attributes on Instruments' exported values.

That's not the purpose of the bound API, unfortunately. It's just there for when you know your attributes ahead of time and we can avoid a lock. It also degrades performance when you're leveraging Baggage to attach labels to metrics from upstream distributed context. The implementation literally does not allow attributes to be added because you're locked into a specific timeseries writer when you use bound (and you've guaranteed that memory has been allocated + frozen for that set of attributes). I.e. "bound" is how you can ensure 0 memory allocations in a hot-path server that collects metrics, and our benchmarks show this. We literally cannot add dynamic attributes without sacrificing that guarantee.

LIke @jkwatson It might be we just drop it to lead to less confusion. I still think it's useful for internal instrumentation to provide the "zero hot-path allocation" guarantees.

kvcache commented 3 years ago

Re: bound instruments

I.e. "bound" is how you can ensure 0 memory allocations in a hot-path server that collects metrics, and our benchmarks show this. We literally cannot add dynamic attributes without sacrificing that guarantee.

I missed that this was a feature of bound instruments - please do not remove these unless you have an equivalent feature elsewhere in your api, and thank you for educating me on this semantic! I will keep the Timer wrapper then anyway to make the attribute binding more clean. Please know that I am currently depending on bound Instruments and if they are contributing to my observed (very) low garbage collector activity then I will sorely miss their removal. Tbh I will go as far as to fork the repo to keep the feature 😬 but would prefer not.

Powerful tools are worth confusion at first glance; let's keep bound instruments in the API if they're the performant option (which I now understand they are). I implore you: Keep bound instruments!

Re: Exporters and resource_to_telemetry_conversion

The PrometheusExporter supports resource_to_telemetry_conversion, but the prometheusremotewriteexporter does not :-(

I'm currently using Promscale to bridge OpenTelemetry to Timescaledb and have to use the remote write exporter, or else add yet another time series db into my infra as a band aid 😄. resource_to_telemetry_conversion is exactly what I'd like but really it seems like such a fundamental thing that it should be standalone, maybe in a processor, so that I don't have to feature request some person to add it out of the goodness of their heart in the fullness of time!

A Timescaledb exporter would be amazing and (if it supported resource_to_telemetry_conversion) would alleviate the Resource binding weirdness I'm currently stuck doing.

anuraaga commented 3 years ago

@jsuereth Do you think you can push a PR to the spec for bound instruments? I'm a bit worried about going with them in one language because they do seem to correlate back to semantic conventions. In particular, I think in most if not all cases, a bound attribute could be encoded into the metric name. We already see this, e.g. http_client_duration and http_server_duration, where client/server could otherwise be a bound attribute. free_mem_heap, free_mem_direct also come to mind as similar. I think it would be unfortunate if we have a fairly large API that may not actually be reflected into semantic conventions (all potential bound attributes end up in the metric name instead) and think we should avoid it. We could consider removing now and adding later if we don't want to go through the spec process for now but think it's important to do.

jsuereth commented 3 years ago

The piece of the specification we had added for bound instruments is this line:

the client MAY allow attribute values to be passed in using a more efficient way (e.g. strong typed struct allocated on the callstack, tuple).

I think it'd be hard to specify the EXACT semantics of bound instruments in the specification across all languages, but I'm happy to do some kind of "For Java, here are guarantees of using bound that we expect all SDK implementations to uphold".

The issue you mention in semantic conventions of metric name vs. attribute is one I actually think goes beyond the notion of "bound", and needs to be addressed irrespective of it. Have you opened a ticket on semantic conventions about whether http_client_duration + http_server_duration should just be http_duration with a set of attributes?

anuraaga commented 3 years ago

I think bound attributes are semantically ones that are known at instrument creation time vs record time. This seems language agnostic and the concept can be added to the spec. Languages could potentially skip any particular implementation if they want since it's always possible to merge at record time. But the concept being introduced would allow use from semantic conventions.

Have you opened a ticket on semantic conventions

I haven't opened an issue since they both seem ok. I think I've used a pattern like what we already have in apps before using Prometheus and didn't have a problem. I can imagine using attribute without a problem too.

Indeed it's convention specific whether it's appropriate to use an attribute vs the name usability wise. But without the concept of bound attributes in the spec, I can imagine actively avoiding them for performance reasons even if it is appropriate usability wise. So I think they are related at least somewhat.

jsuereth commented 3 years ago

@anuraaga now that the dust is settling in metric reader, I'll see if I can expand on the performance statement in the specification.

I still think @jkwatson question of whether we're using confusing terminology applies. I plan to write up a "how to avoid hot path allocation with otel metrics" markdown file and we can determine whether or not the API we expose is the right one. Given expectations of users, "bind" may be seen as pure convenience vs. performance all the time. I'd like to avoid that confusion

kvcache commented 3 years ago

Great, thanks - avoiding allocation is important for us and having that as a concept that OpenTelemetry promotes as an option is incredibly valuable. Hardening it with docs & expectation around "convenience" versus "performance" would help.

Since I (on my current project) must choose obsessively low-cost instrumentation over convenience, I will choose performance-oriented paths without regard for the convenience & setting up that expectation would (I think) have headed off my opening this ticket in the first place. Btw in case it's unclear, I don't want what I asked for anymore because I am convinced that this is the performant pattern and that doing what I asked would erode the low cost that bound instruments promote. Thank y'all again for your education on this ❤️

jkwatson commented 3 years ago

I wonder if it would be useful to convert this to a discussion, @kvcache so others can find it and hopefully understand the nuance a bit better.

kvcache commented 3 years ago

@jkwatson agree - I lack the permissions but any maintainer should feel free to do so from my perspective.