open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0
3.72k stars 888 forks source link

Truly auto instrumentation #4255

Open ostrolucky opened 5 days ago

ostrolucky commented 5 days ago

What are you trying to achieve?

Having a way to find unexpected bottlenecks

Additional context.

My biggest problem with Opentelemetry is that the way it seem to work is you have to specify which functions to instrument. There are opentelemetry-auto-* packages, however all those are is predefined lists of such functions. No matter if you use opentelemetry-auto-* packages or not, problem remains: You have to specify exact functions to instrument, which is not how commercial solutions do it.

Why is this a problem?

Since you have to pick functions to instrument, once bottleneck happens in a function you didn't expect would be a bottleneck, you will have no clue from your spans where is the bottleneck, since you don't instrument it. All you are left with at that point is to check all function calls from the place that is identified by parent span, add some or all of them to your instrumentation and redeploy.

To make opentelemetry more useful, we should have a way to instrument on something else than some static list of functions, for example random sampling instrumentation of all function calls, or any function call which exceed time treshold. That would be closer to the way what people are used to and I would argue more useful than what we have now, which leaves you quite stuck once you run into some unexpected slowdown and struggling to find the root cause.

I would also suggest to rename current opentelemetry-auto concept to opentelemetry-preset, as auto here is very misleading if what those packages mostly do is to predefine list of functions to trace

trask commented 5 days ago

hi @ostrolucky! I believe this is the gap that the profiling signal will fill

didiViking commented 2 days ago

@trask Is there ETA for the profiling signal being added to OTEL? Also could you please share any related repositories? Many thanks.

svrnm commented 2 days ago

hi @ostrolucky! I believe this is the gap that the profiling signal will fill

@ostrolucky does @trask's answer solve your issue or is there anything left to discuss?

@didiViking you can have a look here: https://github.com/open-telemetry/opentelemetry-specification/pull/4197 and https://github.com/open-telemetry/opentelemetry-ebpf-profiler to follow the progress

felixge commented 2 days ago

@trask Is there ETA for the profiling signal being added to OTEL? Also could you please share any related repositories? Many thanks.

đź‘‹ I'm Profiling SIG member: I can't speak on behalf of the entire group, but I think we are very close to having the first end-to-end development implementations up and running. But from that point I'd still expect it to be 1 year+ before things are stable enough for serious adoption.

ostrolucky commented 2 days ago

@ostrolucky does @trask's answer solve your issue or is there anything left to discuss?

Well it doesn't solve the issue, just confirms that there really isn't such feature at the moment, although there is some (in very early stage) plan in form of profiling signals, unless I understood the message wrongly.

Your referenced links are for profiling tools. What I, as a user would need to know is if and when are these coming to opentelemetry language extensions/instrumention libraries in form of hooks (so that I'm not limited to specify concrete functions to trace like now). I don't want to make any assumptions though, could you confirm if finalizing profiling is a pre-requisite for the end goal I specified and if it's planned to implement that in language SDKs once profiling is finalized?

svrnm commented 2 days ago

@ostrolucky before providing some details to your question, let me quickly pick up one of your initial statements:

You have to specify exact functions to instrument, which is not how commercial solutions do it.

Since I have worked very closely with some of the commercial solutions, I can tell you, that they do exactly that same thing: They also need to have exact functions to begin their instrumentation in your service, and they also need to have the exact functions to exit their instrumentation, when they call a remote service. They also need to figure out where to inject and extract their proprietary propagation headers. What they do additionally is that they combine that with that random sampling of the function calls, which is similar to our profiling signal.

The challenge with that approach is, that you do that by using byte code transformation, monkey patching and other mechanisms that are injecting into the code to bend it to provide observability. Which is error-prone, and also requires a lots of engineering power to keep those lists up-to-date.

I am saying all that, to highlight a key difference from OpenTelemetry's vision: while we also provide that way of instrumentation, we encourage library and framework developers to natively instrument their code, which means they add all that code which gets injected external directly into their libray. So eventually their should be no opentelemetry-auto-* libraries anymore. MassTransit for .NET is an example for that: https://masstransit.io/documentation/configuration/observability

Back to your question, I assume this is what you mean with your end goal:

To make opentelemetry more useful, we should have a way to instrument on something else than some static list of functions, for example random sampling instrumentation of all function calls, or any function call which exceed time treshold. That would be closer to the way what people are used to and I would argue more useful than what we have now, which leaves you quite stuck once you run into some unexpected slowdown and struggling to find the root cause.

This is what profiling gives you and it will be correlated with other signals as well.

ostrolucky commented 2 days ago

Thanks for the response. All makes sense. I mean, looks like opentelemetry has a vision that I don't entirely agree with (you have to hope somebody manually properly specifies the functions to instrument that can be a bottleneck, even if it's framework/library maintainers themselves). But one note I can't help myself responding to:

The challenge with that approach is, that you do that by using byte code transformation, monkey patching and other mechanisms that are injecting into the code to bend it to provide observability. Which is error-prone, and also requires a lots of engineering power to keep those lists up-to-date.

That depends on the language, doesn't it? For example, PHP 8 has observer API, where it should be easy to do.

dmathieu commented 2 days ago

For example, PHP 8 has observer API, where it should be easy to do.

Then, there's nothing preventing the language maintainers (or you) from starting something which does that in the language of your choice. Because it's so dependent on the language, it's then hard to generalize.

For example, Go has otel-go-instrumentation which uses ebpf to detect what's being run and instrument it. It's not everything, but that goes closed to what you're asking for. There's also instrgen which generates telemetry in-place.

svrnm commented 2 days ago

Thanks for the response. All makes sense. I mean, looks like opentelemetry has a vision that I don't entirely agree with (you have to hope somebody manually properly specifies the functions to instrument that can be a bottleneck, even if it's framework/library maintainers themselves).

The good thing is, that you can do both with OpenTelemetry, and this is what is happening right now and will be reality for time to come. We have libraries and application that adopt OpenTelemetry natively (with all advantages to it), you can find a growing list here, and we have the automatic ways, that leverage different mechanisms. @dmathieu gave some good examples, because those mechanisms reach from language specific APIs (like observer API, byte code instrumentation), some hacks (monkey patching), some compile-time solutions (instrgen) and some kernel-level solutions (ebpf).

This is an ever moving target, but the bottleneck that OpenTelemetry is removing, is that every commercial vendor is doing their own instrumentation and maintains their own list of functions they need to hook in.

The challenge with that approach is, that you do that by using byte code transformation, monkey patching and other mechanisms that are injecting into the code to bend it to provide observability. Which is error-prone, and also requires a lots of engineering power to keep those lists up-to-date.

That depends on the language, doesn't it? For example, PHP 8 has observer API, where it should be easy to do.

How is PHP observer API different from that? This is exactly what the PHP auto-instrumentation extension[^1] is leveraging in exactly that way I described: it provides external instrumentation for common libraries to inject the OpenTelemetry instrumentation code to it. This is done by marking certain predefined method names and by injecting propagation headers into methods that are used for communication.

Let me try to put it that way: the thing you are looking for (aka profiling) and the thing you say is a bottleneck (adding tracing capabilities to a library/application) are not competing/alternative solutions, they are complementary: while the first one enables deep code level visibility, the second one enables distributed tracing.

[^1]: Quote from the README.md: "This is a PHP extension for OpenTelemetry, to enable auto-instrumentation. It is based on zend_observer and requires php8+"