open-telemetry / opentelemetry-python

OpenTelemetry Python API and SDK
https://opentelemetry.io
Apache License 2.0
1.79k stars 622 forks source link

Restricted / Redirected typing for sdk.trace.Tracer and sdk.trace.TracerProvider #2988

Open thekuffs opened 2 years ago

thekuffs commented 2 years ago

I've been working on integrating the opentelemetry sdk into a core support library for a set of microservices. I've run across an issue where the type selection for an interface seems unnecessarily restrictive. Originally I thought this was a design change that just didn't make it to documentation. But after some investigation I don't think that's it. I'll explain the case and then my reasoning.

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor

otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317")
span_processor = SimpleSpanProcessor(otlp_exporter)
trace.set_tracer_provider(TracerProvider(active_span_processor=span_processor))

This will execute just fine. But will throw errors in type checking because SimpleSpanProcessor is not one of sdk.trace.SynchronousMultiSpanProcessor or sdk.trace.ConcurrentMultiSpanProcessor.

This was initially confusing to me. It's not in any of the supporting documentation. Those are the only two descendants of sdk.trace.SpanProcessor in that module. Why would one choose to explicitly list descendants instead of just picking the base class for type checks? Especially when the implementation of sdk.trace.Tracer sticks to the interface defined at sdk.trace.SpanProcessor.

These changes were contributed in #594. There seems to be some conversation happening out-of-thread that directed some of this. I am not exactly sure about the original design intent. The type specification seems to force an interface.

I don't know the state of the ecosystem at that time. But these days, opentelemetry-collector can do all the multiplexing you want. With the benefit of being optimized for its own case. And in its own entirely separate process. I'm going to configure mine exactly like that. Sending traces to Jaeger and X-Ray at the same time.

sdk.trace.TracerProvider.add_span_processor() acts as a proxy for the underlying sdk.trace.SynchronousMultiSpanProcessor.add_span_processor() or sdk.trace.ConcurrentMultiSpanProcessor.add_span_processor(). It also provides an implementation for sdk.trace.TracerProvider.shutdown() that allows for an optional exit handler.

I do not think that this part of the interface is necessary, and it is not part of the api specification. I personally only need one TracerProvider for my needs, and would prefer a shorter call stack for the sake of simplicity and performance. If someone needs more than one TracerProvider then the existing MultiSpanProcessor implementations will work as they are.

And there is no need to proxy add_span_processor(), shutdown(), or force_flush(). In fact, these function calls are not part of the api spec. Instead, one should invoke those methods on the SpanProcessor of choice. Perhaps we could makesdk.trace.TracerPovider._active_span_processor public, add a property if we really want to control it, use get_tracer().span_processor, or save a reference to one's chosen SpanProcessor at construction time.

So, to summarize:

I also think:

But the OpenTelemetry docs say that public interfaces to the sdk must remain stable and I'm not sure where the 'public' line is drawn here.

I realize there are several different issues happening here. But I think they're closely related and want to at least launch a discussion from here.

I am absolutely willing to submit a PR of the changes. But I'm new to this project. I want to make sure I understand the design goals.

aabmass commented 1 year ago

Hey @thekuffs, thanks for opening the issue and starting the discussion.

  • The type requirements for sdk.trace.TracerProvider._active_span_processor should be relaxed to sdk.trace.SpanProcessor.

You probably already noticed, but TracerProvider.add_span_processor() actually requires that self._active_span_processor is one of those two types (see https://github.com/open-telemetry/opentelemetry-python/issues/3019#issuecomment-1315913516 which the user is having the opposite issue by violating the type). We could loosen the constructor argument and wrap it in one of the multi-processors if it doesn't hasattr("add_span_processor") but I'm not sure that would be worth the effort.

  • sdk.trace.TracerProvider should scale back its implementation to the interface defined in api.trace.TracerProvider. It does not need to act as a proxy for the underlying SpanProcessor. [...] But the OpenTelemetry docs say that public interfaces to the sdk must remain stable and I'm not sure where the 'public' line is drawn here.

I agree, but I think this is definitely under the public API and we will have to live with it. We could mark it deprecated and instruct users to use the underlying processor. We don't really have any plans of a 2.x major version bump though, and deprecation warnings tend to annoy users if they are spammy.

  • Consider removing both of these MultiSpanProcessor-s. Since we now have BatchSpanProcessor.

How does BatchSpanProcessor help here in terms of multiplexing?

Generally speaking, there is some artistic license taken in each OTel implementation vs what the spec says to the letter. It's OK to have extra things as long as we implement specified behavior.

thekuffs commented 1 year ago

I haven't forgotten about this issue. Work tasks have just landed me in different parts of our codebase. I need a little bit to get myself re-acquainted to this.