open-feature / dotnet-sdk

.NET implementation of the OpenFeature SDK
https://openfeature.dev
Apache License 2.0
59 stars 17 forks source link

[FEATURE] Promote OTel hooks from contrib to in-the-box #175

Open austindrenski opened 6 months ago

austindrenski commented 6 months ago

Requirements

Following open-feature/dotnet-sdk-contrib#132, the OpenFeature.Contrib.Hooks.Otel package now only depends on OpenFeature and OpenTelemetry.Api, the latter of which in turn has just a single dependency on System.Diagnostics.DiagnosticSource.

I'm proposing that we obsolete the OpenFeature.Contrib.Hooks.Otel package and incorporate its contained code into the main OpenFeature project to provide first-class telemetry support in line with other popular projects in the .NET ecosystem.

Is this a good idea?

From open-telemetry/opentelemetry-dotnet (emphasis added):

The inspiration of the OpenTelemetry project is to make every library observable out of the box by having them call OpenTelemetry API directly. However, many libraries will not have such integration, and as such there is a need for a separate library which would inject such calls, using mechanisms such as wrapping interfaces, subscribing to library-specific callbacks, or translating existing telemetry into OpenTelemetry model.

From open-telemetry/opentelemetry-dotnet (emphasis added):

Application developers and library authors use OpenTelemetry API to instrument their application/library. The API only surfaces necessary abstractions to instrument an application/library. It does not address concerns like how telemetry is exported to a specific telemetry backend, how to sample the telemetry, etc. The API consists of Tracing API, Logging API, Metrics API, Context and Propagation API, and a set of semantic conventions.

What if we think OpenTelemetry.Api is just too scary?

I personally have a lot of faith in the open-telemetry/opentelemetry-dotnet team, and thus place a lot of trust in their promise to maintain OpenTelemetry.Api as a lightweight abstractions package, but for good measure, I do want to call out that we could accomplish this proposed in-the-box'ing without taking a dependency on OpenTelemetry.Api.

To do this we would instead take a direct dependency on System.Diagnostics.DiagnosticSource, and then reimplement-the-wheel for ActivityExtensions.RecordException(...), ActivityExtensions.SetStatus(...), etc.

Again, I'm not endorsing this, but it's an option (e.g. among many other reasons, it would be exceptionally peevish if our impl drifts from the upstream impl and exceptions recorded by another library have different formats in Datadog/Dynatrace/etc).


See: open-feature/dotnet-sdk-contrib#132

askpt commented 6 months ago

I fully endorse moving the metrics, traces and logs into this repo. The consumers later have the chance to include or exclude the data that we collect. If this is accepted, I can work on moving it here.

askpt commented 5 months ago

1 - @open-feature/sdk-dotnet-approvers & @open-feature/sdk-dotnet-maintainers: if this makes sense, I'll abandon https://github.com/open-feature/dotnet-sdk-contrib/pull/122 and move the OTEL stuff here. Anyone against?

2 - What are the steps to deprecate the library in case of approval?

beeme1mr commented 5 months ago

Hooks were introduced to OpenFeature to support use cases such as telemetry without bloating the SDK with dependencies that not everyone needs. In this case, Microsoft has added native support for OTel within the runtime, so dependency bloat isn't a concern. I think it's worth looking into.

In terms of the implementation, are you proposing that the hook is bundled with the rest of the SDK or would OTel be automatically configured within the SDK without requiring a hook at all? Bundling the hook seems like the most straightforward approach to me. The automatic approach is compelling but I think it would introduce additional challenges we would have to consider.

What do you think @askpt @austindrenski @toddbaert?

austindrenski commented 5 months ago

@beeme1mr wrote https://github.com/open-feature/dotnet-sdk/issues/175#issuecomment-1900634701:

In terms of the implementation, are you proposing that the hook is bundled with the rest of the SDK or would OTel be automatically configured within the SDK without requiring a hook at all? Bundling the hook seems like the most straightforward approach to me. The automatic approach is compelling, but I think it would introduce additional challenges we would have to consider.

When I opened this issue last week, I was just proposing the minimally invasive approach of lifting the hooks as-is into the SDK, but with another week of using the SDK in production, I'm finding the lack of complete telemetry from the SDK to be a real peeve.

I'm personally finding it challenging to answer questions about perf, overhead, etc., related to the impact of using the OpenFeature SDK on a range of luke-warn to scorchingly-hot paths.

So, I'm in favor of fully instrumenting the OpenFeature SDK without hooks:

There's prior art for doing this without taking any direct dependencies from the MassTransit project (see: MassTransit/MassTransit@7e9ec245a37745e30c7ba16bf5c04c5f0c3554ea, open-telemetry/opentelemetry-dotnet-contrib#788), but while impressive that they were able to do this (and laudable from the as-few-dependencies-as-possible-please perspective), I would prefer that we just accept a direct dependency on OpenTelemetry.Api.

My reasoning for this is that, given the dotnet/runtime team's efforts to make System.Diagnostics.DiagnosticSource first-class compatible with the OTel spec (+/- some legacy method names), the broader .NET ecosystem is seeing widespread adoption and integration with OTel, and IMO, there will be very few consumers of the OpenFeature SDK that will not already have taken their own dependency on OpenTelemtry.Api elsewhere in their package graph.

I'm open to hearing quibbles with this^ opinion, but as stated elsewhere, I have confidence in the open-telemetry/opentelemetry-dotnet team to maintain OpenTelemetry.Api in line with its stated goals of being a minimal abstractions library with a sole dependency on System.Diagnostics.DiagnosticSource.

The downside to going the MassTransit route is two-fold:

  1. As library authors, we will have to rewrite the wheel for helpful extensions like:
  2. As a library consumers, I will always have to provide my own registration extension which will both hurt discoverability and diminish the first-time, works-out-of-the-box experience:

    • namespace OpenTelemetry.Trace;
      
      /// <summary>
      /// Provides extension methods to enable OpenFeature tracing instrumentation on a <see cref="TracerProviderBuilder"/>.
      /// </summary>
      public static class OpenFeatureTracerProviderBuilderExtensions
      {
       public static TracerProviderBuilder AddOpenFeatureInstrumentation(this TracerProviderBuilder builder)
           => builder.AddSource("OpenFeature");
      }
    • namespace OpenTelemetry.Metrics;
      
      /// <summary>
      /// Provides extension methods to enable OpenFeature metrics instrumentation on a <see cref="MeterProviderBuilder"/>.
      /// </summary>
      public static class OpenFeatureMeterProviderBuilderExtensions
      {
       public static MeterProviderBuilder AddOpenFeatureInstrumentation(this MeterProviderBuilder builder)
           => builder.AddMeter("OpenFeature");
      }

I'm not saying that either of these^ are outright blockers. The MassTransit folks have shown that its more than possible to do this in a way that produces excellent telemetry. But as a consumer of MassTransit, I often wish they would've just taken the dependency on OpenTelemetry.Api so I didn't have to keep copy/pasta'ing these^ snippets into each new project (thus requiring me to take on the OpenTelemetry.Api dependency anyways).

So, I'm open to hearing other opinions, but I'm just highly skeptical that avoiding OpenTelemetry.Api (especially as a fellow CNCF project...) provides much real-world benefit beyond the theater of "reducing dependencies" which the consumer will end up with anyways.

(Note: none of what I've written here is in anyway advocating for a batteries-included, opinionated approach to adding dependencies. I just view OpenTelemetry.Api as having a unique role in the future of the .NET ecosystem.)

tldr; my vote is to instrument the SDK directly without hooks (leaving hooks to contribs and users)

beeme1mr commented 5 months ago

I'm personally finding it challenging to answer questions about perf, overhead, etc., related to the impact of using the OpenFeature SDK on a range of luke-warn to scorchingly-hot paths.

Could you elaborate on this a bit more? What would you like to capture that can't be collected by a hook? It's also worth noting that a provider can include hooks, so that could be a nice way to ensure consistency across implementations.

Just to be clear, I'm not opposed to including native OTel support. I actually think it may be a good idea since the runtime has such good support for it. I just want to make sure we consider the impact and alternatives before committing to the change.

austindrenski commented 5 months ago

@beeme1mr wrote https://github.com/open-feature/dotnet-sdk/issues/175#issuecomment-1904124472:

@austindrenski wrote https://github.com/open-feature/dotnet-sdk/issues/175#issuecomment-1900858323:

I'm personally finding it challenging to answer questions about perf, overhead, etc., related to the impact of using the OpenFeature SDK on a range of luke-warn to scorchingly-hot paths.

Could you elaborate on this a bit more? What would you like to capture that can't be collected by a hook? It's also worth noting that a provider can include hooks, so that could be a nice way to ensure consistency across implementations.

Just to be clear, I'm not opposed to including native OTel support. I actually think it may be a good idea since the runtime has such good support for it. I just want to make sure we consider the impact and alternatives before committing to the change.

[!NOTE]

Disclosure up-front:

There are a couple of camps out there w.r.t. how much trace telemetry is too much trace telemetry, but I tend to land toward the right-hand tail of it. My rough rule-of-thumb is if you're tracing _every_ method, you've gone too far, but anything short of that and I'm pretty open to hearing a good argument for it.

If you take this too far (e.g. as I've done on occasion), you end up with a real-world app where a root trace might contain dozens of thousands of child spans.

Personally, I love having that much data (especially when the app is a background service, not serving direct user traffic, and I can live with a little tracing overhead in exchange for deep, exhaustive performance telemetry). But this has many, many downsides, from the sheer cost of collecting that much data, to the potential runtime overhead, to the fact that major telemetry SaaS providers (e.g. Datadog) have a tendency to crash when loading a trace with more than N spans.

Obviously, this^ would be going wayyyy overboard for a library like the OpenFeature SDK, and that's not what I'm suggesting. (Among other things, I'm only happy with that much telemetry when I'm in control of it, i.e., if a third-party library produced that much telemetry on its own, I'd be peeved.)

At the same time, in order to get to a place where I, as an app owner, can choose to collect an excessive amount of telemetry, I need the libraries that I consume to produce at least some traces of their own.

So my ideal outcome from this issue would be for the OpenFeature SDK to produce at least the following traces:

  1. top-level traces starting in OpenFeatureClient.Get{*}Details()
  2. wrapper traces covering each of the top-level OpenFeatureClient.Trigger*Hooks() calls
  3. wrapper traces covering each of the individual hooks being called by (2)

the effect being something like this (where the number of traces we produce is 3 + the number of before hooks + the number of after hooks:

get_boolean_details  |-------------------------------------|
trigger_before_hooks   |--------|
trigger_before_hook    |--|
trigger_before_hook       |--|
trigger_before_hook          |--|
evaluate                        |----------------|
trigger_after_hooks                              |--------|
trigger_after_hook                               |--|
trigger_after_hook                                  |--|
trigger_after_hook                                     |--|

and potentially a few more spans if we make them sufficiently opt-in (e.g. feature flags for feature flags telemetry, lol):

 get_boolean_details  |-------------------------------------|
+calculate_context    |-|
+calculate_hooks      |-|
 trigger_before_hooks   |--------|
 trigger_before_hook    |--|
 trigger_before_hook       |--|
 trigger_before_hook          |--|
 evaluate                        |----------------|
 trigger_after_hooks                              |--------|
 trigger_after_hook                               |--|
 trigger_after_hook                                  |--|
 trigger_after_hook                                     |--|

What I'm looking to get out of this is to set consistent telemetry expectations for consumers of the OpenFeature SDK, irrespective of what an individual contrib package chooses to do on their own.

For example, I want to be able to add a new contrib hooks package, and if it causes perf regressions, I want the base OpenFeature SDK to be sufficiently instrumented that I can determine that even if the contrib package provides none of their own telemetry.

This example^ is where my idea about the trigger_{before,after}_hooks and trigger_{before,after}_hook spans come in:

If we handle starting those spans here in the SDK, then a hook provider (whether from the contrib repo, or a one-off implemented in the consumer app) can add tags, record exceptions, and set the status on the ambient activity (i.e. Activity.Current?.Set*()) without the consumer app having to opt-into additional telemetry from each contrib package (i.e. .AddSource("OpenFeature") gets you everything you want from an OpenFeatureClient without knowing to also add .AddSource("OpenFeature.Contrib.Hooks.OTelHooks")).

I'm not entirely sure this^ will read as convincing/easy-consensus as I think it will look once we put together some draft code.

Could you elaborate on this a bit more? What would you like to capture that can't be collected by a hook? It's also worth noting that a provider can include hooks, so that could be a nice way to ensure consistency across implementations.

I (hopefully) answered this above, but just to bottom-line it:

tldr; I'm proposing that the OpenFeature SDK should itself wrap hooks in telemetry spans so that consumers can observe the impact of:

  1. the SDK's own hook handling, and
  2. relative perf of individual hooks without relying on contrib hooks to implement their own consistent telemetry.
askpt commented 5 months ago

I'm personally finding it challenging to answer questions about perf, overhead, etc., related to the impact of using the OpenFeature SDK on a range of luke-warn to scorchingly-hot paths.

Could you elaborate on this a bit more? What would you like to capture that can't be collected by a hook? It's also worth noting that a provider can include hooks, so that could be a nice way to ensure consistency across implementations.

Just to be clear, I'm not opposed to including native OTel support. I actually think it may be a good idea since the runtime has such good support for it. I just want to make sure we consider the impact and alternatives before committing to the change.

I have been exploring better the current otel-contrib implementation. There are a few points which I would like to discuss.

  1. The metrics don't use any of the OTEL libraries. It uses the native System.Diagnostics.Metrics API.
  2. The traces use OTEL but can easily migrate to the System.Diagnostics.Trace API. The only method that is not present in this API is the RecordException that can be easily "moved" if we want to remove any external dependency. Even though I trust the Otel team. đŸ˜„
  3. The traces hook is a bit "weird" and I don't know if it's by design. I would expect it to create spans so you would be able to see what is done by the API. Example: span for flag evaluation, span for hook, etc, etc. This level of detail can only be achieved if we move the traces into the main SDK.

P.S.: I think the point 3 is the same as @austindrenski pointed out but he types way faster than me đŸ¤£

beeme1mr commented 5 months ago

Hey @askpt @austindrenski, thanks for your input. I can see how that could be valuable in some situations, but it would likely be overkill most of the time. If we went that route, I would recommend not capturing at that granularity by default.

To provide a bit more context behind the current implementation, please check out the feedback we received when initially adding feature flag semantics to the OTel spec. The TL;DR is the OTel team felt that a feature flag evaluation should have a negligible impact on performance in most situations. In OpenFeature, some providers may perform a remote evaluation, but those calls can typically be captured using out-of-the-box instrumentation (e.g. HTTP, gRPC).

https://github.com/open-telemetry/opentelemetry-specification/pull/2529

austindrenski commented 5 months ago

Hey @askpt @austindrenski, thanks for your input. I can see how that could be valuable in some situations, but it would likely be overkill most of the time. If we went that route, I would recommend not capturing at that granularity by default.

Absolutely. At a minimum, any telemetry we build into the library at this layer will be gated behind the usual OTel opt-in, i.e., consumer apps will need to explicitly call .AddSource("OpenFeature") (+/- a helper method) when configuring their app.

So by default, adding a package reference for OpenFeature != opting into any telemetry.

Taking it a step further, I described two groups (i.e. at least this and also maybe these) in my earlier post, but I'd support going even more granular, especially in a post-#181 world, e.g.:

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .WithMetrics(static builder => builder
        // exporters
        .AddOtlpExporter()
        // instrumentation
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddOpenFeature())
    .WithTracing(static builder => builder
        // exporters
        .AddOtlpExporter()
        // instrumentation
        .AddAspNetCoreInstrumentation(static options => options.RecordException = true)
        .AddHttpClientInstrumentation(static options => options.RecordException = true)
        .AddOpenFeature(static options =>
        {
            options.IncludeContexts = true;
            options.IncludeFlags = true;
            options.IncludeHooks = true;
        }));

I want granular telemetry, but I'm incredibly sensitive to my position being an outlier/very much related to the specific workloads I'm building/maintaining.

So very committed to finding the right balance of out-of-the-box telemetry support, without burdening consumers with unwanted noise.

To provide a bit more context behind the current implementation, please check out the feedback we received when initially adding feature flag semantics to the OTel spec. The TL;DR is the OTel team felt that a feature flag evaluation should have a negligible impact on performance in most situations. In OpenFeature, some providers may perform a remote evaluation, but those calls can typically be captured using out-of-the-box instrumentation (e.g. HTTP, gRPC).

open-telemetry/opentelemetry-specification#2529

That's very interesting to hear, will review the earlier discussion, thanks for the xref!

beeme1mr commented 5 months ago

I think this is worth exploring a bit more. I really like the idea. The only downside I see is that a provider would no longer be able to ship with OpenTelemetry support pre-configured, unless I'm missing something. Perhaps there's an elegant way to support that use case too.