microsoft / onefuzz

A self-hosted Fuzzing-As-A-Service platform

Revisit logging to enhance observability #312

Open ranweiler opened 3 years ago

ranweiler commented 3 years ago

Revisit our logging, and move to a model that allows:

Since we use tokio, the tracing library with an OpenTelemetry backend would achieve all of the above.

AB#36002

ranweiler commented 2 years ago

Current ecosystem support for OpenTelemetry + Application Insights:

Rust

No first-party OpenTelemetry/App Insights support here, even at the Preview level. There is a third-party Application Insights exporter for the opentelemetry SDK crate.

Putting these together, we can use tracing, opentelemetry, tracing-opentelemetry, and opentelemetry-application-insights to generate and export async-compatible span data. We can even export log-style events as App Insights Trace telemetry, correctly associated with their parent spans.
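A minimal sketch of how these crates might be wired together on tokio (this assumes the exporter's reqwest-client feature; exact builder and method names vary across crate versions, and the instrumentation key handling is illustrative only):

```rust
use tracing::{info, instrument};
use tracing_subscriber::layer::SubscriberExt;

#[instrument]
async fn run_task(task_id: u64) {
    // A log-style event inside the span; the exporter should emit it as
    // Application Insights trace telemetry associated with this span.
    info!(task_id, "starting task");
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Illustrative only; in practice the key would come from our config.
    let instrumentation_key = std::env::var("APPINSIGHTS_INSTRUMENTATIONKEY")?;

    // OpenTelemetry tracer that exports spans to Application Insights,
    // batched on the tokio runtime.
    let tracer = opentelemetry_application_insights::new_pipeline(instrumentation_key)
        .with_client(reqwest::Client::new())
        .install_batch(opentelemetry::runtime::Tokio);

    // Bridge `tracing` spans and events into OpenTelemetry.
    let otel_layer = tracing_opentelemetry::layer().with_tracer(tracer);
    let subscriber = tracing_subscriber::Registry::default().with(otel_layer);
    tracing::subscriber::set_global_default(subscriber)?;

    run_task(42).await;

    // Flush any pending telemetry before exit.
    opentelemetry::global::shutdown_tracer_provider();
    Ok(())
}
```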

I don't yet see an off-the-shelf mechanism for Custom Events, but it seems like it'd be easy to add. We could also have a separate telemetry channel that uses the appinsights crate just for specialized telemetry like Custom Events. This may be preferable specifically for the optional, non-identifying global telemetry.
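For that separate channel, a sketch of emitting a Custom Event directly with the appinsights crate (the event name and key source are placeholders, and exact signatures depend on the crate version):

```rust
use appinsights::TelemetryClient;

fn send_global_event(instrumentation_key: String) {
    // Dedicated client used only for specialized telemetry such as Custom
    // Events, independent of the tracing/OpenTelemetry pipeline.
    let client = TelemetryClient::new(instrumentation_key);

    // Placeholder event name; this shows up in Application Insights as a
    // customEvents row rather than a trace.
    client.track_event("worker_started");
}
```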

Python

There is first-party support, but only in preview. We may get some wins if we focus on spans (without events), or use libraries that are getting early attention for pervasive OpenTelemetry instrumentation (FastAPI?).

We can use OpenTelemetry with Python via opentelemetry-sdk/opentelemetry-api, and export spans to Application Insights via azure-monitor-opentelemetry-exporter. The latter is in preview. It currently appears to drop all span-associated events (#21747). I haven't yet checked if there's a way to auto-instrument logging to be span-aware, but it seems unlikely (especially since the OpenTelemetry logging spec is not yet stabilized).

ranweiler commented 2 years ago

I don't yet see an off-the-shelf mechanism for Custom Events, but it seems like it'd be easy to add.

Confirmed: this was very easy to add to the exporter backend. The design question then becomes: how do we determine when a span-parented Event from tracing should be exported as Application Insights "Trace Telemetry" vs. a "Custom Event"? The presence or absence of a level field is not a viable cue, because all normally-created tracing events currently have a Level.
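One possibility (a purely hypothetical convention, not something the exporter supports today) is to mark such events explicitly, either with a marker field or with a dedicated target that a routing layer could match on:

```rust
use tracing::info;

fn report_new_crash() {
    // Hypothetical convention A: a marker field the exporter could inspect
    // (and strip) to route this event to Custom Event telemetry.
    info!(custom_event = true, event_name = "new_unique_report", "new unique crash report");

    // Hypothetical convention B: a dedicated target matched by a filtering layer.
    info!(target: "onefuzz::custom_event", event_name = "new_unique_report", "new unique crash report");
}
```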

In the long run, OpenTelemetry Logging will make the "event" vs. "log message" distinction explicit, in a way that tracing libraries can propagate in a principled manner. In the short term, we wouldn't be any worse off than we already are (most telemetry would become "trace event" items). Exceptions are special-cased in the Rust backend. Also, Custom Events are not displayed any more nicely in the Transaction Timeline view of Application Insights, nor are they any more queryable than Trace Events.

For our "Custom Events" that are more properly treated as metric data, there is a (now-frozen) OpenTelemetry Metrics API that has feature-flagged support in the Rust Application Insights exporter.
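If we go that route, usage would look roughly like the following. This is a sketch only: the metric and attribute names are placeholders, it assumes a configured metrics pipeline plus the exporter's feature-flagged metrics support, and the counter-builder and add signatures have changed across opentelemetry crate releases.

```rust
use opentelemetry::{global, KeyValue};

fn record_coverage(new_edges: u64) {
    // Hypothetical metric: count of newly covered edges, tagged by task type.
    let meter = global::meter("onefuzz-agent");
    let counter = meter.u64_counter("coverage.new_edges").init();
    counter.add(new_edges, &[KeyValue::new("task_type", "coverage")]);
}
```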