Add full OpenTelemetry compatible request tracing for webhooks

NickLarsenNZ commented 9 months ago

Overview

To keep it brief, modern applications should provide a richer way to troubleshoot problems than trawling through log data. Logs give information about what happened (an event), but lack other dimensions such as how long something took and what triggered it. Logging should be a last resort when troubleshooting; metrics and trace data should be used first (and can then link to the applicable logs to reduce noise and simplify the troubleshooting process). Metrics help answer questions known ahead of time, while trace data can go further and answer questions that only come up later.

This epic is about implementing tracing and metrics specifically for webhooks (but with the intention of clearing the path to easily instrument the operators, expanding the troubleshooting capabilities).

OpenTelemetry SDKs and the OTLP protocol will be used, but this should be explained further in the Improve Observability initiative ticket (currently in a non-public repository).

The diagram below gives a high-level overview of where the various telemetry data can end up, and will likely become a stack/demo to aid in development and eventually assist Stackable users in getting set up.

Part 1

This is the library side implementation, and does not cover actual operator implementations.

### Tasks
- [ ] https://github.com/stackabletech/demos/pull/35
- [x] Instrument the webhook handlers with [`#[tracing::instrument]`](https://docs.rs/tracing/latest/tracing/attr.instrument.html) and [`tracing::debug!(...)`](https://docs.rs/tracing/latest/tracing/macro.debug.html#examples) (operator-rs) https://github.com/stackabletech/operator-rs/pull/758
- [x] Create tracing subscriber initialization helpers (operator-rs) https://github.com/stackabletech/operator-rs/pull/758
- [ ] https://github.com/stackabletech/operator-rs/pull/767
- [ ] https://github.com/stackabletech/operator-rs/pull/811
- [ ] https://github.com/stackabletech/operator-rs/pull/796
- [ ] https://github.com/stackabletech/operator-rs/pull/801
- [ ] https://github.com/stackabletech/operator-rs/pull/815

Acceptance Criteria

- [x] Reusable code for when we do the same to operators.
- [x] Using Semantic Conventions (even if done by a library) for span fields.
- [x] Uses OTLP as the transport (prefer gRPC over HTTP, avoid JSON serialization).
- [x] Instrument something (dummy webhook?) and see traces in Jaeger.

Part 2

Moved to https://github.com/stackabletech/issues/issues/598

Part 3

Plan to implement OpenTelemetry Metrics Provider for Operators (Prometheus, and/or OTLP export).

References

NickLarsenNZ commented 6 months ago

Update Re:

Allow env var for trace-filter to be customised (eg: HDFS_OPERATOR_LOG instead of RUST_LOG)

I have a primitive version of this working, but it should be configurable separately for console logs, OTLP logs, and OTLP traces.

Options:

1. Use my existing implementation to set the prefix (although we use _LOG at the end of our operator's variables, so it would be weird), like:
   - HDFS_OPERATOR_LOG (console logs)
   - HDFS_OPERATOR_LOG_OTLP_LOG (otlp-logs)
   - HDFS_OPERATOR_LOG_OTLP_TRACE (otlp-traces)
2. We could do the same as above, but just merge OTLP together, so:
   - HDFS_OPERATOR_LOG (console logs)
   - HDFS_OPERATOR_LOG_OTLP (otlp-logs and otlp-traces)
3. Have each configurable, so we could use (just example names, because we need to consider other env vars that might be used, eg: enabling traces, or disabling console logging):
   - HDFS_OPERATOR_LOG (console logs)
   - HDFS_OPERATOR_OTLP_LOG (otlp-logs)
   - HDFS_OPERATOR_OTLP_TRACE (otlp-traces)

@Techassi, when you're back, we could chat about it. I'm sure we will be able to come to a good-enough solution.

Edit: We went with the last option (configured by the implementor)
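
For anyone reading along, here is a rough std-only sketch of what "each sink configurable by the implementor" could look like. It is not the operator-rs code: the `Sink` enum, `SinkConfig`, `sink_config`, `resolve_filter` and the `info` defaults are all made up for illustration, and the env var names are just the example names from the last option above.

```rust
/// The three telemetry sinks discussed in this comment.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Sink {
    ConsoleLog,
    OtlpLog,
    OtlpTrace,
}

/// Hypothetical per-sink settings picked by the implementor (e.g. an operator).
struct SinkConfig {
    /// Env var holding the filter directive for this sink.
    env_var: &'static str,
    /// Fallback directive when the env var is unset or blank (assumed default).
    default_filter: &'static str,
}

/// Example mapping using the names from the last option; a real implementor
/// would choose its own names and defaults.
fn sink_config(sink: Sink) -> SinkConfig {
    match sink {
        Sink::ConsoleLog => SinkConfig { env_var: "HDFS_OPERATOR_LOG", default_filter: "info" },
        Sink::OtlpLog => SinkConfig { env_var: "HDFS_OPERATOR_OTLP_LOG", default_filter: "info" },
        Sink::OtlpTrace => SinkConfig { env_var: "HDFS_OPERATOR_OTLP_TRACE", default_filter: "info" },
    }
}

/// Resolve the filter directive for one sink. The lookup is passed in so the
/// logic can be exercised without touching the process environment.
fn resolve_filter(sink: Sink, lookup: impl Fn(&str) -> Option<String>) -> String {
    let cfg = sink_config(sink);
    lookup(cfg.env_var)
        .map(|value| value.trim().to_string())
        .filter(|value| !value.is_empty())
        .unwrap_or_else(|| cfg.default_filter.to_string())
}

fn main() {
    for sink in [Sink::ConsoleLog, Sink::OtlpLog, Sink::OtlpTrace] {
        // In a real binary the lookup is simply the process environment.
        let filter = resolve_filter(sink, |name| std::env::var(name).ok());
        println!("{:?}: {}", sink, filter);
    }
}
```

The point is only that each sink gets its own variable and its own fallback, so console logs can stay quiet while OTLP traces are turned up (or the other way around). The resolved string would then be handed to whatever filter each subscriber layer uses.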

sbernauer commented 5 months ago

@NickLarsenNZ can we close this issue?

lfrancke commented 5 months ago

Is there anything we documented for this or is it "only" groundwork for now?

NickLarsenNZ commented 5 months ago

There's nothing to document here really (other than the code that has doc-comments on it). The parent initiative has an item for writing contributor docs for instrumenting apps.

The stack is without a demo since we would need either a webhook or an operator to be using this. So the demo docs will come when there is a Stackable demo.
