stackabletech / issues

This repository is only for issues that concern multiple repositories or don't fit into any specific repository.

Add full OpenTelemetry compatible request tracing for webhooks #531

Closed: NickLarsenNZ closed this issue 3 months ago

NickLarsenNZ commented 7 months ago

Overview

To keep it brief: modern applications should provide a richer way to troubleshoot problems than trawling through log data. Logs give information about what happened (an event), but lack other dimensions, such as how long something took and what triggered it. Logging should be a last resort when troubleshooting; metrics and trace data should be used instead (and can link to the relevant logs to reduce noise and simplify troubleshooting). Metrics can answer questions known ahead of time, while trace data can also answer questions that only come up later.

This epic is about implementing tracing and metrics specifically for webhooks, with the intention of clearing the path to instrument the operators easily and expand their troubleshooting capabilities.

OpenTelemetry SDKs and the OTLP protocol will be used, but this should be explained further in the Improve Observability initiative ticket (currently in a non-public repository).

The diagram below gives a high-level overview of where the various telemetry data can end up. It will likely become a stack/demo to aid in development and, eventually, to help Stackable users get set up.

[Diagram: high-level overview of where the various telemetry data can end up]

Part 1

This is the library side implementation, and does not cover actual operator implementations.

### Tasks
- [ ] https://github.com/stackabletech/demos/pull/35
- [x] Instrument the webhook handlers with [`#[tracing::instrument]`](https://docs.rs/tracing/latest/tracing/attr.instrument.html) and [`tracing::debug!(...)`](https://docs.rs/tracing/latest/tracing/macro.debug.html#examples) (operator-rs) https://github.com/stackabletech/operator-rs/pull/758
- [x] Create tracing subscriber initialization helpers (operator-rs) https://github.com/stackabletech/operator-rs/pull/758
- [ ] https://github.com/stackabletech/operator-rs/pull/767
- [ ] https://github.com/stackabletech/operator-rs/pull/811
- [ ] https://github.com/stackabletech/operator-rs/pull/796
- [ ] https://github.com/stackabletech/operator-rs/pull/801
- [ ] https://github.com/stackabletech/operator-rs/pull/815
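To illustrate what instrumenting a webhook handler adds over plain log lines, here is a minimal, std-only Rust sketch. It does not use the `tracing` crate; `SpanRecord` and `handle_request` are hypothetical stand-ins. In the real code, `#[tracing::instrument]` opens a span and records the function arguments as fields, `tracing::debug!` emits events inside that span, and a subscriber (set up by the initialization helpers) decides where the data goes. The sketch only models the resulting data: the span's fields, its timestamped events, and its total duration.

```rust
use std::time::{Duration, Instant};

/// Hypothetical, std-only stand-in for a tracing span: a name, key/value
/// fields, timestamped events, and the total duration once closed.
#[derive(Debug)]
pub struct SpanRecord {
    pub name: &'static str,
    pub fields: Vec<(&'static str, String)>,
    /// Each event stores its offset from the start of the span.
    pub events: Vec<(Duration, String)>,
    /// `None` while the span is open, `Some(elapsed)` after `close()`.
    pub duration: Option<Duration>,
    start: Instant,
}

impl SpanRecord {
    /// Opens a span (roughly what entering an instrumented function does).
    pub fn enter(name: &'static str, fields: Vec<(&'static str, String)>) -> Self {
        SpanRecord {
            name,
            fields,
            events: Vec::new(),
            duration: None,
            start: Instant::now(),
        }
    }

    /// Records an event inside the span (roughly what `tracing::debug!`
    /// does when called from an instrumented function).
    pub fn debug(&mut self, message: &str) {
        self.events.push((self.start.elapsed(), message.to_string()));
    }

    /// Closes the span and records how long it was open.
    pub fn close(mut self) -> Self {
        self.duration = Some(self.start.elapsed());
        self
    }
}

/// Hypothetical webhook handler: the request UID becomes a span field,
/// so every event below can be tied back to the request that caused it.
pub fn handle_request(request_uid: &str, payload: &str) -> (Result<String, String>, SpanRecord) {
    let mut span = SpanRecord::enter("handle_request", vec![("request_uid", request_uid.to_string())]);
    span.debug("received request");

    let result = if payload.is_empty() {
        span.debug("rejected empty payload");
        Err(format!("request {request_uid} has an empty payload"))
    } else {
        span.debug("request handled");
        Ok(format!("request {request_uid} handled"))
    };

    (result, span.close())
}
```

Because each event carries its offset from the span start and the span carries its total duration and triggering request, this kind of data can answer "how long did this request take, and where was the time spent?", which a flat log line cannot.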

Acceptance Criteria

Part 2

Moved to https://github.com/stackabletech/issues/issues/598

Part 3

Plan to implement an OpenTelemetry Metrics Provider for operators (Prometheus and/or OTLP export).

References

NickLarsenNZ commented 4 months ago

Update Re:

Allow env var for trace-filter to be customised (e.g. HDFS_OPERATOR_LOG instead of RUST_LOG)

I have a primitive version of this working, but it should be configurable separately for console logs, OTLP logs, and OTLP traces.

Options:

@Techassi, when you're back, we could chat about it. I'm sure we will be able to come to a good-enough solution.


Edit: We went with the last option (configured by the implementor).
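As a rough illustration of an implementor-configured filter env var, kept separate for console logs, OTLP logs, and OTLP traces, here is a std-only Rust sketch. Everything in it is hypothetical: the `SignalSettings` and `TelemetrySettings` types, the `resolve_with` method, and every env var name other than `HDFS_OPERATOR_LOG` are made up for illustration and are not the operator-rs API. The environment lookup is passed in as a closure so the logic can be exercised without mutating the process environment.

```rust
use std::env;

/// Hypothetical per-signal settings chosen by the operator implementor
/// (illustration only, not the operator-rs API).
#[derive(Debug, Clone)]
pub struct SignalSettings {
    /// Env var holding the filter directive, e.g. "HDFS_OPERATOR_LOG".
    pub env_var: &'static str,
    /// Directive used when the env var is unset or blank.
    pub default_directive: &'static str,
    /// Whether this signal is emitted at all.
    pub enabled: bool,
}

/// Hypothetical grouping: one setting per signal, so each can be filtered separately.
#[derive(Debug, Clone)]
pub struct TelemetrySettings {
    pub console_log: SignalSettings,
    pub otlp_log: SignalSettings,
    pub otlp_trace: SignalSettings,
}

impl SignalSettings {
    /// Returns the filter directive for this signal, or `None` if it is disabled.
    /// `lookup` abstracts over the environment so the logic is easy to test.
    pub fn resolve_with(&self, lookup: impl Fn(&str) -> Option<String>) -> Option<String> {
        if !self.enabled {
            return None;
        }
        match lookup(self.env_var) {
            Some(value) if !value.trim().is_empty() => Some(value),
            _ => Some(self.default_directive.to_string()),
        }
    }

    /// Same as `resolve_with`, but reads the real process environment.
    pub fn resolve(&self) -> Option<String> {
        self.resolve_with(|name| env::var(name).ok())
    }
}
```

In a real operator, the resolved directive would then be handed to a filter such as `tracing_subscriber`'s `EnvFilter`; that wiring is omitted here. Keeping one setting per signal is what lets, for example, OTLP traces run at a more verbose level than console logs.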

sbernauer commented 3 months ago

@NickLarsenNZ can we close this issue?

lfrancke commented 3 months ago

Is there anything we documented for this or is it "only" groundwork for now?

NickLarsenNZ commented 3 months ago

There's nothing to document here really (other than the code that has doc-comments on it). The parent initiative has an item for writing contributor docs for instrumenting apps.

The stack has no demo yet, since that would need either a webhook or an operator to be using this. The demo docs will come once there is a Stackable demo.