Closed NickLarsenNZ closed 5 months ago
Update Re:
Allow env var for trace-filter to be customised (eg: HDFS_OPERATOR_LOG instead of RUST_LOG)
I have a primitive version of this working, but it should be configurable separately for console logs, OTLP logs, and OTLP traces.
Options:

_LOG at the end of our operator's variables, so it would be weird), like:
HDFS_OPERATOR_LOG (console logs)
HDFS_OPERATOR_LOG_OTLP_LOG (otlp-logs)
HDFS_OPERATOR_LOG_OTLP_TRACE (otlp-traces)

HDFS_OPERATOR_LOG (console logs)
HDFS_OPERATOR_LOG_OTLP (otlp-logs and otlp-traces)

HDFS_OPERATOR_LOG (console logs)
HDFS_OPERATOR_OTLP_LOG (otlp-logs)
HDFS_OPERATOR_OTLP_TRACE (otlp-traces)

@Techassi, when you're back, we could chat about it. I'm sure we will be able to come to a good-enough solution.
Edit: We went with the last option (configured by the implementor)
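For anyone reading later, here is a rough sketch of what "configured by the implementor" could look like. It is not the code that was merged: `FilterEnvVars`, `Filters`, `resolve_filters` and the fallback behaviour are hypothetical names and assumptions made up for illustration. The idea is only that the binary (e.g. an operator) picks the env var name for each output, and each output gets its own filter string with a shared default.

```rust
/// Env var names chosen by the implementor (hypothetical type, for illustration).
pub struct FilterEnvVars<'a> {
    pub console: &'a str,
    pub otlp_log: &'a str,
    pub otlp_trace: &'a str,
}

/// Resolved filter directives, one per output (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct Filters {
    pub console: String,
    pub otlp_log: String,
    pub otlp_trace: String,
}

/// Look up `var` via `lookup`; fall back to `default` if it is unset or blank.
/// Assumption: blank values are treated the same as unset ones.
fn resolve(lookup: &dyn Fn(&str) -> Option<String>, var: &str, default: &str) -> String {
    lookup(var)
        .map(|value| value.trim().to_string())
        .filter(|value| !value.is_empty())
        .unwrap_or_else(|| default.to_string())
}

/// Resolve the three filters independently, so console logs, OTLP logs and
/// OTLP traces can each be tuned without affecting the others.
pub fn resolve_filters(
    lookup: &dyn Fn(&str) -> Option<String>,
    vars: &FilterEnvVars,
    default: &str,
) -> Filters {
    Filters {
        console: resolve(lookup, vars.console, default),
        otlp_log: resolve(lookup, vars.otlp_log, default),
        otlp_trace: resolve(lookup, vars.otlp_trace, default),
    }
}

/// Convenience wrapper that reads the real process environment.
pub fn resolve_filters_from_env(vars: &FilterEnvVars, default: &str) -> Filters {
    resolve_filters(&|name: &str| std::env::var(name).ok(), vars, default)
}
```

An operator would call `resolve_filters_from_env` with its own names (for example `HDFS_OPERATOR_LOG`, `HDFS_OPERATOR_OTLP_LOG`, `HDFS_OPERATOR_OTLP_TRACE`) and hand each resulting string to whatever filter the subscriber layer uses. Passing the lookup as a closure is just to keep the sketch testable without touching the process environment.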
@NickLarsenNZ can we close this issue?
Is there anything we documented for this or is it "only" groundwork for now?
There's nothing to document here really (other than the code that has doc-comments on it). The parent initiative has an item for writing contributor docs for instrumenting apps.
The stack has no demo yet, since we would need either a webhook or an operator to be using this. So the demo docs will come when there is a Stackable demo.
Overview
To keep it brief, modern applications should provide a richer way to troubleshoot problems than trawling through log data. Logs give information about what happened (an event), but lack other dimensions such as how long something took and what triggered it. Logging should be a last resort when troubleshooting, and instead metrics and trace data should be utilized (which can then link to applicable logs to reduce noise and simplify the troubleshooting process). Metrics can help answer questions known ahead of time, while trace data can go beyond and answer questions that only come up later.
This epic is about implementing tracing and metrics specifically for webhooks (but with the intention of clearing the path to easily instrument the operators, expanding the troubleshooting capabilities).
OpenTelemetry SDKs and the OTLP protocol will be used, but this should be explained further in the Improve Observability initiative ticket (currently in a non-public repository).
The diagram below gives a high-level overview of where the various telemetry data can end up, and will likely become a stack/demo to aid in development and eventually assist Stackable users in getting set up.
Part 1
This is the library side implementation, and does not cover actual operator implementations.
Acceptance Criteria
Part 2
Moved to https://github.com/stackabletech/issues/issues/598
Part 3
Plan to implement OpenTelemetry Metrics Provider for Operators (Prometheus, and/or OTLP export).
References