stackabletech / operator-rs

A simple wrapper/framework around kube-rs to make implementing Operators/Controllers easier
Apache License 2.0

feat: Add `stackable-telemetry` utility crate #758

Closed Techassi closed 4 months ago

Techassi commented 5 months ago

This set of changes introduces a new crate: stackable-telemetry. So far, this includes:

In this set of changes, we update the stackable_webhook implementation to automatically use the axum tracing middleware.

We intend to add metrics support in a future PR.

Tracked by https://github.com/stackabletech/issues/issues/531

Screenshots

Search for Traces by Service and Span Name

image

Looking at a Trace with its related Spans and Attributes

image

Looking at Trace Events within each Span

image

NickLarsenNZ commented 4 months ago

Thanks @adwk67, @maltesander.

Regarding the issue with dropped traces:

2024-04-18T14:35:47.411Z    error    exporterhelper/queue_sender.go:101    Exporting failed. Dropping data.    {"kind": "exporter", "data_type": "traces", "name": "otlp/tempo", "error": "not retryable error: Permanent error: rpc error: code = FailedPrecondition desc = TRACE_TOO_LARGE: max size of trace (5000000) exceeded while adding 834977 bytes to trace 03175d1b3375c763983f5a0637d17125 for tenant single-tenant", "dropped_items": 701}

I have investigated with a simple axum web server, and it appears to be a tracing loop: a request comes in and generates traces, those traces are sent via OTLP, sending them generates more traces, which are then sent via OTLP in turn, and so on.

I am able to stop the loop by changing the LevelFilter for h2 (http2), and have asked on the CNCF #otel-rust channel to see if anyone else has experienced it.

# The variable name will be configurable in a future PR
RUST_LOG=trace,h2=off cargo run --release

We could also enforce it by hard-coding a directive.
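As a rough sketch of what hard-coding the directive could look like: the helper below appends `h2=off` to whatever filter the user supplied, so the `h2` target is silenced even if `RUST_LOG` does not mention it. The function name `with_h2_disabled` is hypothetical and not part of this PR. The commented-out `EnvFilter::new` call assumes the `tracing-subscriber` crate, and the sketch assumes a later directive for the same target takes precedence over an earlier one.

```rust
use std::env;

/// Directive that silences the `h2` (HTTP/2) target.
const H2_OFF: &str = "h2=off";

/// Hypothetical helper: returns the user-supplied filter with `h2=off`
/// appended as the last directive.
fn with_h2_disabled(user_filter: &str) -> String {
    // Strip surrounding whitespace and stray commas so we never emit
    // an empty directive such as "trace,,h2=off".
    let trimmed = user_filter.trim_matches(|c: char| c == ',' || c.is_whitespace());
    if trimmed.is_empty() {
        H2_OFF.to_string()
    } else {
        format!("{trimmed},{H2_OFF}")
    }
}

fn main() {
    // RUST_LOG is the variable used in the command above; an unset
    // variable is treated as an empty filter.
    let user_filter = env::var("RUST_LOG").unwrap_or_default();
    let filter = with_h2_disabled(&user_filter);

    // Assumed usage with the tracing-subscriber crate (not compiled here):
    // let env_filter = tracing_subscriber::EnvFilter::new(&filter);
    println!("effective filter: {filter}");
}
```

The trade-off of enforcing the directive this way is that users can no longer opt back in to `h2` output through the environment variable.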

Because we are not using this crate yet, I'm OK with this being merged as-is and the tracing loop being fixed in a future PR.