prometheus / statsd_exporter

StatsD to Prometheus metrics exporter
Apache License 2.0
915 stars 230 forks source link

Safe logging of problems in incoming events [was: Malformed name tag - Log level should be higher] #552

Open Cruuncher opened 5 months ago

Cruuncher commented 5 months ago

The following log is currently set at debug level: https://github.com/prometheus/statsd_exporter/blob/58769c7b4d128e7ceae1cf8d893260ed5b5afa4d/pkg/line/line.go#L124

This log is very useful for debugging missing tags in metrics. However to get it we have to set statsd-exporter at debug level logging, which also brings in the "incoming line" log which logs way too much causing issues for us.

Can we increase this log to at least info level? Debug doesn't make sense to me for such important info

Cruuncher commented 5 months ago

I can open the PR if the change is approved

matthiasr commented 4 weeks ago

The reason we set it at debug level is because it might emit a log line for more or less every incoming event. As a general rule, we've kept everything O(incoming) out of info-level logs, to avoid logging volume (= cost) surprises in normal operation.

Some people run this exporter with a lot of events coming in, logging a significant fraction of these for even a short period across the whole fleet may be wildly expensive. On the other hand, you're right that we are not giving them a lot of options to find out what's wrong without turning on the firehose on at least a subset.

If you're up for it, I would love to have rate limited logging of problematic event lines. What I roughly imagine is that we identify the non-varying parts of the log line, and only emit one line per some time period for each of these. We'll probably want some settings to tune that, since what's acceptable may vary a lot, but as a default I would say one log line per metric cardinality per minute would be a good starting point. Ideally we'd have some indication how many lines we've rate limited away when we do finally flush them.

In general, this would allow us to elevate many of the "something is wrong with this event" logs to Warning, so they're shown and stand out by default.

On this log line in particular: