Momentary spikes in events received

PrayagS commented 9 months ago

Version: v0.25.0

Configuration:

eventQueueSize: 10000
eventFlushThreshold: 1000
eventFlushInterval: 200ms

We have an external deployment of statsd_exporter where various services ingest their metrics.

Recently, we started seeing brief spikes in the following metrics:

statsd_exporter_events_total
statsd_exporter_sample_errors_total (reason being invalid_sample_factor)

As there was a spike in events received, there was also a spike in UDP packets received and events being flushed.

There was no spike in the rate of samples (statsd_exporter_samples_total), though.

What concerns me is this: if these samples are failing during parsing, why does the corresponding Prometheus metric show the same spike?

[Screenshot of the metrics]

Let me know if there are any more signals I can look at to debug this. Resource usage has been normal throughout. I'm not sure how helpful debug logs would be, because these spikes happen at random for any of the metrics ingested by this exporter.

matthiasr commented 8 months ago

Notice that the sample error spikes are at a much lower magnitude than the overall event spikes. It seems that you have a spiky workload and some of it has sampling factors that the exporter cannot parse.

To understand this further, you will need to run the exporter with --log.level=debug and observe the actual incoming lines. Beware that this can produce a lot of logs in a short amount of time. This will also give you more clues whether the issue is a sampling factor that cannot be parsed as a float (error message "Invalid sampling factor") or whether there is a component to the metric line that the exporter does not expect (error message "Invalid sampling factor or tag section"). Maybe we ought to use different labels for those two cases…?

Unfortunately, we can only log details about parsing errors at debug level, because of the volume of messages this can generate :/
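
To make the two cases above concrete, here is a minimal Python sketch of how a StatsD line could end up in one bucket or the other. It is a simplified illustration, not the exporter's actual parser (which is written in Go). The function name `classify_line` and the "Malformed line" label are hypothetical; only the two error messages quoted above come from this thread. It assumes the common `name:value|type|@rate|#tags` line layout.

```python
from typing import Optional


def classify_line(line: str) -> Optional[str]:
    """Return an error description for a StatsD line, or None if it looks fine.

    Simplified illustration only; this is NOT the exporter's real parser.
    """
    parts = line.split("|")
    # parts[0] is "name:value", parts[1] is the metric type (c, g, ms, ...).
    if len(parts) < 2:
        return "Malformed line"  # hypothetical label, not an exporter message

    # Everything after the type is expected to be "@<rate>" or "#<tags>".
    for component in parts[2:]:
        if component.startswith("@"):
            try:
                float(component[1:])
            except ValueError:
                # The sampling factor is present but is not a float.
                return "Invalid sampling factor"
        elif component.startswith("#"):
            # DogStatsD-style tag section; accepted as-is in this sketch.
            continue
        else:
            # A component that is neither a sampling factor nor tags.
            return "Invalid sampling factor or tag section"
    return None


if __name__ == "__main__":
    for sample in ("foo:1|c|@0.1", "foo:1|c|@abc", "foo:1|c|xyz"):
        print(sample, "->", classify_line(sample))
```

Running captured debug-log lines through something like this may help tell apart a non-numeric sampling factor from an unexpected extra component, although the exporter's own rules can differ in the details.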

PrayagS commented 8 months ago

Thanks a lot for getting back.

I did plot the reason for the sample errors and yes, they're being reported as invalid_sample_factor.

Will enable debug logs to see what's happening here.

matthiasr commented 7 months ago

Closing this issue for now, as it is likely a problem with invalid inputs. If you think the exporter should know how to handle them, please provide sample input lines, and which statsd implementation accepts them!

prometheus / statsd_exporter

Momentary spikes in events received #535