siimon / prom-client

Prometheus client for node.js
Apache License 2.0
3.15k stars 378 forks source link

Event loop lag metric (gauge) does NOT measure average event lag #561

Open shappir opened 1 year ago

shappir commented 1 year ago

The built-in metric (gauge) nodejs_eventloop_lag_seconds is marked as measuring the average event loop lag: the aggregator property value for it is average.

However, what it actually reports is the amount of time required to generate and send the previous metrics response.

This is equivalent to the maximum event loop lag value rather than the average. To get the actual average you need to use nodejs_eventloop_lag_mean_seconds instead (for Node servers that support it). In our case the average event loop lag is under 11ms while the nodejs_eventloop_lag_seconds gets as high as 1 second !

I recommend:

  1. Fixing nodejs_eventloop_lag_seconds to actually report the average event loop lag
  2. Introduce a new metric to report the metrics response generation time, say nodejs_metrics_response_time

Important: the amount of time required to generate the metrics response remains high, and needs to be improved.

shappir commented 1 year ago

This is how nodejs_eventloop_lag_seconds is currently computed:

  1. In eventLoopLag.js a Gauge is created with that name and a collect method
  2. That collect method is invoked whenever the metric's values are requested
  3. The method measures the high-res time when it is invoked. Since it's a built-in metric, it will be one of the first, and so records the time at which metrics collection starts
  4. It then uses setImmediate to schedule another measurement. But since that measurement happens in the next iteration of the event loop, it will take place after the metrics are all collected and reported
  5. Since the collect method doesn't return a promise, the value it measures will only be reported in the next metrics invocation.

Since the delta is from when metrics starts until after the response is sent, this is what it measures. And it's not updated until the next invocation of metrics.

AntunVukicevic commented 1 month ago

I just spent hours figuring out why this metric deviates so much from the rest of event loop metrics. Thank you for clarification @shappir 👏🏻

It would be really nice if this is fixed.