Open shappir opened 1 year ago
This is how `nodejs_eventloop_lag_seconds` is currently computed:

1. In `eventLoopLag.js`, a `Gauge` is created with that name and a `collect` method.
2. The `collect` method is invoked whenever the metric's values are requested.
3. It calls `setImmediate` to schedule another measurement. But since that measurement happens in the next iteration of the event loop, it takes place after the metrics are all collected and reported.
4. Because the `collect` method doesn't return a promise, the value it measures is only reported in the next `metrics` invocation.

Since the delta runs from when `metrics` starts until after the response is sent, that is what the gauge measures. And it's not updated until the next invocation of `metrics`.
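To make the sequence above concrete, here is a minimal sketch of the pattern as described in this issue. It is not the library's actual code: `FakeGauge` and `scrape` are hypothetical stand-ins, and the only assumption carried over is that `collect` schedules its measurement with `setImmediate` and returns without waiting for it.

```javascript
// Hypothetical stand-in for a gauge; NOT prom-client's real Gauge class.
class FakeGauge {
  constructor() {
    this.value = 0;
  }

  set(v) {
    this.value = v;
  }

  // Mirrors the described pattern: note the start time, schedule the
  // measurement with setImmediate, and return without awaiting it.
  collect() {
    const start = process.hrtime.bigint();
    setImmediate(() => {
      const deltaNs = process.hrtime.bigint() - start;
      this.set(Number(deltaNs) / 1e9); // seconds
    });
  }
}

// Hypothetical scrape: run collect(), then read the value synchronously,
// the way a metrics response would be assembled.
function scrape(gauge) {
  gauge.collect();
  return gauge.value; // still the value stored by the PREVIOUS scrape
}
```

In this sketch the first `scrape` call returns the initial `0`, because the `setImmediate` callback has not run yet. Any synchronous work done after `collect()` and before the callback fires (such as serializing the rest of the response) is included in the delta, and that delta only becomes visible on the following scrape.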
I just spent hours figuring out why this metric deviates so much from the rest of the event loop metrics. Thank you for the clarification @shappir 👏🏻
It would be really nice if this were fixed.
The built-in metric (gauge) `nodejs_eventloop_lag_seconds` is marked as measuring the average event loop lag: the `aggregator` property value for it is `average`.

However, what it actually reports is the amount of time required to generate and send the previous `metrics` response.

This is equivalent to the maximum event loop lag value rather than the average. To get the actual average you need to use `nodejs_eventloop_lag_mean_seconds` instead (for Node servers that support it). In our case the average event loop lag is under 11ms while `nodejs_eventloop_lag_seconds` gets as high as 1 second!

I recommend:
1. Change `nodejs_eventloop_lag_seconds` to actually report the average event loop lag.
2. Add a separate metric for `metrics` response generation time, say `nodejs_metrics_response_time`.
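As a rough illustration of the two recommendations, the sketch below reads a mean event loop lag from Node's `perf_hooks.monitorEventLoopDelay` histogram and separately times whatever function renders the metrics response. The helper names `readMeanLagSeconds` and `timeMetricsResponse`, the 10 ms resolution, and the reset-on-read behaviour are assumptions made for this example, not a proposal for how the library should implement it.

```javascript
const { monitorEventLoopDelay, performance } = require('perf_hooks');

// Samples event loop delay in the background; values are in nanoseconds.
const histogram = monitorEventLoopDelay({ resolution: 10 });
histogram.enable();

// Hypothetical helper: mean lag in seconds since the last read.
// Resetting on each read is a design choice for this sketch only.
function readMeanLagSeconds() {
  const meanNs = histogram.mean;
  histogram.reset();
  // Guard against a non-finite mean when no samples were recorded yet.
  return Number.isFinite(meanNs) ? meanNs / 1e9 : 0;
}

// Hypothetical helper: times any async function that renders the
// metrics response and returns both the body and the elapsed seconds.
async function timeMetricsResponse(renderMetrics) {
  const start = performance.now();
  const body = await renderMetrics();
  const seconds = (performance.now() - start) / 1000;
  return { body, seconds };
}
```

Keeping the two measurements separate means a slow metrics response shows up under its own name instead of inflating the lag gauge.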
Important: the amount of time required to generate the `metrics` response remains high, and needs to be improved.