knyar / nginx-lua-prometheus

Prometheus metric library for Nginx written in Lua
MIT License

Is there any way to check the nginx shared memory usage #99

Closed: rjshrjndrn closed this issue 4 years ago

rjshrjndrn commented 4 years ago

Hi,

Thank you for this awesome plugin. I recently added the nginx caching status (HIT or MISS) as a label on nginx_http_requests_total and saw inconsistencies in the metrics; upon further investigation, I found that the nginx shared memory was full. I have now increased it to 100M and it seems to be working fine. So is there any way to expose the current usage, or a rule of thumb for how many metrics can be stored per MB? Regards.

dolik-rce commented 4 years ago

Hi @rjshrjndrn,

Calculating the number of metrics is quite easy. Here is a piece of code I use for this purpose:

-- Creates nginx_prometheus_metric_count gauge. User should call :update() on the returned metric
-- before collecting, to set up-to-date values from prometheus internals.
function metric_count(prometheus, app_name)
  local metric_count = prometheus:gauge("nginx_prometheus_metric_count",
                                "Number of time series served by the prometheus module",
                                {"app"})
  metric_count.update = function(self)
    self:set(prometheus.key_index.last - prometheus.key_index.deleted, {app_name})
  end

  metric_count:set(0, {app_name})
  return metric_count
end

This will return an object which behaves just like a regular Gauge, but has an update method that reads the number of time series currently tracked by the prometheus module.
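For reference, here is a minimal sketch of how a helper like this could be wired into an nginx configuration. This is not from the original comment: the dictionary name prometheus_metrics, the port, and the app label "myapp" are placeholders, and metric_count() is assumed to be available (e.g. required from a file) in the worker.

# Sketch only; adapt names and sizes to your setup.
lua_shared_dict prometheus_metrics 100M;

init_worker_by_lua_block {
  prometheus = require("prometheus").init("prometheus_metrics")
  -- metric_count() is the helper function defined above
  count_gauge = metric_count(prometheus, "myapp")
}

server {
  listen 9145;
  location /metrics {
    content_by_lua_block {
      -- refresh the gauge from the library internals before exporting
      count_gauge:update()
      prometheus:collect()
    }
  }
}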

Calculating how much of the dictionary is actually used is a bit more difficult. If you're using resty.core, there is a method for this, called free_space. If you're using vanilla nginx, then you're probably out of luck.
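To illustrate, a minimal sketch of what reading the dictionary usage could look like under OpenResty with lua-resty-core loaded; the shared dict name prometheus_metrics is a placeholder:

-- Sketch only: assumes OpenResty with lua-resty-core, and a shared dict declared
-- as `lua_shared_dict prometheus_metrics 100M;` in nginx.conf.
local dict = ngx.shared.prometheus_metrics
-- free_space() reports the number of bytes in completely free pages; 0 does not
-- necessarily mean writes will fail yet, but it is a useful warning sign.
local free_bytes = dict:free_space()
-- capacity() (also provided by lua-resty-core) returns the total size of the dict.
local total_bytes = dict:capacity()
ngx.log(ngx.WARN, "prometheus dict: ", free_bytes, " of ", total_bytes, " bytes free")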

In most cases, the unwanted behavior will show up in the metric_count gauge first. Just set up an alert that fires if it exceeds a reasonable number of time series and you'll quickly find out if you are using a label with unexpectedly high cardinality.

rjshrjndrn commented 4 years ago

Thanks, I'll try this out, @dolik-rce.

knyar commented 4 years ago

@dolik-rce, thanks for responding here!

@rjshrjndrn, I am curious whether you saw the error counter (nginx_metric_errors_total) incremented when the shared memory dict got full. While there is no easy way to determine utilization of the shared dict, that metric is designed to help users detect situations in which dictionary writes start failing.

rjshrjndrn commented 4 years ago

I didn't check the memory usage, but with an old version of the library my counter was increasing and I was seeing inconsistencies in the metric data. I then updated the library and increased the shared memory to 100 MB.

knyar commented 4 years ago

Sorry, @rjshrjndrn, I probably have not made my question very clear.

Can you run a query like increase(nginx_metric_errors_total[1h]) in your Prometheus server and see if there are any non-zero values during the period of time when the nginx shared memory was full?

rjshrjndrn commented 4 years ago

@knyar I do have values > 1, but the problem is I don't know whether that indicates the shared memory being full. For that time period, though, I did see discrepancies in the metrics.

knyar commented 4 years ago

Thanks for confirming!

I'd recommend configuring an alert on that metric being > 0. When it fires, you should usually be able to understand what's wrong by looking at the nginx error logs.
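As a hedged sketch of what such alerting rules could look like (thresholds, group name, and label values are illustrative, not from this thread):

groups:
- name: nginx-lua-prometheus
  rules:
  - alert: NginxPrometheusMetricErrors
    # Fires if the module failed to write to the shared dict recently,
    # e.g. because the dictionary is full.
    expr: increase(nginx_metric_errors_total[10m]) > 0
    for: 5m
    annotations:
      summary: "nginx-lua-prometheus is dropping metric updates (shared dict likely full)"
  - alert: NginxPrometheusTooManyTimeseries
    # Uses the nginx_prometheus_metric_count gauge from the helper above;
    # the threshold is illustrative and depends on your dict size.
    expr: nginx_prometheus_metric_count > 10000
    for: 15m
    annotations:
      summary: "Unexpectedly high time series cardinality in nginx-lua-prometheus"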