siimon / prom-client

Prometheus client for node.js

Memory leak #142

Open gwilakers opened 7 years ago

gwilakers commented 7 years ago

Hi!

I currently have an implementation of prom-client running that seems to continuously grow the node heap until an eventual crash/service restart when collecting API metrics. I am collecting the default metrics and a couple of custom ones that just count websocket connections; these all work fine. The problem arises from the API metrics I try to collect using a middleware that sits on our Hapi routes. This middleware listens for 'onRequest' and then 'response' to collect a response time in milliseconds, while adding labels for method, path, statusCode, and username.

Here is what the middleware looks like (I do not believe the problem is with this):

/**
* This is the middleware that looks at each request as it comes through the server
* @param {Object} server This is the instantiated Hapi server that we wish to listen on.
* @param {Object} options This is an object that can contain additional data.
* @param {Function} next This is the callback function that allows additional requests to continue.
* @return {Function} Callback function.
*/
const middlewareWrapper = (server, options, next) => {
  server.ext('onRequest', (request, reply) => {
    request.startTime = process.hrtime()
    return reply.continue()
  })
  server.on('response', (request) => {
    logic.apiPerformance(request.response.statusCode, request.startTime, sanitizeRequestData(request))
  })
  return next()
}

The metrics themselves are used inside of that apiPerformance function which looks like this:

  /**
  * This function takes various information about a request, performs some basic calculations, and then sends that off to loggly.
  * @param {Integer} statusCode This is the HTTP statusCode that will be returned to the user with the response.
  * @param {Array} startTime The process.hrtime() tuple captured when the request was received. https://nodejs.org/api/process.html#process_process_hrtime_time
  * @param {Object} responseData An object with sanitized response data.
  */
  apiPerformance: (statusCode, startTime, responseData) => {
    const diff = process.hrtime(startTime)
    const responseTime = ((diff[0] * 1e3) + (diff[1] * 1e-6))
    const time = new Date()
    logger.transports.loggly.log('info', 'Route Log', {
      responseCode: statusCode,
      route: responseData.route,
      host: responseData.host,
      port: responseData.port,
      method: responseData.method,
      userData: responseData.userData,
      responseTime: `${responseTime}ms`,
      params: responseData.params,
      timestamp: time.toISOString(),
      tags: ['events', 'api', 'performance']
    }, (error, response) => {
      if (error) console.log(error)
    })
    if (config.metrics.prometheus.enabled) {
      requestDurationSummary.labels(responseData.method, responseData.route, statusCode, responseData.userData.displayName).observe(responseTime)
      requestBucketsHistogram.labels(responseData.method, responseData.route, statusCode, responseData.userData.displayName).observe(responseTime)
    }
  }

The metrics are registered at the top of that file like so:

const requestDurationSummary = prometheus.createSummary('http_request_duration_milliseconds', 'request duration in milliseconds', ['method', 'path', 'status', 'user'])
const requestBucketsHistogram = prometheus.createHistogram('http_request_buckets_milliseconds', 'request duration buckets in milliseconds.', ['method', 'path', 'status', 'user'], [ 500, 2000 ])

Node: v6.1.0, Hapi: v16.4.3, prom-client: v9.1.1

We are including prom-client as part of a custom package (it's basically a common libs package) that we include with our projects. Inside this package, we include prom-client and expose only the methods we use (collectDefaultMetrics, registerMetrics, createGauge, createSummary, createHistogram). Another thing to note is that all metrics except http_request_duration_milliseconds and http_request_buckets_milliseconds (the two metrics posted above) are registered inside the application we are tracking metrics for. Those two are registered in the code I posted above, which lives in the custom package itself.

Any other information I can provide please let me know. Any help or suggestions are greatly appreciated. Thanks!

SimenB commented 7 years ago

Does the app leak even if you remove your own metrics?

siimon commented 7 years ago

Could #140 be the culprit?

SimenB commented 7 years ago

No, OP is using v9, the problem was only in v10

gwilakers commented 7 years ago

The leak happens only when using the middleware that registers http_request_duration_milliseconds and http_request_buckets_milliseconds. I have that middleware sitting behind a feature flag, and if I turn it off, the memory leak goes away.

SimenB commented 7 years ago

What happens if you do a delete request.startTime after reading it? Maybe it isn't GCed properly

gwilakers commented 7 years ago

I will give that a try and let you know!

SimenB commented 7 years ago

Should you be using the plugins part of the request (https://hapijs.com/api#request-properties)? Or maybe app (if app is per request)?
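
Something like this, maybe (just a sketch based on those docs, with the same shape as the middlewareWrapper above; request.plugins.metrics is an arbitrary namespace key):

const middlewareWrapper = (server, options, next) => {
  server.ext('onRequest', (request, reply) => {
    // request.plugins is the hapi-documented spot for per-request plugin state
    request.plugins.metrics = { startTime: process.hrtime() }
    return reply.continue()
  })
  server.on('response', (request) => {
    const { startTime } = request.plugins.metrics || {}
    logic.apiPerformance(request.response.statusCode, startTime, sanitizeRequestData(request))
  })
  return next()
}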

I've never used hapi, just trying to be a good rubber duck 🙂


gwilakers commented 7 years ago

Sorry for the delayed response. I was able to grab some heap dumps after implementing memwatch-next and heapdump and have compared them. The comparison shows that the requestDurationSummary metric listed in the code samples above seems to be preventing garbage collection.

[screenshot: heap dump comparison]

I am testing a few theories for things I think could be a fix, but if that information makes you think of anything else let me know. Thanks again for helping with this!

siimon commented 7 years ago

What does the createSummary function look like? prom-client only exposes the 'raw' metric objects; there are no factory functions that create new metrics in prom-client.

Just to be sure, you are using the prom-client module, right? No other 'wrappers' (like https://www.npmjs.com/package/prometheus-wrapper)? I've seen those names in that module, which is why I'm asking!

gwilakers commented 7 years ago

Yes, we are using prom-client. The way it's being implemented is via a common lib package where only the methods we want to use are exposed. Here is the code for the createSummary function (included some extraneous pieces as well to provide a clearer picture):

/**
 * @overview Prometheus implementation for exposing metrics (NodeJS).
 */
const _ = require('lodash') // needed for _.extend below
const config = require('../../config')
const prometheusClient = require('prom-client')

module.exports = {
  /**
  * Extend the configuration object to include the default host and port.
  */
  config: _.extend(config.metrics.prometheus, {connection: {host: '0.0.0.0', port: 9500}}),
  /**
  * Exposes default system metrics at a given interval.
  * @param {number} interval Interval at which to expose metrics (milliseconds).
  * @return {defaultMetrics} See definition below.
  */
  collectDefaultMetrics: (interval) => {
    return prometheusClient.collectDefaultMetrics(interval)
  },
  /**
  * Registers all available metrics for consumption by prometheus.
  * @return {string}  Metrics available for prometheus to scrape.
  */
  registerMetrics: () => {
    return prometheusClient.register.metrics()
  },
  /**
  * Creates a new gauge with which values can be increased or decreased.
  * @param {string} name The name of the metric (metric_name).
  * @param {string} help The help name of the metric (metric_name_help).
  * @param {Array} labels These are identifiers that help filter through different custom metrics.
  * @return {Object} New gauge to set values on.
  */
  createGauge: (name, help, labels = []) => {
    return new prometheusClient.Gauge({name: name, help: help, labelNames: labels})
  },
  /**
  * Creates a new summary with which values can be used to view sample observations over a sliding time window.
  * @param {string} name The name of the metric (metric_name).
  * @param {string} help The help name of the metric (metric_name_help).
  * @param {Array} labels These are identifiers that help filter through different custom metrics.
  * There is also an optional 'percentiles' property that can be overridden, but the default values will work fine.
  * Those default values are: percentiles: [ 0.01, 0.05, 0.5, 0.9, 0.95, 0.99, 0.999 ]
  * @return {Object} New summary to set values on.
  */
  createSummary: (name, help, labels = []) => {
    return new prometheusClient.Summary({name: name, help: help, labelNames: labels})
  }
}

This implementation is then imported into the project as a package. I don't believe the way we are implementing this is the issue, though, as I am using the createGauge method with no problems. The problem only arises when using the previously pasted code with summaries and histograms.

SimenB commented 7 years ago

(I just edited to add syntax highlighting).

I agree that that code snippet looks OK.

SkeLLLa commented 7 years ago

I'm not familiar with hapi, but it seems there's some closure that doesn't allow the GC to collect some variables. I've done a similar thing, but at a lower level (by adding hooks to the Node.js http lib), and there were no memory leaks.

Also, if you have requests with many different URLs like /api/user/<userId>, they will create a lot of distinct label values (one time series each), which could cause memory growth too.
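
For example (a sketch, not your code): labelling with the raw URL path gives one time series per distinct id, while labelling with the route template keeps cardinality bounded. In hapi the template should be available as request.route.path:

// High cardinality: one time series per concrete URL (/api/user/42, /api/user/43, ...)
requestDurationSummary.labels(method, request.path, statusCode, user).observe(ms)

// Bounded cardinality: one time series per route template (/api/user/{userId})
requestDurationSummary.labels(method, request.route.path, statusCode, user).observe(ms)

(The user label in your snippets has the same property: one series per distinct displayName.)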

BTW, @gwilakers, can you provide the code where middlewareWrapper is called?

PS: if you're interested, you can try my module. I think it should work with hapi. If it also causes a memory leak, then the problem is probably in the number of different requests.

gwilakers commented 7 years ago

@SkeLLLa Yeah, a closure is the cause for sure, but I am having some issues releasing the variable. @SimenB In my initial post here I listed the code showing how I have the metrics set up. I have tried adding:

delete request.startTime

in the response handler of the middleware, as well as

delete diff
delete responseTime

in the apiPerformance function after calling the two metrics in the if statement at the bottom. Neither of those changes solved the issue. Is it possible that the variable is not getting released inside of prom-client? I'm pretty much out of ideas at this point, so any suggestions are welcome. :)

SkeLLLa commented 7 years ago

@gwilakers I don't think something in prom-client could be holding on to the variable, because it's doing simple operations like:

total += yourVariable;

You can also try prom-client version 8.x (I'm still using the old one for various reasons :)). We've been using that version for a long time with different metrics, including ones in middleware, with timers, etc., and I can assure you that there are no memory leaks inside prom-client 8.x.

BTW: also try commenting out the logger stuff, and try assigning values from responseData to separate variables, for example const method = responseData.method. I believe V8 actually optimizes such cases and passes only values, but try that trick just to be sure the responseData object is collected early by the GC.

mlarcher commented 6 years ago

@gwilakers did you eventually find the source of the memory leak? I'm facing the same issue in our Express-based app.

tomauty commented 6 years ago

@mlarcher Just finished banging my head against the wall trying to solve a leak until I stopped requiring this lib.

SimenB commented 6 years ago

Help tracking down the leak would be very much appreciated! We're running this in production at work without a leak, so it's probably some particular combination of metrics, labels, or something like that triggering it.

tomauty commented 6 years ago

I'll definitely try to find something this week, because it was working very nicely for a bit. We were requiring it in multiple places; could that do anything?

gwilakers commented 6 years ago

Hey, sorry for the hiatus guys, I was out of the country for a while. The cause of the memory leak in our specific case was a closure that maintained a reference to the request object being passed around. I had originally registered a Hapi plugin that implemented prom-client, attaching some information to the request before passing it along. Doing it this way kept the request object reachable long enough for garbage collection to pass it by, causing a slow but inevitable leak.
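
Roughly, the problematic shape was something like this (an illustrative sketch, not our actual plugin code; doSomethingAsync and recordMetrics are placeholders):

// Anti-pattern: the async callback closes over the whole request,
// so its headers, payload and socket stay reachable until the callback runs.
server.on('response', (request) => {
  doSomethingAsync((err) => {
    recordMetrics(request)
  })
})

// Safer: pull out only the primitives you need before going async.
server.on('response', (request) => {
  const statusCode = request.response.statusCode
  const startTime = request.startTime
  doSomethingAsync((err) => {
    recordMetrics(statusCode, startTime)
  })
})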

@mlarcher @tomauty I would double-check your implementations for similar issues as well. It was really hard to find in our case and took a lot of trial and error. As a starting point, I'd recommend using heapdump to see which references are still being held on to after garbage collection.

SimenB commented 6 years ago

Shouldn't have any effect, as long as you don't mess with the require.cache object so that things are loaded multiple times.

gwilakers commented 6 years ago

I'm not sure. I wasn't modifying that property, and yet the reference to the request was still being passed around.

mlarcher commented 6 years ago

I played around with heapdump for a while too, and noticed a first-level module function being referenced by itself in cascade... It wasn't clear to me what was happening, but setting endTimer to null after using it seems to have fixed the issue for us.
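
For context, the pattern looks roughly like this (a sketch; httpRequestDuration is a stand-in prom-client Histogram and endTimer is the function returned by startTimer()):

// Express-style middleware sketch
app.use((req, res, next) => {
  let endTimer = httpRequestDuration.startTimer({ method: req.method })
  res.on('finish', () => {
    endTimer({ status: res.statusCode })
    endTimer = null // drop the reference so the closure can be collected
  })
  next()
})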

atd-schubert commented 5 years ago

In my opinion the summary is leaking memory. In this line the summary is pushing data on every call of the observe method.

At the moment I am not sure whether there is any way to implement it without leaking memory. So my questions are: do you know that this is leaking memory? Do you know of any way to prevent memory leaks there? Should one add a warning that Summaries are quite expensive and should be used with care because of possible memory leaks?
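
To illustrate what I mean, here is a simplified model (not prom-client's actual implementation) of why a sliding-window quantile summary accumulates memory: every observation has to be kept until it ages out of the window, so memory grows with the observation rate.

class WindowedSummary {
  constructor (maxAgeMs) {
    this.maxAgeMs = maxAgeMs
    this.samples = []
  }
  observe (value) {
    this.samples.push({ value, t: Date.now() }) // one entry per observe() call
  }
  prune () {
    const cutoff = Date.now() - this.maxAgeMs
    this.samples = this.samples.filter((s) => s.t >= cutoff)
  }
}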

SimenB commented 5 years ago

Do you have a reproduction showing that it leaks?

atd-schubert commented 5 years ago

I think the easiest way to show the leaking behavior is this leakage-test I wrote:

const iterate = require("leakage").iterate;
const {Summary, Histogram} = require("prom-client");

describe("prom-client", () => {
    describe("summary", () => {
        const summary = new Summary({name: "test_summary", help: "test for detecting mem-leaks"});
        it("should not leak on Summaries", () => iterate(() => summary.observe(Math.random()))).timeout(5000);
    });
    describe("histogram", () => {
        const histogram = new Histogram({name: "test_histogram", help: "test for detecting mem-leaks"});
        it("should not leak on Histograms", () => iterate(() => histogram.observe(Math.random()))).timeout(5000);
    });

});

Note: it's written with mocha, not jest.

You will see that Summary is leaking, but Histogram does not.

firasdib commented 5 years ago

We're seeing some memory issues in prod as well, and I'm starting to suspect it's prom-client.

Was this ever confirmed to be an issue?

siimon commented 5 years ago

There is a PR open that might fix it, but I’m waiting for clarifications in it. If you are in an experimental mood you could try that branch out and see if it solves your problem.

iamajoe commented 4 years ago

I'm also experiencing issues. It crashed my production servers and I eventually needed to comment prom-client out. I also suspect a memory leak. I've tried resetting from time to time, but it seems that wasn't enough.

Here is the setup code:

  Prometheus.collectDefaultMetrics({
    prefix: `${namespace}`,
    timeout: 5000
  });

  httpReqDurationMs = new Prometheus.Histogram({
    name: `${namespace}http_request_duration_ms`,
    help: 'Duration of HTTP requests in ms',
    labelNames: ['method', 'route', 'routerPath'],
    buckets: [50, 100, 200, 300, 400, 500, 750, 1000, 2000, 3000, 4000, 5000]
  });

  httpReqCounter = new Prometheus.Counter({
    name: `${namespace}http_request_count`,
    help: 'Count of requests',
    labelNames: ['method', 'routerPath'],
  });

  httpReqErrorCounter = new Prometheus.Counter({
    name: `${namespace}http_request_error_count`,
    help: 'Count of error requests',
    labelNames: ['method', 'routerPath', 'statusCode'],
  });

Just giving one more possible test case. I'm on version 11.5.3.
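
The reset I mentioned looks roughly like this (a sketch; it assumes register.resetMetrics() is available in your prom-client version):

  // Periodically clear accumulated label sets as a stop-gap.
  setInterval(() => {
    Prometheus.register.resetMetrics()
  }, 15 * 60 * 1000)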

analytik commented 4 years ago

For what it's worth, I can confirm that 12.0.0 works infinitely better than 11.5.3, which either slowly grew in memory over days, or in some cases OOM'd a minute after start. Thanks for fixing this! 🙇‍♀️