prometheus / client_python

Prometheus instrumentation library for Python applications
Apache License 2.0

Gunicorn/multiprocess with restarting workers, .db files are not cleaned up #275

Closed: Dominick-Peluso-Bose closed this issue 4 years ago

Dominick-Peluso-Bose commented 6 years ago

We are using Gunicorn with multiple workers and the Gunicorn max_requests option. We found that memory usage in our containers increases with every worker restart. Reducing the max_requests value so that workers restart very frequently made the problem apparent.

Using the method described in the readme:

from prometheus_client import multiprocess

def child_exit(server, worker):
    multiprocess.mark_process_dead(worker.pid)
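
For reference, the same README setup exposes the metrics through a MultiProcessCollector, which reads every .db file in prometheus_multiproc_dir on each scrape, roughly:

from prometheus_client import CollectorRegistry, generate_latest, multiprocess

# Each scrape walks all of the .db files in prometheus_multiproc_dir.
registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)

def metrics_view():
    return generate_latest(registry)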

While running a performance test, I see the following in the container:

[root@mock-adapter-8568fdcf6d-lz8tx app]# cd /app/prometheus_tmp
[root@mock-adapter-8568fdcf6d-lz8tx prometheus_tmp]# ls
counter_10.db   counter_214.db  counter_319.db  counter_41.db   counter_91.db     histogram_196.db  histogram_296.db  histogram_401.db  histogram_69.db
counter_111.db  counter_222.db  counter_31.db   counter_425.db  counter_97.db     histogram_1.db    histogram_305.db  histogram_409.db  histogram_77.db
counter_117.db  counter_232.db  counter_328.db  counter_433.db  counter_9.db      histogram_202.db  histogram_311.db  histogram_419.db  histogram_84.db
counter_130.db  counter_23.db   counter_338.db  counter_444.db  histogram_10.db   histogram_214.db  histogram_319.db  histogram_41.db   histogram_91.db
counter_138.db  counter_244.db  counter_345.db  counter_450.db  histogram_111.db  histogram_222.db  histogram_31.db   histogram_425.db  histogram_97.db
counter_148.db  counter_251.db  counter_353.db  counter_458.db  histogram_117.db  histogram_232.db  histogram_328.db  histogram_433.db  histogram_9.db
counter_154.db  counter_258.db  counter_361.db  counter_470.db  histogram_130.db  histogram_23.db   histogram_338.db  histogram_444.db
counter_175.db  counter_269.db  counter_36.db   counter_48.db   histogram_138.db  histogram_244.db  histogram_345.db  histogram_450.db
counter_187.db  counter_277.db  counter_374.db  counter_56.db   histogram_148.db  histogram_251.db  histogram_353.db  histogram_458.db
counter_18.db   counter_286.db  counter_384.db  counter_61.db   histogram_154.db  histogram_258.db  histogram_361.db  histogram_470.db
counter_196.db  counter_296.db  counter_401.db  counter_69.db   histogram_175.db  histogram_269.db  histogram_36.db   histogram_48.db
counter_1.db    counter_305.db  counter_409.db  counter_77.db   histogram_187.db  histogram_277.db  histogram_374.db  histogram_56.db
counter_202.db  counter_311.db  counter_419.db  counter_84.db   histogram_18.db   histogram_286.db  histogram_384.db  histogram_61.db

Watching that directory, new files are created with every new worker and the old ones are never cleaned up. Deleting the files in that directory brings the memory usage on that container back down, after which it starts building up again.

What is the proper way of dealing with these database files? Is this mark_process_dead's responsibility?

And a general question about Prometheus: do we need to keep metrics around after they are collected? Could we wipe our metrics after the collector hits our metrics endpoint? If so, can this be done via the prometheus_client?

brian-brazil commented 6 years ago

This is the expected behaviour, you can only delete these safely when the whole application restarts. I'd suggest increasing max_requests.

Dominick-Peluso-Bose commented 6 years ago

I'd suggest increasing max_requests.

We are using max_requests to deal with some memory leaks introduced with Gunicorn with gevent workers. Not ideal at all.

But even if we didn't have this problem, workers can restart for other reasons. Is it also expected that memory usage would increase like this? If so, it seems like every service using this pattern would eventually run out of memory over its lifetime.

brian-brazil commented 6 years ago

Memory usage for workers should be constant.

mpitt commented 6 years ago

We have the same issue in our uWSGI deployment. We also restart workers every N requests, and they could restart by themselves because of harakiri (if the request takes too long). I don't think we could just keep the same workers forever, some sneaky memory leak might persist and max-requests is widely used to mitigate it.

Is it at all possible that the prometheus client would support this use case? Or should we just look for another solution?

brian-brazil commented 6 years ago

There's not much we can do here if you have high churn I'm afraid.

Dominick-Peluso-Bose commented 6 years ago

We disabled worker restarts and found that, with Gunicorn and Flask, we still get occasional unplanned worker restarts. So for us, we were able to reduce the large memory leak and now have a very slow one that grows over the course of months.

jkemp101 commented 6 years ago

Couldn't mark_process_dead() summarize the data into a single "dead" file so that there wasn't a need to leave a bunch of files hanging around? Are the values in each individual file needed after the process dies?

brian-brazil commented 6 years ago

That'd require locking across processes, which we want to avoid, and cause issues if that pid was reused.

jkemp101 commented 6 years ago

I think locking across processes in mark_process_dead() and/or collect() would probably be acceptable as long as it doesn't require any locking for "active" processes to record their samples. But of course this would get tricky.

I guess reusing pids is another issue if it mixes samples in the same pid file when it should actually be a true unique file per process. That would cause potential loss of data like a min value for a Gauge.

bloodearnest commented 6 years ago

Hey folks.

We have a workaround for this currently which uses gunicorn's child_exit hook to merge the dead worker's counter/histogram metrics into archive files and deletes the pid file. This has kept our file count and thus scrape response times stable. The attached image shows our scrape times (yellow line) when we deployed our workaround.

Our thinking was that using child_exit avoids the need to lock, as it's run only in the master. We still call the current mark_process_dead() function too, so gauge files are removed (although we are not heavy users of gauges).

Regards pid recycling, I think it is only an issue for gauges, which are still deleted anyway in our experiment. Also, given that filenames with the pid in them should be deleted very soon after the worker's death, the chance of a recycled-pid clash is very remote? I may have missed something though...

It seems to be working for us, but I would appreciate any feedback on anything we may have missed.

Our experiment involves a slightly altered version of the MultiProcessCollector.collect() function, which you can see here: https://github.com/canonical-ols/talisker/blob/master/talisker/metrics.py#L193

The changes are documented there, but for convenience are copied here:

This is almost verbatim from MultiProcessCollector.collect().

The original collects all results in a format designed to be scraped. We
instead need to collect limited results, in a format that can be written
back to disk. To facilitate this, this version of collect() preserves label
ordering, and does not aggregate the histograms.

Specifically, it differs from the original:

1. it takes its files as an argument, rather than hardcoding '*.db'
2. it does not accumulate histograms
3. it preserves label order, to facilitate being inserted back into an mmap file.

It needs to be kept up to date with changes to prometheus_client as much as
possible, or until changes are landed upstream to allow reuse of collect().

We'd like to create a PR that upstreams some of this work, if you think the approach is sensible? Specifically, we'd like to slightly refactor the collect() method to support preserving label order, allow for not accumulating histograms and not hardcoding '*.db'. This would allow us to reuse it, rather than reimplement it.

Also, if this approach is deemed worthwhile, then we'd be happy to work on a PR to expand to include this aggregation as appropriate.

Thanks for reading!

EDIT: typo

brian-brazil commented 6 years ago

Our thinking was that using child_exit avoids the need to lock, as it's run only in the master.

That's not a good assumption. For example if another worker is started up and gets the old pid, that'll break things. We also need this to work on things that aren't Gunicorn.

For this to work, you'd need a lock of some form.

refactor the collect() method to support preserving label order

Does it not already? That's usually something we have.

allow for not accumulating histograms

That violates the Prometheus text format, histograms must be cumulative.

jkemp101 commented 6 years ago

Isn't the current approach already broken if the new worker gets the same PID? How does it currently keep the values for the old and new processes if the PID is the same? Won't the new process start overwriting the old process's values, since it uses the same file? Just trying to understand this better to figure out potential workarounds.

brian-brazil commented 6 years ago

Isn't the current approach already broken if the new worker gets the same PID?

No, it's designed to handle it. It reads in the previous values at startup.

bloodearnest commented 6 years ago

Thanks for the prompt response! :)

That's not a good assumption. For example if another worker is started up and gets the old pid, that'll break things.

Right, I think I see the issue. Because the aggregation is not atomic, there is a small window where a new worker using an old pid can lose data it has written to the pre-existing pid file after that file was read by collect().

Strawman naive solution: add process creation timestamp to the filename? I think that would ensure that they are never reused, regardless of pid recycling? While clever, I suspect the current pid-recycling solution of file reuse is likely to complicate this problem.

We also need this to work on things that aren't Gunicorn.

OK. But general-purpose multi-process serialization is hard, as you have stated. Using SIGCHLD to the master process is one way - and that is what child_exit boils down to. Would it perhaps be beneficial to offer an opt-in documented solution for gunicorn, rather than no solution at all?

For this to work, you'd need a lock of some form.

So, if you're talking about a lock in the master, we could add that. But for gunicorn/child_exit (and possibly Python in general) I think it's not strictly necessary, as child_exit is fired from a SIGCHLD signal handler, and the Python execution of signal handlers is effectively serialized by the interpreter, AIUI.

Does it not already? That's usually something we have.

Sadly not - wrapping them in dict() is what loses the order:

https://github.com/prometheus/client_python/blob/master/prometheus_client/multiprocess.py#L93

They are ordered all the way up to that final step.

My implementation simply used OrderedDict here, but I suspect a cleaner approach would be to avoid the dict() at all, and just return a list of tuples?

If that alone could be changed, then our solution would be improved, as we could use collect() directly, just copying files into a different dir beforehand to get around the hardcoded '*.db', and un-accumulating histograms manually afterwards (this is the approach we took at first, until we hit the ordering issue).

That violates the Prometheus text format, histograms must be cumulative.

Right, sorry, I was a bit vague there in hindsight. The collect() function would retain its current output, with accumulated histograms. But we'd refactor to allow it not to be accumulated: probably a refactoring into a new function that does not accumulate, which collect() would then call before accumulating. We could then just use the new function rather than collect().

The changes are not big, and I think it would help discussion, so I might take some time today to knock up a PoC PR.

bloodearnest commented 6 years ago

Ah, I realised I didn't actually attach the screenshot earlier. It shows scrape times; the yellow line is after our workaround was applied.

[screenshot of scrape times]

brian-brazil commented 6 years ago

Strawman naive solution: add process creation timestamp to the filename? I think that would ensure that they are never reused, regardless of pid recycling?

That could make the number of files worse. Not everyone can mark things dead.

Using SIGCHLD to the master process is one way - and that is what child_exit boils down to. Would it perhaps be beneficial to offer an opt-in documented solution for gunicorn, rather than no solution at all?

We need to work on Windows, and we shouldn't interfere with the signal handling of the process we're instrumenting.

Sadly not - wrapping them in dict() is what loses the order:

Ah yes. That can't really be changed, as that's part of our API. Why do you want to change this?

bloodearnest commented 6 years ago

That could make the number of files worse. Not everyone can mark things dead.

It should only increase the number of files when a recycled pid happens to belong to another worker. And our aggregation would delete them anyway...

We need to work on Windows, and we shouldn't interfere with the signal handling of the process we're instrumenting

Re Windows, I suspect there is no single solution that would work for both anyway?

Regards interfering, it would be up to the application to ensure the cleanup is called at the right time, just like it is currently?

Ah yes. That can't really be changed, as that's part of our API. Why do you want to change this?

The json key used for the mmap file format has the labels in order. To write the collect()ed data back to the aggregated archive files, I need the same order, which sadly the dict() loses.

brian-brazil commented 6 years ago

Re Windows, I suspect there is no single solution that would work for both anyway?

I don't know offhand.

Regards interfering, it would be up to the application to ensure the cleanup is called at the right time, just like it is currently?

Yes. But there's no concurrency restrictions around that, so for example two different processes could be performing it at once.

The json key used for the mmap file format has the labels in order. To write the collect()ed data back to the aggregated archive files, I need the same order, which sadly the dict() loses.

The file format can be changed.

groner commented 6 years ago

I'm also facing rising scrape times with a gunicorn application that has restarting workers. I have a PoC that I worked up before finding this discussion. I can share that if it would be helpful, but I don't want to step on any toes.

I've spent some time thinking about this, and I'd like to see it solved, so I hope you don't mind if I add my thoughts to the discussion.

Regarding timestamped filenames making the problem worse, this would only be a problem when pid reuse actually occurs. I think the users who are likely to be affected will probably have thousands of db files before this happens and scrape times of ten minutes or longer. I'm not saying it's impossible, but it must be vanishingly rare.

If the multiprocess db file can be locked by the writer (when it is opened), then mark_process_dead can test to see if the file is still active by trying to obtain the lock. This could partially work around the issue, but it leaves open the possibility of the livesum/liveall gauge files inheriting values from a dead process.
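
As a sketch of that liveness test, assuming writers held a POSIX advisory lock on their .db files (hypothetical - the current client takes no such lock):

import fcntl

def writer_still_alive(path):
    # Assumes the writing process holds an exclusive flock() on its .db file
    # for as long as it lives. If a non-blocking lock attempt fails, the
    # writer is still running; if it succeeds, the file is safe to archive
    # or remove.
    with open(path, 'rb') as f:
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return True
        fcntl.flock(f, fcntl.LOCK_UN)
        return False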

bloodearnest commented 6 years ago

I don't know offhand.

Looking around, multiprocessing.Lock() might provide a cross-platform solution. I've added it to solve an issue in our solution; it seems to work, but it's not yet in production, so it's unproven.

Yes. But there's no concurrency restrictions around that, so for example two different processes could be performing it at once.

Right, I wonder if documenting what the synchronization requirements are might be a good start? Then applications can use the right approach given their concurrency model?

The file format can be changed.

Right, fair enough. But I don't think it needs to, we just need to avoid losing the order of labels, which seems fairly simple?

brian-brazil commented 6 years ago

Looking around, multiprocessing.Lock() might provide a cross-platform solution.

Hmm, how does that work behind the scenes?

Right, I wonder if documenting what the synchronization requirements are might be a good start? Then applications can use the right approach given their concurrency model?

I think this is something that should Just Work in any environment; I'd rather not have a system that requires extensive documentation and reasoning to work - as users will get it wrong.

bloodearnest commented 6 years ago

Hmm, how does that work behind the scenes?

From the source, it looks like it's using the WaitForSingleObjectEx call on Windows, and a POSIX semaphore on Unix.

I think this is something that should Just Work in any environment; I'd rather not have a system that requires extensive documentation and reasoning to work - as users will get it wrong.

Fair enough, it's just a lot more work to get there :).

If multiprocessing.Lock is suitable, then I think mark_process_dead could maybe be expanded to aggregate into a single file on worker death by default, using that lock?

Rough stab at a general solution: register an atexit function that calls mark_process_dead. That should work for any multiprocess scenario, with the caveat that processes that are forcibly killed will not run the atexit handler, and thus will leak mmapped files. But that's a common enough issue with killing processes that way, and the leak would be limited to rare error scenarios, rather than every process recycle.
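
For reference, a minimal sketch of that atexit registration, run inside each worker process (it assumes prometheus_multiproc_dir is set, as it must be for multiprocess mode anyway):

import atexit
import os

from prometheus_client import multiprocess

# Clean up this process's live-gauge files on normal interpreter exit.
# A worker killed with SIGKILL never runs atexit handlers, so its files
# would still leak, as noted above.
atexit.register(lambda: multiprocess.mark_process_dead(os.getpid()))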

brian-brazil commented 6 years ago

How is the handle/semaphore id shared between the processes?

I think mark_process_dead could maybe be expanded to aggregate into a single file on worker death by default, using that lock?

That only works if you know that no new processes will be created with that pid while you're doing that.

bloodearnest commented 6 years ago

How is the handle/semaphore id shared between the processes?

The lock has to be inherited, so it has to be created in the parent process, and thus only works for a fork()ing model.

That only works if you know that no new processes will be created with that pid while you're doing that.

Right, that could probably be addressed by locking the mmap file opening with the same lock, or as discussed earlier, adding a timestamp or similar to the mmap filename (if we have full worker death cleanup, then extra files are no longer a problem, anyway).

brian-brazil commented 6 years ago

The lock has to be inherited, so it has to be created in the parent process, and thus only works for a fork()ing model.

More particularly it only works if the parent process loads the Prometheus client library, which certain configurations of Gunicorn will not.

bloodearnest commented 6 years ago

More particularly it only works if the parent process loads the Prometheus client library, which certain configurations of Gunicorn will not.

Right. You would probably need to document adding a gunicorn on_starting hook or something that did that.

brian-brazil commented 6 years ago

Gunicorn isn't the only thing in use though, what if it's say mod_python in Apache (which I admit to having never used, and haven't even read the docs) in which a C process is what spawns the children.

bloodearnest commented 6 years ago

Made a PR with the small changes to MultiProcessCollector.collect() I mentioned above.

This doesn't solve the problem, but it makes it easier for others to roll their own solution, as we have done (we had to duplicate the whole of collect() in our code; this would allow us to just call the new merge() function directly).

arthurprs commented 5 years ago

Did anybody manage to solve this? I'm using uwsgi and it's not obvious how I can call mark_process_dead

bloodearnest commented 5 years ago

We managed to solve this for Gunicorn; the bulk of our (rather complex) solution can be seen here:

https://github.com/canonical-ols/talisker/blob/master/talisker/prometheus.py

It's not pretty :(

The core of the solution is this: the master process runs a cleanup function for each worker that dies. This function aggregates the dead worker's files into some central files, and then deletes the old files. This write-and-delete is a critical section that needs protecting with a multiprocess lock of some kind, or else workers reading the files could read the same data twice.

The Gunicorn specifics involve a SIGCHLD handler on the master process and other messy details, but it's been working well for us after a few initial bumps.
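
As a rough structural sketch of that shape (aggregate_dead_worker() here is a hypothetical stand-in for the merge/archive logic in the talisker code linked above, not the real implementation; sharing the lock with the scrape path is the fiddly part):

# gunicorn.conf.py - structural sketch only.
import multiprocessing

from prometheus_client import multiprocess

# Created in the master before workers fork, so every process inherits it.
prometheus_lock = multiprocessing.Lock()

def aggregate_dead_worker(pid):
    # Hypothetical stand-in: merge counter_<pid>.db / histogram_<pid>.db into
    # central archive files, then delete the per-pid files. Elided here.
    pass

def child_exit(server, worker):
    # Runs in the gunicorn master for each worker that dies. The merge plus
    # delete is the critical section described above, so do it under the lock.
    with prometheus_lock:
        aggregate_dead_worker(worker.pid)
        multiprocess.mark_process_dead(worker.pid)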

However, that solution is very specific to gunicorn's multiprocess model. uWSGI is a whole other kettle of fish. AFAICS, it doesn't support hooking in to SIGCHLD handling on the master, which is in C, so not so easily extensible. One option might be to define a custom worker as documented here:

https://uwsgi-docs.readthedocs.io/en/latest/WorkerOverride.html

You can maybe do the work in the destroy() function, which I assume happens in the worker process, but you'd really need multiprocess locking here. We used the stdlib's multiprocessing.Lock(), but as discussed in this thread, it requires a Python master process to initialise it. Under the hood it's just a semaphore based on a shared memory file path, and some form of that might be usable, if you can figure out how to ensure each worker is using the same path.

Are you aware of https://github.com/Lispython/pyprometheus ? That is an alternative Prometheus client library with specific support for uwsgi's shared memory, as opposed to using mmap files in a shared directory. I have not used it, and it's not the official Prometheus client, but it might help here - you may be able to use its approach with prometheus_client.

HTH

Edit: did some more research, and sharedarea in uWSGI is just a locking wrapper around a shared mmap, but it does the multiprocess locking for you, which is nice.

akx commented 5 years ago

It should be noted that you only need mark_process_dead if you use livesum or liveall gauge multiprocess types, as that's all it cleans up.

https://github.com/prometheus/client_python/blob/4aa1936c28054a310573ed9656954a159dbedbf2/prometheus_client/multiprocess.py#L120-L127
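
For context, those are gauges declared with the corresponding multiprocess_mode, for example:

from prometheus_client import Gauge

# Only gauges like this produce the per-pid live-gauge files that
# mark_process_dead() deletes; counter and histogram files are left alone.
IN_PROGRESS = Gauge('requests_in_progress', 'Requests currently being served',
                    multiprocess_mode='livesum')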

ados1991 commented 5 years ago

Have you found any solution for this issue?

xuansontrinh commented 5 years ago

hello @bloodearnest, I have attempted to use your solution to tackle the db file number problem. I used the solution on three metric types: Counter (counter_{pid}.db), Histogram (histogram_{pid}.db), and Summary (summary_{pid}.db). I tried decreasing the max_requests config in Gunicorn and ran some stress tests, and the results are promising: the number of files is kept to a minimum. However, I have some questions about the procedure:

  1. I occasionally got the error RuntimeError: Read beyond file size detected, /opt/prometheus/summary_archive.db is corrupted. and that meant the db files were not merged correctly, leaving some of the dead pids' summary db files behind. The error has not happened to counter_archive.db and histogram_archive.db yet from my observation. Is that the reason you did not archive the summary db files in your code?
  2. Assuming the db files are all merged well, will there be a day when {metrics_type}_archive.db grows too large and causes problems?
  3. Will this procedure slow down and/or block the gunicorn master at some points? I am concerned about the impact of this when it comes to production.

Thank you for your time.

Edit: Below is the image describing the 1st question. My gunicorn is configured to run with 3 workers: [screenshot]

tino commented 5 years ago

I'm using

import os

import prometheus_client.values

if "prometheus_multiproc_dir" in os.environ:
    import uwsgi

    # Key the multiprocess value files by uwsgi worker id instead of pid.
    prometheus_client.values.ValueClass = prometheus_client.values.MultiProcessValue(
        uwsgi.worker_id
    )

which makes it so that it doesn't use the pid but the worker id, which is reused on worker restart.

This looks like it works in dev. @brian-brazil would there be problems with this setup?

The only thing that looks suspicious to me is not having a _created gauge for counters. Is that to be expected with multiprocess setup?

brian-brazil commented 4 years ago

This issue has gotten a bit long covering many related issues. The initial report was not a bug, so let's open new issues for any new discussion around improvements here.

bloodearnest commented 4 years ago

Hi, sorry, I know this issue is now closed, but I wanted to reply to @xuansontrinh; I'd missed the question addressed to me.

hello @bloodearnest, I have attempted to use your solution to tackle the db file number problem. I used the solution on three metric types: Counter (counter_{pid}.db), Histogram (histogram_{pid}.db), and Summary (summary_{pid}.db). I tried decreasing the max_requests config in Gunicorn and ran some stress tests, and the results are promising: the number of files is kept to a minimum. However, I have some questions about the procedure:

  1. I occasionally got the error RuntimeError: Read beyond file size detected, /opt/prometheus/summary_archive.db is corrupted. and that meant the db files were not merged correctly, leaving some of the dead pids' summary db files behind. The error has not happened to counter_archive.db and histogram_archive.db yet from my observation. Is that the reason you did not archive the summary db files in your code?

No - the reason is that we don't use summaries! I expect you'll need to extend the code to do so. If the summaries are a different format from counters/histograms, then you may need to write code to merge them.

  2. Assuming the db files are all merged well, will there be a day when {metrics_type}_archive.db grows too large and causes problems?

I don't think this should be an issue. I think the file size should depend on the number of metrics/labels your app has, not the length of time it's been around, as the merge should collate all of them into one entry per metric/label combination.

Also, we avoid this by using a new tmpdir for prometheus_multiproc_dir when the gunicorn service restarts, to meet the requirement of clearing it between service starts as in the documentation. And as each deploy is a new service, it will only grow for the lifetime of a deployment, which is typically only a few weeks at most.
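
A rough sketch of that arrangement, e.g. at the top of the gunicorn config file (which gunicorn loads before the application, and therefore prometheus_client, is imported):

# gunicorn.conf.py - sketch only: point prometheus_multiproc_dir at a
# brand-new directory for each service start, so .db files from a previous
# run can never accumulate.
import os
import tempfile

os.environ['prometheus_multiproc_dir'] = tempfile.mkdtemp(prefix='prometheus_')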

  3. Will this procedure slow down and/or block the gunicorn master at some points? I am concerned about the impact of this when it comes to production.

The gunicorn master does do more work. But it spends most of its time idle anyway, as it handles no traffic. I guess if you have many workers there might be an issue, especially if you SIGHUP gunicorn, as all the current workers will die. FTR, we use 64 workers and regularly SIGHUP, without any issue that we've detected.

Thank you for your time.

Edit: Below is the image describing the 1st question. My gunicorn is configured to run with 3 workers: [screenshot]

ya-pekatoros commented 2 months ago

Hi everyone! Sorry for reviving this after two years, but I still don't understand what I should do about the prometheus multiproc dir being flooded with .db files. It slows down /metrics collection. Right now we just clean up the dir on each deploy, but if there are no deploys for a few days, it leads to timeouts on the /metrics request.

What is the best practice here? Cron cleaning job?

rsamadzade commented 1 month ago

Hi All. I am also facing the same issue. Memory reaches the limit and the application restarts each time. I tried all types of workers, but still no success. I tried running a cron job to delete the files, but after each deletion I get a pod error for prometheus. Also, if a file was being used by some worker, the metrics endpoint would raise an exception, and again an error would be registered via prometheus.

The suggestion above to restart the server is not suitable for me, as I need it to run without any restarts and I don't want the application to depend on metrics. Any help is appreciated. Thanks!