prometheus / client_python

Prometheus instrumentation library for Python applications

High CPU when using Gunicorn with multiprocess #568

Open onzo-mateuszzakarczemny opened 4 years ago

onzo-mateuszzakarczemny commented 4 years ago

I run a Flask app on Gunicorn 20.0.4 and Kubernetes. My Python is v3.7.5. I have a problem with CPU usage increasing over time. After investigating stack traces using py-spy, I noticed that the issue is caused by the Prometheus metrics library. It doesn't clean up metric files from old workers, so over time the merge process takes more and more CPU.

After deleting the following files, the CPU usage dropped significantly:

ll -lh
total 7.0M
drwxr-xr-x 2 root root 4.0K Jul 24 06:13 ./
drwxrwxrwt 1 root root 4.0K Jul 19 14:14 ../
-rw-r--r-- 1 root root 1.0M Jul 24 08:14 counter_106.db
-rw-r--r-- 1 root root 1.0M Jul 23 18:41 counter_112.db
-rw-r--r-- 1 root root 1.0M Jul 24 04:07 counter_118.db
-rw-r--r-- 1 root root 1.0M Jul 24 04:54 counter_136.db
-rw-r--r-- 1 root root 1.0M Jul 24 08:40 counter_142.db
-rw-r--r-- 1 root root 1.0M Jul 20 16:44 counter_16.db
-rw-r--r-- 1 root root 1.0M Jul 20 11:24 counter_17.db
-rw-r--r-- 1 root root 1.0M Jul 21 01:40 counter_18.db
-rw-r--r-- 1 root root 1.0M Jul 21 20:14 counter_40.db
-rw-r--r-- 1 root root 1.0M Jul 21 17:17 counter_52.db
-rw-r--r-- 1 root root 1.0M Jul 21 21:29 counter_58.db
-rw-r--r-- 1 root root 1.0M Jul 23 07:19 counter_70.db
-rw-r--r-- 1 root root 1.0M Jul 22 19:49 counter_82.db
-rw-r--r-- 1 root root 1.0M Jul 22 18:59 counter_88.db
-rw-r--r-- 1 root root 1.0M Jul 24 08:43 histogram_106.db
-rw-r--r-- 1 root root 1.0M Jul 24 04:15 histogram_112.db
-rw-r--r-- 1 root root 1.0M Jul 24 05:02 histogram_118.db
-rw-r--r-- 1 root root 1.0M Jul 24 08:43 histogram_136.db
-rw-r--r-- 1 root root 1.0M Jul 24 08:43 histogram_142.db
-rw-r--r-- 1 root root 1.0M Jul 20 16:46 histogram_16.db
-rw-r--r-- 1 root root 1.0M Jul 20 11:45 histogram_17.db
-rw-r--r-- 1 root root 1.0M Jul 21 01:51 histogram_18.db
-rw-r--r-- 1 root root 1.0M Jul 21 22:41 histogram_40.db
-rw-r--r-- 1 root root 1.0M Jul 21 17:45 histogram_52.db
-rw-r--r-- 1 root root 1.0M Jul 22 01:44 histogram_58.db
-rw-r--r-- 1 root root 1.0M Jul 23 07:37 histogram_70.db
-rw-r--r-- 1 root root 1.0M Jul 23 01:01 histogram_82.db
-rw-r--r-- 1 root root 1.0M Jul 22 23:40 histogram_88.db

The issue is related to https://github.com/prometheus/client_python/issues/275

Can we avoid that somehow without periodically restarting the k8s pod? Maybe multiprocess should use PID + UUID generated on the worker, rather than just PID, for file names? Then the master could remove/merge files from dead workers.
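For reference, the client's multiprocess value class can already be swapped for one with a custom process identifier. A minimal, hypothetical sketch of the PID + UUID idea; by itself it only changes the file naming and does not merge or remove stale files:

import os
import uuid

import prometheus_client.values

# hypothetical: name the per-process files by PID plus a per-worker UUID so the
# master could later recognise and reconcile files left behind by dead workers;
# this must run in each worker after fork, before any metrics are created
WORKER_ID = f"{os.getpid()}_{uuid.uuid4().hex}"

prometheus_client.values.ValueClass = prometheus_client.values.MultiProcessValue(
    process_identifier=lambda: WORKER_ID)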

brian-brazil commented 4 years ago

Maybe multiprocess should use PID + UUID generated on worker rather than just PID for file names?

That's a new idea, which could avoid the problems of #518. Do you want to try it out and let me know how it goes? Keep in mind that this library supports all forms of multi-process deployments, not just gunicorn.

akx commented 4 years ago

Maybe multiprocess should use PID + UUID generated on worker rather than just PID for file names?

That's a new idea, which could avoid the problems of #518. Do you want to try it out and let me know how it goes? Keep in mind that this library supports all forms of multi-process deployments, not just gunicorn.

Not exactly a new idea... I discussed using UUIDs back in https://github.com/prometheus/client_python/pull/421#issuecomment-499450153 :)

stepanovsh commented 4 years ago

Any updates on this issue? We still have this problem and need to periodically reboot our instances.

analytically commented 3 years ago

Seeing the same.

tabu-a commented 3 years ago

Seeing a similar issue.

ljcjc commented 3 years ago

What about the Gunicorn hook child_exit? Did it not work? There shouldn't be leftover files from old workers; the child_exit function could clean up the files when a worker exits.

igor-susic commented 3 years ago

Yes, now that you mention it, this part should do it:

from prometheus_client import multiprocess

# Gunicorn server hook: runs in the master process after a worker exits
def child_exit(server, worker):
    multiprocess.mark_process_dead(worker.pid)

Aelentel commented 3 years ago

I had the same problem: multiprocess.mark_process_dead() only cleans up "gauge" files, not the others. As a possible solution, the gauge files should indeed be removed, but all the others (counter, histogram, etc.) should be summed into an aggregate file to avoid the file explosion.
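A rough, hypothetical sketch of that aggregation idea, using the client's internal MmapedDict helper to read a dead worker's counter file and sum its records in memory (actually persisting the aggregate in the library's on-disk format is omitted here):

import collections
import glob
import os

from prometheus_client.mmap_dict import MmapedDict

def sum_dead_worker_counters(pid, multiproc_dir):
    # hypothetical helper: fold a dead worker's counter samples into an
    # in-memory total so its per-process file could then be deleted
    totals = collections.defaultdict(float)
    for path in glob.glob(os.path.join(multiproc_dir, f'counter_{pid}.db')):
        for key, value, _timestamp, _pos in MmapedDict.read_all_values_from_file(path):
            totals[key] += value
        os.remove(path)
    return totals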

benjimin commented 3 years ago

FWIW, I solved a similar-looking problem here, where CPU and transmit bandwidth both ramped up over time (until either the pods were recreated afresh or the deployment failed) on a web server with multiprocess Gunicorn and child_exit configured. The cause was that the Prometheus flask exporter had mistakenly been left to default to labelling metrics by path, while the number of unique paths for that service was nearly unbounded, so recurring Prometheus scrapes collecting hundreds of thousands of metrics were too demanding.

agsha commented 10 months ago

Any update on this? This issue is blocking us from using Prometheus and forcing us to consider something else.

benjimin commented 10 months ago

Hi @agsha, this sounds like a very common misconfiguration of your individual project, rather than a problem with prometheus (or this python client).

This is very easy to confirm or rule out: go to your app's metrics endpoint and do a line count. If the line count is large (say >200 lines), especially after the app has been operating in production for a while, then this is the problem. All of the documentation for Prometheus (and for this Python client) clearly states that you are responsible for tightly limiting the cardinality of metrics and labels. Often an application is configured to export too many separate metrics: in other words, the app misuses Prometheus by exporting too many distinct label values and combinations of labels, producing too many separate measurements that Prometheus then has to track as separate time series. This is commonly caused by confusion between the conceptual models of metrics and logs, where unfamiliar users inadvertently attach too fine-grained metadata to their metrics and cause a cardinality explosion.
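To illustrate with a hypothetical snippet (metric names made up for the example): the first counter explodes in cardinality because its label value is unbounded, while the second keeps the label set small and fixed.

from prometheus_client import Counter

# unbounded label value: every distinct user id creates another time series
requests_by_user = Counter("myapp_requests_by_user_total",
                           "Requests labelled by user id (cardinality explosion)",
                           ["user_id"])

# bounded labels: a handful of methods and status classes keeps the series count small
requests = Counter("myapp_requests_total",
                   "Requests labelled by method and status class",
                   ["method", "status_class"])
requests.labels(method="GET", status_class="2xx").inc()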

See here for another recent example of the trivial one-line configuration fix, assuming you're using the prometheus flask exporter. (There's a more detailed analysis in the link I posted previously in this thread.) If you're not using prometheus flask exporter, but instead using this python client directly, then the details of the fix for your project will depend on your code. (Do you have a link to share, to the code and the output from the metrics endpoint?)
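For reference, with prometheus flask exporter the usual fix is to group request metrics by Flask endpoint rather than by raw request path (a sketch, assuming that library's group_by option):

from flask import Flask
from prometheus_flask_exporter import PrometheusMetrics

app = Flask(__name__)
# group by the view function (endpoint) instead of the raw path, so the
# label set stays bounded no matter how many distinct URLs are requested
metrics = PrometheusMetrics(app, group_by='endpoint')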

To resolve the repo issue here, I propose:

- every hundredth increment of the total count of metrics, this client should emit a warning message that includes the current tally and an explanation of the potential hazard (a sketch follows below);
- the docs could include an even more prominent admonition about this aspect of the intended usage than they currently do;
- a downstream linked issue should be opened in prometheus flask exporter, to make coarser grouping the default behaviour;
- this issue could be tagged "not a bug", since it is the desired behaviour that this instrumentation keeps reporting every separate metric that has previously been encountered.

That's the purpose: the scraper can convert each known metric into a dense time series, so that even slow or irregular fluctuations can be monitored and resolved clearly, without gaps in the collected data. It isn't this library's fault if a downstream project autogenerates arbitrarily many new label values or new metric names, causing the set of known metrics to grow unmanageably large. And even if this client were modified to remember how long ago each value last changed and drop ones that vary infrequently, the central Prometheus database would still be overloaded, because it still has to separately store each separately-labelled time series it has ever encountered. Any cardinality explosion will eventually cause performance failures at some level of the user's infrastructure.
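As a rough illustration of the first proposal (a hypothetical helper, not an existing client feature): count the exposed series at scrape time and log a warning each time the tally crosses another hundred.

import logging

import prometheus_client

_last_warned_hundreds = 0

def generate_with_cardinality_warning(registry):
    # hypothetical wrapper around generate_latest() that warns as the number
    # of exposed series passes each additional hundred
    global _last_warned_hundreds
    payload = prometheus_client.generate_latest(registry)
    series = sum(1 for line in payload.splitlines() if line and not line.startswith(b"#"))
    if series // 100 > _last_warned_hundreds:
        _last_warned_hundreds = series // 100
        logging.warning("metrics endpoint now exposes %d series; "
                        "unbounded label values will keep growing this number", series)
    return payload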

OTOH, if anyone can give a reproducible example where the line count of the exported metrics endpoint remains small even though the resource usage associated with .db files keeps growing, that would be a bug (failure to consolidate metrics that are encountered by short-lived workers).

benjimin commented 10 months ago

I took another look at this in more detail. TL;DR:

The code for this client seemed strange to me, as if someone had started to implement a central shared-memory structure for inter-process communication but left it unfinished. Instead, each process writes its state to a separate file (or to multiple files if using multiple types of metric), using a bespoke binary format. On the plus side, this makes it possible to externally interrogate the running state, and it alleviates most of the need for synchronisation (although synchronisation appears to be implemented already).

Here I prepared a simple test of this Prometheus Python client (with just a counter metric), without using prometheus flask exporter:

import os, random, multiprocessing
os.environ["PROMETHEUS_MULTIPROC_DIR"] = os.getcwd()  # must be set before the client modules are imported
import prometheus_client.multiprocess
registry = prometheus_client.CollectorRegistry()
prometheus_client.multiprocess.MultiProcessCollector(registry)  # merges the per-process .db files at scrape time
metric = prometheus_client.Counter("My_Counter", "An example measurement with label", ["Some_Field"])

Then I simulated a large number of separate worker processes while still limiting the number of distinct labels to be small:

def low_cardinality_worker():
    metric.labels(str(random.randint(0, 5))).inc() # only half a dozen possible distinct metrics
def spawn_one_worker_process():
    subprocess = multiprocessing.Process(target=low_cardinality_worker)
    subprocess.start()
    subprocess.join() # wait for process to exit
    return subprocess.pid
pids = [spawn_one_worker_process() for i in range(100)] # test having many worker processes

These hundred separate processes generated exactly a hundred separate .db files, each less than a few KB in size. We can aggregate the output as follows (this simulates scraping the endpoint, and confirms we have only half a dozen distinct sets of label values):

>>> print(prometheus_client.generate_latest(registry).decode('utf-8'))
# HELP My_Counter_total An example measurement with label
# TYPE My_Counter_total counter
My_Counter_total{Some_Field="0"} 18.0
My_Counter_total{Some_Field="3"} 14.0
My_Counter_total{Some_Field="2"} 15.0
My_Counter_total{Some_Field="1"} 19.0
My_Counter_total{Some_Field="5"} 16.0
My_Counter_total{Some_Field="4"} 18.0

We can also inspect the raw contents of any of these .db files as follows. (Each records the partial metric from just one process, and so needs to be summed with matching records in other files to arrive at the aggregate values.)

>>> print(list(prometheus_client.mmap_dict.MmapedDict.read_all_values_from_file(f'counter_{pids[-1]}.db')))
[('["My_Counter", "My_Counter_total", {"Some_Field": "2"}, "An example measurement with label"]', 1.0, 0.0, 112)]

Note that even if the counter gets incremented thousands of times, the length of the .db file doesn't change; it just updates the value tallied within the stored record. So here, the .db file is still no more than a few KB:

for i in range(10000):
    metric.labels("foo").inc() # low cardinality test

However, if we allow many different label strings, then we get a cardinality explosion. The following example produces a 1 MB .db file containing nearly 10k separate records:

for i in range(10000):
    metric.labels(str(random.randint(0, 1e6))).inc() # cardinality explosion

If we run this loop ten times longer then the .db file grows to over 10MB (as it will now be storing nearly 100k separate metrics).

Next, note that running prometheus_client.multiprocess.mark_process_dead(pid) will not clean up any of the old .db files, except if you are using a specific type of live gauge metric for which the values from expired processes are intended to be excluded from the aggregate calculation. (You can verify this from the code.) Marking dead processes has no effect on any other metric types, such as counters etc.
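Continuing the test above (a hypothetical check, not part of the original transcript), marking one of the test workers dead leaves its counter file in place:

import glob
prometheus_client.multiprocess.mark_process_dead(pids[0])  # only removes live-gauge files
print(glob.glob(f'counter_{pids[0]}.db'))                   # the counter file is still listed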

Finally, I performed some simple timing. In the first test, with over a hundred .db files but less than a dozen distinct metrics, prometheus_client.generate_latest(registry) only took 2 or 3 milliseconds. However, in the last test, with a single .db file containing 100k distinct metrics, it took 1.2 seconds.
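For anyone wanting to reproduce the timing, a rough measurement along these lines works (assuming the registry from the test above):

import timeit
# average seconds per merge-and-render of the multiprocess registry
print(timeit.timeit(lambda: prometheus_client.generate_latest(registry), number=10) / 10)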

In other words, if the prometheus client is using too much CPU time, the problem is most likely that your application is misconfigured to track too many separately-labelled metrics. Having dozens of .db files is a red herring (and would only become a problem if your application was constantly killing and respawning new worker processes instead of maintaining a fixed pool of worker processes).

thomasjohn9294 commented 6 months ago

https://prometheus.github.io/client_python/multiprocess/

I am not sure if this is the end of the story, or if there is a workaround to get the CPU/memory metrics.

Mohammad-YN commented 6 months ago

I've had this issue for a while, since I'm trying to monitor a Django project that uses Gunicorn, where the number of files in multiproc mode keeps increasing and might cause problems. I've read most of the proposed solutions I could find, and settled on the idea of using a set of limited, reusable UUIDs to keep the number of files from growing. I've tested this for my metrics, which only include Counters and Histograms, and it seems to be working, but since I'm not even remotely fluent in prometheus_client's or Gunicorn's source code, I wanted to check whether there might be any issues with my solution. It's naive and tailored to my project's specifics, but I'd be happy to use it instead of putting a lot of time and effort into a better but more complicated solution.

The idea is to use Gunicorn hooks to maintain a pool of UUIDs, assign each worker a UUID from that pool, expose the UUID to the Prometheus client via os.environ, and return the UUID to the pool when the worker exits. It works okay with my limited amount of testing, and only with Histograms and Counters (I haven't tested Gauges and Summaries yet), but I'd appreciate it if anyone could point out flaws or scenarios where something might go wrong.

I have this in my gunicorn config file:

import os
import uuid

class GunicornWorkerIDsPool:
    def __init__(self):
        self._reserved_IDs_pool = []

    def get_id(self):
        if len(self._reserved_IDs_pool) == 0:
            return str(uuid.uuid4())
        return self._reserved_IDs_pool.pop()

    def add_id(self, worker_id):
        self._reserved_IDs_pool.append(worker_id)

gunicorn_worker_ids_pool = GunicornWorkerIDsPool()

def pre_fork(server, worker):
    setattr(worker, 'worker_id', gunicorn_worker_ids_pool.get_id())

def post_fork(server, worker):
    os.environ['GUNICORN_WORKER_ID'] = str(getattr(worker, 'worker_id', worker.pid))  # env values must be strings

def child_exit(server, worker):
    worker_id = getattr(worker, 'worker_id', None)
    if worker_id is not None:
        gunicorn_worker_ids_pool.add_id(worker_id)

and then this piece of code in my prometheus settings:

import os

import prometheus_client.values

def get_worker_id():
    # fall back to the PID when not running under the Gunicorn hooks above
    return os.environ.get('GUNICORN_WORKER_ID', str(os.getpid()))

prometheus_client.values.ValueClass = prometheus_client.values.MultiProcessValue(
    process_identifier=get_worker_id)