korfuri / django-prometheus

Export Django monitoring metrics for Prometheus.io
Apache License 2.0

uwsgi sample configuration #12

Closed c-castillo closed 4 years ago

c-castillo commented 8 years ago

I was looking for a configuration to run with nginx + uwsgi.

The only thing you need to do to make this work is add the following line to uwsgi.ini: enable-threads=True. This enables the threads spawned by the app under uwsgi.

BUT, when I go to the expression browser or PromDash, it doesn't report anything from the app; it seems to be instrumenting nothing.

audax commented 8 years ago

I've got the same problem. If I run the application in uwsgi with a single process, the /metrics export via urls.py works just fine. With more than one process, I get the usual problem of only ever hitting one worker.

If I run the exporter in an extra thread, it just doesn't collect any metrics. The "solution" for now is to run multiple uwsgi processes, each with a single Django worker. That is not nice.

korfuri commented 8 years ago

For WSGI deployments (uwsgi, gunicorn, and others), exporting via urls.py doesn't work. It may work if you use a patched version of the Prometheus client (see the multiproc branch: https://github.com/prometheus/client_python/tree/multiproc).

The simplest way is to configure the port-based exporter, so that each worker process exports on a different port. See https://github.com/korfuri/django-prometheus/blob/master/documentation/exports.md#exporting-metrics-in-a-wsgi-application-with-multiple-processes for how to do that. If you have N workers, you should have at least N ports in the range (you can have more). Then you need to configure each worker as a separate target in Prometheus (http://prometheus.io/docs/operating/configuration/#) and use the rules language (http://prometheus.io/docs/querying/rules/) to aggregate data from multiple workers together.
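For reference, the port-range exporter from that document is configured in Django settings. This is only a minimal sketch based on that page (check the linked doc for the exact setting names in your version); the range just needs to cover at least as many ports as you have workers:

# settings.py - sketch of the per-worker port exporter described in
# documentation/exports.md; each worker binds a free port in the range
# and serves its own /metrics there.
INSTALLED_APPS = [
    # ... your other apps ...
    'django_prometheus',
]

PROMETHEUS_METRICS_EXPORT_PORT_RANGE = range(8001, 8050)

Prometheus then scrapes one target per port, and recording rules can sum the per-worker series back together.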

audax commented 8 years ago

This is my uwsgi config which finally works:

[uwsgi]
processes = 4
master = true

# needed so that the processes are forked _before_ the exporter starts
lazy = true
# enable-threads as mentioned in the docs
enable-threads = true

# and the usual fluff
die-on-term = true
plugins = python3
home = <home>
chdir = <dir>
module = <module>
env = DJANGO_SETTINGS_MODULE=<settings>
socket = /tmp/django-foobar-uwsgi.sock
vacuum = true

The missing piece in the docs was the 'lazy' option, which apparently fixes my problem. Without it, only one exporter was started, and it only exported nonsense.

analytik commented 8 years ago

So, if I have uWSGI configured to use 8 processes and 8 threads, I need to expose metrics on 64 different ports? o_O

Even if I have to do this per process*thread, I would prefer to just use the uWSGI cache, aggregate the metrics in a separate uWSGI vassal, and expose them together on one address. So every Django thread would just need to do the Python equivalent of something like

setTimeout(function() { set_cache(pid_and_thread_number, exported_stats) }, 10000);
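A rough Python equivalent of that idea might look like the sketch below. This is only a sketch: it assumes uWSGI's cache framework is enabled (e.g. a cache2 = name=stats,items=100 line in the ini), and the key naming and the exported_stats() helper are made up for illustration.

import json
import os
import threading

import uwsgi  # only importable when the code runs under uWSGI


def exported_stats():
    # placeholder for whatever per-worker counters you actually track
    return {'requests_total': 0}


def publish_stats():
    # one cache key per process/thread, roughly pid_and_thread_number from above
    key = 'stats-%d-%d' % (os.getpid(), threading.get_ident())
    uwsgi.cache_update(key, json.dumps(exported_stats()).encode(), 0, 'stats')
    # re-arm the timer, like the setTimeout above
    threading.Timer(10.0, publish_stats).start()


threading.Timer(10.0, publish_stats).start()

The separate vassal would then iterate over those cache keys and serve the merged result on a single /metrics endpoint.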

Or am I going in a completely wrong direction? Are you people really scraping several ports per server to get accurate metrics?

audax commented 8 years ago

Yes, we do and yes, we will change it someday to let them aggregate their stats into one vassal per host.

analytik commented 8 years ago

OK, maybe this will help someone, here's what I did.

I'll be happy to provide more details if it would help anyone; it's just that the code isn't tidied up.

geor-g commented 7 years ago

@analytik I'm really interested in this. Could you please share some more details / code? Thanks!

analytik commented 7 years ago

@ge-fa - sure.

First, we run uWSGI with uwsgi --enable-threads --emperor /foo/bar/emperor/$ENV --disable-logging - we keep slightly different configurations for dev vs stage vs prod.

In each emperor/env folder, we keep two ini files - one for the app itself:

[uwsgi]
chdir           = /foo/bar
module          = wsgi
pidfile         = /tmp/uwsgi.pid
master          = true
http-socket     = 0.0.0.0:80
vacuum          = true
enable-threads  = true
processes       = 2
lazy            = false
threads         = 4
post-buffering  = true
harakiri        = 30
max-requests    = 5000
buffer-size     = 65535
stats           = 127.0.0.1:1717
stats-http      = true

and one for the metrics service:

[uwsgi]
chdir           = /foo/bar
module          = metrics:api
pidfile         = /tmp/uwsgi-metrics.pid
http-socket     = 0.0.0.0:9090
vacuum          = true
enable-threads  = true
threads         = 1
processes       = 3
post-buffering  = true
harakiri        = 10
max-requests    = 10
buffer-size     = 65535
disable-logging = true

These can be adjusted of course, but do not turn on lazy mode! The app will start leaking memory horribly. Now you serve on three ports: 80 for Django, 1717 for uWSGI metrics, and 9090 for Prometheus.

Now metrics.py should contain a simple app; in this case, a small Falcon app like this:

import falcon
from prometheus_client import generate_latest
from prometheus_client.core import REGISTRY

# custom_metrics is just a dictionary of optional business or other app metrics; it can be empty
from your_app.metrics import metrics as custom_metrics
from prometheus_django_redis import metrics as django_metrics
from prometheus_django_utils import process_redis_stuff, startup_prometheus

class MetricsResource(object):
    def on_get(self, req, resp):
        process_redis_stuff(django_metrics)
        process_redis_stuff(custom_metrics)
        resp.content_type = 'text/plain'
        resp.status = falcon.HTTP_200
        resp.body = generate_latest(REGISTRY)

api = startup_prometheus(MetricsResource, HealthzResource) # I omitted HealthzResource here

Now, the functionality in prometheus_django_redis is a bit hacky. I'm not sure if I can share the whole code, but the gist of it is this:

import time
from pickle import dumps

import redis
from prometheus_client import Gauge, Histogram

r = redis.Redis()

metrics = {
    'requests_total': Gauge(
        'django_http_requests_before_middlewares_total',
        'Total count of requests before middlewares run.'),
# many others
}

def get_time():
    return time.time()

def time_since(t):
    return get_time() - t

def incr_with_labels(metric, labels, amount=1):
    r.hincrby(metric, dumps(labels), amount)

# and then the middleware itself
class PrometheusBeforeMiddleware(object):
    """Monitoring middleware that should run before other middlewares."""

    def process_request(self, request):
        r.incr('requests_total')
        request.prometheus_before_middleware_event = get_time()

    def process_response(self, request, response):
        r.incr('responses_total')
        if hasattr(request, 'prometheus_before_middleware_event'):
            r.rpush('requests_latency_before', time_since(request.prometheus_before_middleware_event))
        else:
            r.incr('requests_unknown_latency_before')
        return response
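For completeness, this middleware would be wired into Django settings like the stock django-prometheus middleware, with the "before" middleware listed first. A sketch, assuming the module is importable as prometheus_django_redis (Django of this era still used MIDDLEWARE_CLASSES):

# settings.py - sketch; only the "before" middleware is shown because that is
# all the gist above defines; an "after" counterpart would go last in the list.
MIDDLEWARE_CLASSES = (
    'prometheus_django_redis.PrometheusBeforeMiddleware',
    # ... the rest of your middleware ...
)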

And then the rules for writing to Redis instead of directly to Prometheus mirror the reader: plain Gauges are written with incr, labelled Gauges with hincrby, and Histogram observations are pushed onto a list with rpush (as the middleware above does).

To read them back, have some utility file like this:


import logging
import traceback
from collections import defaultdict
from pickle import loads

import falcon
import redis
import requests_unixsocket
from prometheus_client.core import GaugeMetricFamily, REGISTRY

r = redis.Redis()
session = requests_unixsocket.Session()
PREFIX = "uwsgi"
EXCLUDE_FIELDS = {"pid", "uid", "cwd", "vars"}
LABEL_VALUE_FIELDS = {"id", "name"}

def object_to_prometheus(prefix, stats_dict, labels, label_name=None):
    label_value = next((stats_dict[field] for field in LABEL_VALUE_FIELDS if field in stats_dict), None)
    if label_name is not None and label_value is not None:
        label_name = label_name.rstrip("s")
        labels = labels + [(label_name, str(label_value))]

    for name, value in stats_dict.items():
        name = name.replace(" ", "_")
        if name.isupper() or name in EXCLUDE_FIELDS:
            # Upper-case names are request vars; no need to save them.
            continue
        if isinstance(value, list):
            yield from list_to_prometheus("{}_{}".format(prefix, name), value, labels, name)
        elif name not in LABEL_VALUE_FIELDS and isinstance(value, (int, float)):
            yield "{}_{}".format(prefix, name), sorted(labels), value

def list_to_prometheus(prefix, stats_list, labels, label_name):
    for stats in stats_list:
        yield from object_to_prometheus(prefix, stats, labels, label_name)

def build_prometheus_stats(stats_addr):
    uwsgi_stats = get_stats(stats_addr)
    stats = object_to_prometheus(PREFIX, uwsgi_stats, [])
    grouped_stats = defaultdict(list)
    # Need to group all values by name, otherwise Prometheus does not accept them
    for metric_name, labels, value in stats:
        grouped_stats[metric_name].append((labels, value))
    for metric_name, stats in grouped_stats.items():
        label_names = [name for name, _ in stats[0][0]]
        g = GaugeMetricFamily(metric_name, "", labels=label_names)
        for labels, value in stats:
            g.add_metric([value for _, value in labels], value)
        yield g

def get_stats_collector(stats_getter):
    class StatsCollector:
        def collect(self):
            yield from stats_getter()
    return StatsCollector()

def get_stats(stats_addr):
    resp = session.get(stats_addr)
    resp.raise_for_status()
    return resp.json()

def handle_error(e, req, resp, params):
    logging.error(traceback.format_exc())
    try:
        raise e
    except falcon.HTTPError:
        raise e
    except Exception:
        raise falcon.HTTPInternalServerError('Internal Server Error', str(e))

class PongResource(object):
    def on_get(self, req, resp):
        resp.status = falcon.HTTP_200
        resp.content_type = 'text/plain'
        resp.body = 'PONG'

def startup_prometheus(MetricsResource, HealthzResource,
                       stats_address="http://127.0.0.1:1717"):
    REGISTRY.register(get_stats_collector(lambda: build_prometheus_stats(stats_address)))
    api = falcon.API()
    api.add_error_handler(Exception, handler=handle_error)
    api.add_route('/metrics', MetricsResource())
    api.add_route('/healthz/ping', PongResource())
    api.add_route('/healthz/', HealthzResource())
    return api

def process_redis_stuff(metrics):
    """ Read metrics saved by several processes/threads in Redis, and turn them into Prometheus metrics

    if type is Gauge, read and set
    if Gauge with labels, hgetall and set
    if Histogram, read and empty the list, observe values one by one
    """
    for (metric_name, metric) in metrics.items():
        metric_type = type(metric).__name__
        # logging.debug('Investigating metric %s typed %s' % (metric_name, metric_type))
        if metric_type == 'Gauge':
            value = r.get(metric_name) or 0
            # logging.debug('Setting %s to %s' % (metric_name, value))
            metric.set(value)
        elif metric_type == '_LabelWrapper':
            # for simplicity, assume all labeled classes are Gauge - to change, check _wrappedClass
            labels_and_values = r.hgetall(metric_name)
            for (labels, value) in labels_and_values.items():
                value = float(value)
                clean_labels = {}
                for (lab, val) in loads(labels).items():
                    lab = type(lab) == bytes and lab.decode('utf-8') or lab
                    val = type(val) == bytes and val.decode('utf-8') or val
                    clean_labels[lab] = val
                # logging.debug('Setting %s to %s with labels %s' % (metric_name, value, clean_labels))
                metric.labels(clean_labels).set(value)
        elif metric_type == 'Histogram':
            # get all values in the list (Array)
            values = r.lrange(metric_name, 0, -1)
            # cut those values out from Redis
            r.ltrim(metric_name, len(values), -1)
            # logging.debug('Observing %s values for %s' % (len(values), metric_name))
            for val in values:
                metric.observe(float(val))

See? Simple!

Except... not at all. I mean, I'm sure there are better ways to do it, but I did whatever butchered way was easy enough to develop and deliver.

In other news, I am incredibly happy to develop in Node.js, where asynchronous programming is a breeze, I can start any number of HTTP servers in a few lines, and I don't need nasty multi-threading / multiprocessing that eats gigabytes of memory to achieve all that. (That said, of course Python has its uses, but I no longer feel like HTTP servers should be one of them, at least not unless you do something special like stackless/httptools/uvloop.)

Hope it helps!

EDIT: I should also note that we run each instance as a Docker container / Kubernetes pod, so there isn't any problem with allocating the same ports for many different applications. Redis also runs locally in the pod, started simply with redis &, which I know is barbaric, but it has worked reliably so far.

asherf commented 4 years ago

Closing the issue due to inactivity. Feel free to reopen if needed.