korfuri / django-prometheus

Export Django monitoring metrics for Prometheus.io
Apache License 2.0

BUG: Port cannot be reused even when the process that opened it has been terminated. #337

Open Routhinator opened 1 year ago

Routhinator commented 1 year ago

I've been fighting a problem with my graphs from this module because of strange decimal values for the number of inserts/deletes/updates of models. After digging into it more, I observed that the counters for these metrics randomly drop to 0 and then return to their original values. There are no restarts of the application. For example, if 2 users were created since the last restart of the app, simply refreshing the /metrics/ endpoint shows a count of 2, then suddenly 0 for up to a minute, then 2 again.

This results in highly unreliable metrics. I'm uncertain where the problem could lie.

I am using the ExportModelOperationsMixin in all my models, as per the docs:

Example:

class Member(ExportModelOperationsMixin('member'), AbstractBaseUser, PermissionsMixin,
             RulesModelMixin, TimestampedModel, metaclass=RulesModelBase):
    """
    Main Member model
    """

And I am using the django_prometheus.db.backends.postgresql engine.

Versions:

django_prometheus: 2.2
Django: 4.0.8
Python: 3.9

=========

Update

The remaining problem with this implementation is outlined in https://github.com/korfuri/django-prometheus/issues/337#issuecomment-1279839293. Once a port has been opened, it cannot be reused until the host is rebooted, even after the container running it has been killed and reaped.

Routhinator commented 1 year ago

So, this seems to be related to #325 and is resolved by setting the PROMETHEUS_MULTIPROC_DIR environment variable.
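
For anyone landing here with the same symptom, here is a minimal sketch of why the counts could flap. It assumes the usual prometheus_client behaviour that each worker process keeps its own in-memory counters unless multiprocess mode is enabled. The `Worker` class and the `scrape_*` helpers are hypothetical stand-ins, not django-prometheus code.

```python
import random


class Worker:
    """Hypothetical stand-in for one app worker process with its own counter."""

    def __init__(self):
        self.model_inserts = 0


def scrape_one(workers, rng):
    # Without a shared store, a scrape only sees the worker that answers it.
    return rng.choice(workers).model_inserts


def scrape_aggregated(workers):
    # Multiprocess mode conceptually combines the per-process values.
    return sum(w.model_inserts for w in workers)


workers = [Worker(), Worker()]
workers[0].model_inserts += 2  # both inserts happened to hit the first worker

rng = random.Random(0)
seen = {scrape_one(workers, rng) for _ in range(50)}
print(sorted(seen))                # distinct values observed across scrapes
print(scrape_aggregated(workers))  # value once per-process counts are combined
```

Averaging scrapes that alternate between the two per-worker values would also explain fractional numbers on a graph.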

A couple of points here:

Routhinator commented 1 year ago

I spoke too soon. PROMETHEUS_MULTIPROC_DIR does not seem to solve the issue.

Routhinator commented 1 year ago

Ruling out other things: I am using Gunicorn, not uWSGI, so the lazy-apps behaviour that has to be enabled as a setting in uWSGI is already the default behaviour in Gunicorn.

Routhinator commented 1 year ago

Ok, I managed to sort this out by switching to Gunicorn/Gevent instead of Gunicorn/Gthread and dropping to one worker per container, as well as defining PROMETHEUS_MULTIPROC_DIR. Metrics are at least stable now.

I also need the PROMETHEUS_METRICS_EXPORT_PORT_RANGE = range(8001, 8002) setting, as I need the metrics export on a dedicated thread in order to be reliable. The metrics work without it; however, if the threads are all tied up, the endpoint stops answering, as alluded to in the docs.
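
The port-range idea can be pictured roughly as follows: try each port in the range and keep the first one that binds. This is a standard-library sketch, not django-prometheus's actual implementation, and `bind_first_free` is a hypothetical helper.

```python
import socket


def bind_first_free(port_range, addr="127.0.0.1"):
    """Return (listening_socket, port) for the first bindable port, else (None, None)."""
    for port in port_range:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((addr, port))
        except OSError:
            # Port is unavailable; release the socket and try the next one.
            sock.close()
            continue
        sock.listen()
        return sock, port
    return None, None
```

Note that `range(8001, 8002)` contains only port 8001, so with that value there is no second port to fall back to if 8001 cannot be bound.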

I am having one problem with this, though. Using Docker, once a port has been opened, it does not seem possible to reuse it until the host is rebooted. After a container has used it, stopping, restarting, or deleting that container does not free the port. The port is not actually in use; rather, this seems to be related to the same file descriptor being reused over and over, as the Python HTTP client that prometheus_client uses is not cleaning up the FD or marking it reusable, as mentioned in comments on https://github.com/prometheus/client_python/issues/155
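
To separate a lingering socket state from a port that is genuinely still held, one can check from Python whether the port binds at all, and whether it binds only with `SO_REUSEADDR` set. This is a diagnostic sketch using only the standard library; `can_bind` is a hypothetical helper, and it does not show what prometheus_client itself does.

```python
import socket


def can_bind(port, addr="0.0.0.0", reuse_addr=False):
    """Report whether a TCP socket can currently bind to (addr, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        if reuse_addr:
            # Must be set before bind(); lets bind succeed over TIME_WAIT remnants.
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((addr, port))
        except OSError:
            return False
        return True
```

On Linux, if `can_bind(8001)` and `can_bind(8001, reuse_addr=True)` both return False, some socket is most likely still bound to the port. If only the second call succeeds, the port is more likely stuck behind connections in TIME_WAIT.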

Unfortunately this is making it nigh impossible to nail down this Django module and ensure production readiness.

Routhinator commented 1 year ago

This behaviour seems to be related to https://peps.python.org/pep-0446/#non-inheritable-file-descriptors
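
A small standard-library check of what PEP 446 changed, as a sketch: descriptors created by Python 3.4+ are non-inheritable unless the flag is set explicitly. Whether this is actually the cause of the port problem above is not established here.

```python
import os
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    # Since Python 3.4 (PEP 446), new descriptors are non-inheritable by default.
    default_flag = sock.get_inheritable()
    # Inheritance by child processes has to be requested explicitly.
    sock.set_inheritable(True)
    explicit_flag = sock.get_inheritable()
    # os.get_inheritable reads the same flag from the raw descriptor.
    raw_flag = os.get_inheritable(sock.fileno())

print(default_flag, explicit_flag, raw_flag)
```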