prometheus / client_python

Prometheus instrumentation library for Python applications
Apache License 2.0

Exception in `mmap_dict.py` with multiprocess in 0.8.0 #599

Closed: claudinoac closed this issue 3 years ago

claudinoac commented 3 years ago

The issue

A UnicodeDecodeError appears after a few days of running a production server (using multiprocess mode).

Traceback (most recent call last):

  File "/usr/local/venv/.../lib64/python3.6/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/usr/local/venv/.../lib64/python3.6/site-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/local/venv/.../lib64/python3.6/site-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/venv/.../lib64/python3.6/site-packages/django_prometheus/exports.py", line 125, in ExportToDjangoView
    metrics_page = prometheus_client.generate_latest(registry)
  File "/usr/local/venv/.../lib64/python3.6/site-packages/prometheus_client/exposition.py", line 106, in generate_latest
    for metric in registry.collect():
  File "/usr/local/venv/.../lib64/python3.6/site-packages/prometheus_client/registry.py", line 82, in collect
    for metric in collector.collect():
  File "/usr/local/venv/.../lib64/python3.6/site-packages/prometheus_client/multiprocess.py", line 149, in collect
    return self.merge(files, accumulate=True)
  File "/usr/local/venv/.../lib64/python3.6/site-packages/prometheus_client/multiprocess.py", line 41, in merge
    metrics = MultiProcessCollector._read_metrics(files)
  File "/usr/local/venv/.../lib64/python3.6/site-packages/prometheus_client/multiprocess.py", line 69, in _read_metrics
    for key, value, pos in file_values:
  File "/usr/local/venv/.../lib64/python3.6/site-packages/prometheus_client/mmap_dict.py", line 44, in _read_all_values
    yield encoded_key.decode('utf-8'), value, pos
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 156: invalid start byte

The configuration

What happens

The server throws a UnicodeDecodeError after a few days of running. The load is minimal (below 300 req/min). The server won't start again without the multiprocess directory being cleaned first; restarting uWSGI just reproduces the same error. This server also exports some metrics from crons, using the same file prefix for all crons (could this be the problem?)

At first, these errors were only happening after the VM was forcibly restarted.

I found a similar issue (https://github.com/prometheus/client_python/issues/357), but I don't know if it's the same problem, since that one was fixed in earlier versions.
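For readers hitting the same wall: once a file in the multiprocess directory is corrupted, the collector can no longer parse it, so the usual recovery is to wipe the directory before the master process boots. A minimal sketch, assuming a helper run once at startup (the helper name and placement are assumptions, not part of this thread):

```python
import glob
import os


def clean_multiproc_dir(path):
    """Hypothetical helper: wipe leftover per-process metric files.

    Run once, at master-process startup, BEFORE any worker or cron
    begins writing metrics; deleting files while writers are live
    would itself corrupt data.
    """
    for f in glob.glob(os.path.join(path, "*.db")):
        os.remove(f)
```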

brian-brazil commented 3 years ago

I need a bit more information: this appears to be data corruption, but you haven't given me much to work with beyond the fact that you use multiprocess mode. For example, does your code fork?

claudinoac commented 3 years ago

I've checked: we're using only one process and one thread for the server, but the crons run in separate processes. And no, there are no forks in the code. We're serving the metrics through an endpoint within the Django application (/metrics).

Another thing I've observed is that the cron runs sometimes overlap each other (because of the amount of data being processed). Since the crons are using the same files (..._cron.db), could this be the origin of the corruption? Should I use a different prefix for each cron (like the PID)?
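The suspicion above can be illustrated with a toy model: each mmap-backed writer tracks its own write offset in memory, so two processes sharing one file never see each other's writes. This is a simplification for illustration, not the actual mmap_dict code:

```python
# Toy model: two writers share one buffer, each with a private offset.
buf = bytearray(32)


def write_entry(used, data):
    # Writes at this process's own notion of "end of used space".
    buf[used:used + len(data)] = data
    return used + len(data)


writer_a_used = 8  # both processes start from the same offset
writer_b_used = 8

writer_a_used = write_entry(writer_a_used, b"key1=1.0")
# Writer B never observed A's write, so it clobbers the same bytes:
writer_b_used = write_entry(writer_b_used, b"k2=2")

assert bytes(buf[8:16]) == b"k2=2=1.0"  # A's entry is now mangled
```

A reader that later tries to decode the region written by A finds a mix of both payloads, which is exactly the kind of garbage that surfaces as a UnicodeDecodeError.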

brian-brazil commented 3 years ago

> Since the crons are using the same files (..._cron.db), can this be the origin of this corruption? Should I use different prefixes for each cron (like the PID)?

Can you explain more about what this is? This sounds like you've broken things by poking around internals, as that is not a filename this code can produce out of the box.

claudinoac commented 3 years ago

I followed this -> https://github.com/korfuri/django-prometheus/blob/2.0.0/documentation/exports.md#exporting-metrics-in-a-wsgi-application-with-multiple-processes-globally to avoid accumulating lots of file descriptors: one set per uWSGI process and one for the crons.

brian-brazil commented 3 years ago

I'm not deeply familiar with WSGI and cron, what's the worker id in that case?

claudinoac commented 3 years ago

Each worker in uWSGI is a process spawned by the master process. The worker id is the id uWSGI assigns to its child processes (0, 1, 2 if it has three processes/workers). The crons I'm referring to are Django custom commands or batch jobs (python manage.py <command>) scheduled through crontab to run periodically (every 4 hours, I guess). These batch jobs have their own metrics, and I configured them to use file descriptors with the prefix "cron" instead of the PID of the running cron (gauge_all_cron.db, histogram_all_cron.db, ...). Since each cron gets a new PID, with the default configuration I would accumulate lots of file descriptors in a short time (gauge_all_<PID>, ...). Does this make sense?

brian-brazil commented 3 years ago

Two processes can't safely share an ID, so there's your problem.

claudinoac commented 3 years ago

I started using the PID of the crons as the prefix for the files, so the processes no longer share any db files. The application has been stable since then, so I guess we can close the issue. Thanks!
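The fix described here can be sketched with the client's own hook, assuming prometheus_client 0.7+ where `values.MultiProcessValue(process_identifier=...)` exists; the temp directory is only there to keep the sketch self-contained, and reverting to a per-PID identifier is the library default:

```python
import os
import tempfile

# The multiprocess directory must be configured before prometheus_client
# is imported; a temp dir here keeps the sketch self-contained.
mp_dir = tempfile.mkdtemp()
os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", mp_dir)
os.environ.setdefault("prometheus_multiproc_dir", mp_dir)

from prometheus_client import values

# One file set per PID (also the library default). A constant
# identifier such as `lambda: "cron"` is what let concurrent crons
# write into the same ..._cron.db files and corrupt them.
values.ValueClass = values.MultiProcessValue(process_identifier=os.getpid)
```

With this in place, each cron process writes to its own `*_<PID>.db` files, at the cost of needing periodic cleanup of stale files from dead PIDs.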