korfuri / django-prometheus

Export Django monitoring metrics for Prometheus.io
Apache License 2.0

PrometheusEndpointServer throws an exception, after which the endpoint becomes unavailable and does not restart #415

Open NitroLine opened 1 year ago

NitroLine commented 1 year ago

I run a Django app on uvicorn. I set PROMETHEUS_METRICS_EXPORT_PORT_RANGE=range(8001, 8011) to expose a metrics endpoint from each uvicorn worker. This works fine.
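For reference, a minimal sketch of how that setting might look in settings.py. Only the setting name and the range come from my setup; the comments are my understanding of the behaviour, not a quote from the django-prometheus docs.

```python
# settings.py (fragment)
# range(8001, 8011) covers ports 8001 through 8010 inclusive, so there is
# room for up to ten worker processes. As far as I understand, each worker
# binds the first port in this range that is still free.
PROMETHEUS_METRICS_EXPORT_PORT_RANGE = range(8001, 8011)
```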

But after a network error on the server, some workers print an exception:

Exception occurred during processing of request from ('106.75.72.22', 52046)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/local/lib/python3.10/socketserver.py", line 347, in process_request
    self.finish_request(request, client_address)
  File "/usr/local/lib/python3.10/socketserver.py", line 360, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/local/lib/python3.10/socketserver.py", line 747, in __init__
    self.handle()
  File "/usr/local/lib/python3.10/http/server.py", line 433, in handle
    self.handle_one_request()
  File "/usr/local/lib/python3.10/http/server.py", line 421, in handle_one_request
    method()
  File "/usr/local/lib/python3.10/site-packages/prometheus_client/exposition.py", line 276, in do_GET
    self.wfile.write(output)
  File "/usr/local/lib/python3.10/socketserver.py", line 826, in write
    self._sock.sendall(b)
BrokenPipeError: [Errno 32] Broken pipe

Or

Exception occurred during processing of request from ('162.142.125.223', 51380)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/local/lib/python3.10/socketserver.py", line 347, in process_request
    self.finish_request(request, client_address)
  File "/usr/local/lib/python3.10/socketserver.py", line 360, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/local/lib/python3.10/socketserver.py", line 747, in __init__
    self.handle()
  File "/usr/local/lib/python3.10/http/server.py", line 433, in handle
    self.handle_one_request()
  File "/usr/local/lib/python3.10/http/server.py", line 421, in handle_one_request
    method()
  File "/usr/local/lib/python3.10/site-packages/prometheus_client/exposition.py", line 276, in do_GET
    self.wfile.write(output)
  File "/usr/local/lib/python3.10/socketserver.py", line 826, in write
    self._sock.sendall(b)
ConnectionResetError: [Errno 104] Connection reset by peer

After that, some targets in Prometheus start reporting the error "context deadline exceeded". (I saw this traceback four times in the logs, and four targets are now down.)

So I think the PrometheusEndpointServer process has crashed and does not restart, and I'm losing some metrics because of that. It would be great if the exporter server restarted automatically when it becomes unavailable.
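One caveat on the diagnosis: the traceback by itself may not mean the server died. The "Exception occurred during processing of request" line is what Python's socketserver prints from its default handle_error, after which the serve loop normally moves on to the next request. The sketch below is a standalone illustration using only the standard library, not django-prometheus code. FlakyHandler and probe are hypothetical names, and the handler raises BrokenPipeError on purpose for the first request. It uses a single-threaded HTTPServer on the assumption that this matches the call path in the tracebacks above.

```python
import http.server
import threading
import urllib.request


class FlakyHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical handler: fails on the first GET, then serves normally."""

    failed_once = False

    def do_GET(self):
        if not FlakyHandler.failed_once:
            FlakyHandler.failed_once = True
            # Simulate the client going away while the response is written.
            raise BrokenPipeError(32, "Broken pipe")
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-request access logging to keep the output short.
        pass


def probe(port):
    """Return the HTTP status for a GET on the port, or None on any failure."""
    try:
        url = f"http://127.0.0.1:{port}/"
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except Exception:
        return None


# Port 0 lets the OS pick a free port for this demonstration.
server = http.server.HTTPServer(("127.0.0.1", 0), FlakyHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

first = probe(port)   # handler raises; socketserver prints a traceback to stderr
second = probe(port)  # the same server instance answers the next request

server.shutdown()
server.server_close()
print(first, second)
```

If the endpoint on a real worker stays unreachable after such a traceback, the cause may be something other than the exception itself, for example a client holding a connection open on a single-threaded server. That is a guess, not something the logs above confirm.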