Closed: benjimin closed this issue 2 years ago
Grafana is showing the pod count has been incrementing every couple of days, despite being almost completely idle. For example, around 3pm yesterday it increased from 6 pods to 7 (the number that are currently running). It had been reset to 4 (due to a manual config change which made 4 the new minimum replica count) mid last week.
All seven pods have log files brimming with the error (which appears to recur every 30 seconds, e.g. due to a probe). Note that the readiness probe targets the `getcapabilities` endpoint whilst the liveness probe targets the `ping` endpoint. (Kubectl does assert each of these pods is ready.)
```
[2021-10-19 00:47:29,090] ERROR in app: Exception on / [GET]
Traceback (most recent call last):
  File "/env/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/env/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/env/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/env/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/env/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/env/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/env/lib/python3.8/site-packages/prometheus_flask_exporter/__init__.py", line 701, in func
    return current_app.handle_user_exception(ex)
  File "/env/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/env/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/env/lib/python3.8/site-packages/prometheus_flask_exporter/__init__.py", line 699, in func
    raise exception
  File "/env/lib/python3.8/site-packages/prometheus_flask_exporter/__init__.py", line 657, in func
    response = current_app.handle_user_exception(ex)
  File "/env/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/env/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/env/lib/python3.8/site-packages/prometheus_flask_exporter/__init__.py", line 654, in func
    response = f(*args, **kwargs)
  File "/code/datacube_wps/impl.py", line 51, in wps
    service.append(Service(read_process_catalog('datacube-wps-config.yaml'), ['pywps.cfg']))
  File "/env/lib/python3.8/site-packages/pywps/app/Service.py", line 50, in __init__
    config.load_configuration(cfgfiles)
  File "/env/lib/python3.8/site-packages/pywps/configuration.py", line 70, in load_configuration
    CONFIG = configparser.ConfigParser(os.environ)
  File "/usr/lib/python3.8/configparser.py", line 639, in __init__
    self._read_defaults(defaults)
  File "/usr/lib/python3.8/configparser.py", line 1219, in _read_defaults
    self.read_dict({self.default_section: defaults})
  File "/usr/lib/python3.8/configparser.py", line 752, in read_dict
    raise DuplicateOptionError(section, key, source)
configparser.DuplicateOptionError: While reading from '<dict>': option 'prometheus_multiproc_dir' in section 'DEFAULT' already exists
[2021-10-19 00:47:59,090] ERROR in app: Exception on / [GET]
```
Potentially implicates prometheus (see #117).
When pods are restarted, the logs commence like:
```
[2021-10-19 01:05:40 +0000] [9] [INFO] Starting gunicorn 20.1.0
[2021-10-19 01:05:40 +0000] [9] [INFO] Listening at: http://0.0.0.0:8000 (9)
[2021-10-19 01:05:40 +0000] [9] [INFO] Using worker: sync
[2021-10-19 01:05:40 +0000] [11] [INFO] Booting worker with pid: 11
[2021-10-19 01:06:25,024] ERROR in app: Exception on / [GET]
```
Checking inside the running pods, it seems like the processes are running as normal:
```
USER  PID %CPU %MEM     VSZ    RSS TTY STAT START TIME COMMAND
wps     1  0.0  0.0    2620    608 ?   Ss   01:05 0:00 /bin/sh -c gunicorn -b 0.0.0.0:8000 datacube_wps:app
wps     9  0.0  0.0   26988  21900 ?   S    01:05 0:00 /env/bin/python /env/bin/gunicorn -b 0.0.0.0:8000 datacube_wps:app
wps    11  0.3  0.1 1144020 198148 ?   Sl   01:05 0:02 /env/bin/python /env/bin/gunicorn -b 0.0.0.0:8000 datacube_wps:app
```
Running curl from inside the pod container (targeting port 8000) still gives the same result pages (`ping` works, `getcapabilities` fails). So really, it seems like the readiness probe should be failing (but that isn't registering with Kubernetes)?
Just noticed that our deployment is configured to use a raw `curl` command in the probe, rather than a native kubernetes `httpGet`. This is potentially problematic because the `curl` command only exits with nonzero status if it is unable to connect to the host, not if the host responds with an error code (unless we specify `--fail`).
This may not fix the underlying problem, but it still helps to ensure the probes aren't ineffectual.
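For illustration, probes along these lines would let the kubelet judge the HTTP status code directly, instead of relying on curl's exit status (the paths and port here are assumptions based on the endpoints mentioned above, not our actual chart values):

```yaml
# Hypothetical httpGet probes replacing the exec'd curl command;
# kubernetes treats any response status >= 400 as a probe failure.
readinessProbe:
  httpGet:
    path: /?service=WPS&version=1.0.0&request=getcapabilities
    port: 8000
  periodSeconds: 30
livenessProbe:
  httpGet:
    path: /ping
    port: 8000
  periodSeconds: 30
```

With this shape, a 500 from `getcapabilities` would mark the pod unready without any dependence on curl flags.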
Still unclear on the root cause of the issue..
Checking `count(flask_wps_request_full_url_count)` in the prometheus UI, it seems doubtful there would have been enough distinct traffic for this to have overflowed yet (the `prometheus_multiproc_dir` being `/dev/shm`). The `prometheus_multiproc_dir` key/variable seems not to have been already configured in this repo, nor visible when `gunicorn --print-config datacube_wps:app` runs in the container by itself (separate from the kubernetes deployment).
The container entrypoint script does attempt to manipulate config such as `datacube-wps-config.yaml` and `pywps.cfg` (both names mentioned in the error log), but they seem not to contain anything relevant.

Running `gunicorn --check-config datacube_wps:app` in the deployed containers seems ok now (I thought this printed a replication of the error report earlier??).
Toggling the commenting-out of `prometheus_multiproc_dir` in the helm chart does seem to toggle whether running e.g. `curl '127.0.0.1:8000/wps/?service=WPS&version=1.0.0&request=getcapabilities'` in the deployed container will 500 (and reproduce the trace into the kubernetes log) or work fine (giving details of WPS processes).
Currently seeing a slightly altered form of the trace than before:
```
[2021-10-23 00:23:32,647] ERROR in app: Exception on / [GET]
Traceback (most recent call last):
  File "/env/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/env/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/env/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/env/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/env/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/env/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/code/datacube_wps/impl.py", line 50, in wps
    service.append(Service(read_process_catalog('datacube-wps-config.yaml'), ['pywps.cfg']))
  File "/env/lib/python3.8/site-packages/pywps/app/Service.py", line 50, in __init__
    config.load_configuration(cfgfiles)
  File "/env/lib/python3.8/site-packages/pywps/configuration.py", line 70, in load_configuration
    CONFIG = configparser.ConfigParser(os.environ)
  File "/usr/lib/python3.8/configparser.py", line 639, in __init__
    self._read_defaults(defaults)
  File "/usr/lib/python3.8/configparser.py", line 1219, in _read_defaults
    self.read_dict({self.default_section: defaults})
  File "/usr/lib/python3.8/configparser.py", line 752, in read_dict
    raise DuplicateOptionError(section, key, source)
configparser.DuplicateOptionError: While reading from '<dict>': option 'prometheus_multiproc_dir' in section 'DEFAULT' already exists
```
Not sure how to interpret the trace exactly..

Presumably gunicorn organises separate processes for the flask app to handle web requests. Depending on which resource is being requested, these may be dispatched to `datacube_wps.impl.create_app().wps`. This function starts out by ensuring a pywps `Service` is instantiated (with an appropriate config file), which prompts pywps to run `configparser.ConfigParser(os.environ)` to pre-populate a fresh config's "DEFAULT" section from a dictionary of environment variables. I'm struggling to replicate the failure of that line. The parser is supposed to create an empty set and iteratively add lowercased variable names to it, checking that none is encountered twice. You wouldn't expect a duplicate environment variable to even be possible, but apparently environments are just lists of strings (potentially leaving it up to applications to interpret them as key-value pairs and to decide how to handle any duplicate keys).
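In fact the duplicate needn't be literal: `ConfigParser` lowercases option names (via its default `optionxform`), so two distinct environment variables whose names differ only in case collide. A minimal sketch reproducing the error from the trace (the `/dev/shm` value is assumed, for illustration):

```python
import configparser

# Two distinct keys that both lowercase to 'prometheus_multiproc_dir'.
# ConfigParser applies optionxform (str.lower by default) to option names,
# so read_dict sees the same option twice and raises DuplicateOptionError.
env = {
    "prometheus_multiproc_dir": "/dev/shm",  # assumed value, for illustration
    "PROMETHEUS_MULTIPROC_DIR": "/dev/shm",
}
try:
    configparser.ConfigParser(env)
except configparser.DuplicateOptionError as exc:
    print(exc)
```

This matches the log message exactly: the source is `'<dict>'` because the defaults are funnelled through `read_dict`.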
I deployed an instrumented version of the image (from a different docker hub repo). The logs revealed that this package is imported by the gunicorn worker process, and that both variables `prometheus_multiproc_dir` and `PROMETHEUS_MULTIPROC_DIR` were identically set (by the time the package had finished importing).
The lowercase version was configured by the helm chart. I'm not sure why lowercase was originally specified (or advised), as this violates the convention (that applications interpret the environment as a list of key-value pairs using uppercase key names) and so contributes to the failure of PyWPS.
The uppercase duplication was set by `prometheus_client.multiprocess.MultiProcessCollector`, which is called by the prometheus flask exporter. It turns out this is supposed to be accompanied by a deprecation warning, but by default python filters out (ignores) deprecation warnings, so it isn't visible in our logs.
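That silencing is easy to demonstrate; a minimal sketch (the warning text here is a stand-in, not the exact message prometheus_client emits):

```python
import warnings

def emit():
    # Stand-in for the library code path that warns about the
    # lowercase variable name being deprecated.
    warnings.warn("prometheus_multiproc_dir is deprecated", DeprecationWarning)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore")  # DeprecationWarning suppressed, as in our logs
    emit()
    print(len(caught))               # → 0, nothing reaches the log

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # e.g. achievable via python -W always
    emit()
    print(len(caught))               # → 1, the warning is now visible
```

Running the worker with `-W always` (or an equivalent `PYTHONWARNINGS` setting) would have surfaced the hint much earlier.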
The fix is to use only the uppercase name.
At present, `wps/?service=WPS&request=DescribeProcess&version=1.0.0&Identifier=WOfSDrill` consistently returns 500 Internal Server Error ("Either the server is overloaded or there is an error in the application."). Thanks @GypsyBojangles for reporting. Similarly, `wps/?service=WPS&request=getcapabilities&version=1.0.0` gives the same 500 error. However, `wps/metrics` works (and looks vaguely sensible). Likewise, `wps/ping` claims the system is healthy. `kubectl` does not note any pod restarts.