Closed: benjimin closed this issue 2 years ago
Grafana is showing the pod count has been incrementing every couple of days, despite being almost completely idle. For example, around 3pm yesterday it increased from 6 pods to 7 (the number that are currently running). It had been reset to 4 (due to a manual config change which made 4 the new minimum replica count) mid last week.
All seven pods have log files brimming with the error (which appears to recur every 30 seconds, e.g. due to a probe). Note that the readiness probe targets the `getcapabilities` endpoint whilst the liveness probe targets the `ping` endpoint. (Kubectl does assert each of these pods is ready.)
```
[2021-10-19 00:47:29,090] ERROR in app: Exception on / [GET]
Traceback (most recent call last):
  File "/env/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/env/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/env/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/env/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/env/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/env/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/env/lib/python3.8/site-packages/prometheus_flask_exporter/__init__.py", line 701, in func
    return current_app.handle_user_exception(ex)
  File "/env/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/env/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/env/lib/python3.8/site-packages/prometheus_flask_exporter/__init__.py", line 699, in func
    raise exception
  File "/env/lib/python3.8/site-packages/prometheus_flask_exporter/__init__.py", line 657, in func
    response = current_app.handle_user_exception(ex)
  File "/env/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/env/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/env/lib/python3.8/site-packages/prometheus_flask_exporter/__init__.py", line 654, in func
    response = f(*args, **kwargs)
  File "/code/datacube_wps/impl.py", line 51, in wps
    service.append(Service(read_process_catalog('datacube-wps-config.yaml'), ['pywps.cfg']))
  File "/env/lib/python3.8/site-packages/pywps/app/Service.py", line 50, in __init__
    config.load_configuration(cfgfiles)
  File "/env/lib/python3.8/site-packages/pywps/configuration.py", line 70, in load_configuration
    CONFIG = configparser.ConfigParser(os.environ)
  File "/usr/lib/python3.8/configparser.py", line 639, in __init__
    self._read_defaults(defaults)
  File "/usr/lib/python3.8/configparser.py", line 1219, in _read_defaults
    self.read_dict({self.default_section: defaults})
  File "/usr/lib/python3.8/configparser.py", line 752, in read_dict
    raise DuplicateOptionError(section, key, source)
configparser.DuplicateOptionError: While reading from '<dict>': option 'prometheus_multiproc_dir' in section 'DEFAULT' already exists
[2021-10-19 00:47:59,090] ERROR in app: Exception on / [GET]
```
Potentially implicates prometheus (see #117).
When pods are restarted, the logs commence like:
```
[2021-10-19 01:05:40 +0000] [9] [INFO] Starting gunicorn 20.1.0
[2021-10-19 01:05:40 +0000] [9] [INFO] Listening at: http://0.0.0.0:8000 (9)
[2021-10-19 01:05:40 +0000] [9] [INFO] Using worker: sync
[2021-10-19 01:05:40 +0000] [11] [INFO] Booting worker with pid: 11
[2021-10-19 01:06:25,024] ERROR in app: Exception on / [GET]
```
Checking inside the running pods, it seems like the processes are running as normal:
```
USER  PID %CPU %MEM     VSZ    RSS TTY STAT START TIME COMMAND
wps     1  0.0  0.0    2620    608 ?   Ss   01:05 0:00 /bin/sh -c gunicorn -b 0.0.0.0:8000 datacube_wps:app
wps     9  0.0  0.0   26988  21900 ?   S    01:05 0:00 /env/bin/python /env/bin/gunicorn -b 0.0.0.0:8000 datacube_wps:app
wps    11  0.3  0.1 1144020 198148 ?   Sl   01:05 0:02 /env/bin/python /env/bin/gunicorn -b 0.0.0.0:8000 datacube_wps:app
```
Running curl from inside the pod container (targeting port 8000) still gives the same result pages (`ping` works, `getcapabilities` fails). So really, it seems like the readiness probe should be failing (but that isn't registering with Kubernetes)?
Just noticed that our deployment is configured to use a raw `curl` command in the probe, rather than a native kubernetes `httpGet`. This is potentially problematic because the `curl` command only exits with nonzero status if it is unable to connect to the host, not if the host responds with an error code (unless we specify `--fail`).
This may not fix the underlying problem, but it still helps to ensure the probes aren't ineffectual.
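For illustration, probes along these lines would let the kubelet judge the HTTP status code directly, instead of relying on curl's exit status (the paths and port here are assumptions based on the endpoints mentioned above, not our actual chart values):

```yaml
# Hypothetical httpGet probes replacing the exec'd curl command;
# kubernetes treats any response status >= 400 as a probe failure.
readinessProbe:
  httpGet:
    path: /?service=WPS&version=1.0.0&request=getcapabilities
    port: 8000
  periodSeconds: 30
livenessProbe:
  httpGet:
    path: /ping
    port: 8000
  periodSeconds: 30
```

With this shape, a 500 from `getcapabilities` would mark the pod unready without any dependence on curl flags.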
Still unclear on the root cause of the issue..
Checking `count(flask_wps_request_full_url_count)` in the prometheus UI, it seems doubtful there would have been enough distinct traffic for this to have overflowed yet (the `prometheus_multiproc_dir` being `/dev/shm`). The `prometheus_multiproc_dir` key/variable seems not to have been already configured in this repo, nor visible when `gunicorn --print-config datacube_wps:app` runs in the container by itself (separate from the kubernetes deployment).
The container entrypoint script does attempt to manipulate config such as `datacube-wps-config.yaml` and `pywps.cfg` (both names mentioned in the error log), but they seem not to contain anything relevant.

Running `gunicorn --check-config datacube_wps:app` in the deployed containers seems ok now (I thought this printed a replication of the error report earlier??).
Toggling the commenting-out of `prometheus_multiproc_dir` in the helm chart does seem to toggle whether running e.g. `curl '127.0.0.1:8000/wps/?service=WPS&version=1.0.0&request=getcapabilities'` in the deployed container will 500 (and reproduce the trace into the kubernetes log) or work fine (giving details of WPS processes).
Currently seeing a slightly altered form of the trace than before:
```
[2021-10-23 00:23:32,647] ERROR in app: Exception on / [GET]
Traceback (most recent call last):
  File "/env/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/env/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/env/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/env/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/env/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/env/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/code/datacube_wps/impl.py", line 50, in wps
    service.append(Service(read_process_catalog('datacube-wps-config.yaml'), ['pywps.cfg']))
  File "/env/lib/python3.8/site-packages/pywps/app/Service.py", line 50, in __init__
    config.load_configuration(cfgfiles)
  File "/env/lib/python3.8/site-packages/pywps/configuration.py", line 70, in load_configuration
    CONFIG = configparser.ConfigParser(os.environ)
  File "/usr/lib/python3.8/configparser.py", line 639, in __init__
    self._read_defaults(defaults)
  File "/usr/lib/python3.8/configparser.py", line 1219, in _read_defaults
    self.read_dict({self.default_section: defaults})
  File "/usr/lib/python3.8/configparser.py", line 752, in read_dict
    raise DuplicateOptionError(section, key, source)
configparser.DuplicateOptionError: While reading from '<dict>': option 'prometheus_multiproc_dir' in section 'DEFAULT' already exists
```
Not sure how to interpret the trace exactly..

Presumably gunicorn organises separate processes for the flask app to handle web requests. Depending on which resource is being requested, these may be dispatched to `datacube_wps.impl.create_app().wps`. This function starts out by ensuring a pywps `Service` is instantiated (with an appropriate config file), which prompts pywps to run `configparser.ConfigParser(os.environ)` to pre-populate a fresh config's "DEFAULT" section from a dictionary of environment variables. I'm struggling to replicate the failure of that line. The parser is supposed to create an empty set and iteratively add lowercased variable names to it, checking that none is encountered twice. You wouldn't expect a duplicate environment variable to even be possible, but apparently environments are just lists of strings (potentially leaving it up to applications to interpret them as key-value pairs and to decide how to handle any duplicate keys).
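In fact the duplicate needn't be literal: `ConfigParser` lowercases option names (via its default `optionxform`), so two distinct environment variables whose names differ only in case collide. A minimal sketch reproducing the error from the trace (the `/dev/shm` value is assumed, for illustration):

```python
import configparser

# Two distinct keys that both lowercase to 'prometheus_multiproc_dir'.
# ConfigParser applies optionxform (str.lower by default) to option names,
# so read_dict sees the same option twice and raises DuplicateOptionError.
env = {
    "prometheus_multiproc_dir": "/dev/shm",  # assumed value, for illustration
    "PROMETHEUS_MULTIPROC_DIR": "/dev/shm",
}
try:
    configparser.ConfigParser(env)
except configparser.DuplicateOptionError as exc:
    print(exc)
```

This matches the log message exactly: the source is `'<dict>'` because the defaults are funnelled through `read_dict`.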
I deployed an instrumented version of the image (from a different docker hub repo). The logs revealed that this package is imported by the gunicorn worker process, and that both variables `prometheus_multiproc_dir` and `PROMETHEUS_MULTIPROC_DIR` were identically set (by the time the package had finished importing).
The lowercase version was configured by the helm chart. I'm not sure why lowercase was originally specified (or advised), as this violates the convention (that applications interpret the environment as a list of key-value pairs using uppercase key names) and so contributes to the failure of PyWPS.
The uppercase duplication was set by `prometheus_client.multiprocess.MultiProcessCollector`, which is called by the prometheus flask exporter. It turns out this is supposed to be accompanied by a deprecation warning, but by default python filters out (ignores) deprecation warnings, so it isn't visible in our logs.
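That silencing is easy to demonstrate; a minimal sketch (the warning text here is a stand-in, not the exact message prometheus_client emits):

```python
import warnings

def emit():
    # Stand-in for the library code path that warns about the
    # lowercase variable name being deprecated.
    warnings.warn("prometheus_multiproc_dir is deprecated", DeprecationWarning)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore")  # DeprecationWarning suppressed, as in our logs
    emit()
    print(len(caught))               # → 0, nothing reaches the log

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # e.g. achievable via python -W always
    emit()
    print(len(caught))               # → 1, the warning is now visible
```

Running the worker with `-W always` (or an equivalent `PYTHONWARNINGS` setting) would have surfaced the hint much earlier.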
The fix is to use only the uppercase name.
At present, `wps/?service=WPS&request=DescribeProcess&version=1.0.0&Identifier=WOfSDrill` consistently returns 500 Internal Server Error ("Either the server is overloaded or there is an error in the application."). Thanks @GypsyBojangles for reporting. Similarly, `wps/?service=WPS&request=getcapabilities&version=1.0.0` gives the same 500 error. However, `wps/metrics` works (and looks vaguely sensible). Likewise, `wps/ping` claims the system is healthy. `kubectl` does not note any pod restarts.