So firstly, the good: if I stop the masu worker, or deploy without the masu worker, everything works as expected. I can hit the masu endpoint and it informs me that "No running Celery workers were found.", which is correct. If I deploy without celery, masu still starts up correctly and the API endpoint is accessible.
The potential issues that might need further investigation are the following:
If I try to deploy without rabbit, masu won't start and I see errors in the logs. After talking with @adberglund, we think it should still be able to start the pod correctly. Similarly, if I stop rabbit, masu's endpoint is no longer accessible; we hit an internal 500 error. Not sure if this has any relevance, but while rabbit is down I can still hit koku's endpoint.
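For context on why we think the pod should still come up: instantiating a Celery app is lazy and doesn't open a broker connection by itself, so in principle nothing at import time needs rabbit. A tiny sketch of that assumption (the broker hostname and task name here are placeholders, not masu's actual values):
```
# Sketch of the lazy-connection assumption: creating the app does no
# network I/O, so masu should be able to start with RabbitMQ absent.
from celery import Celery

celery_app = Celery('masu', broker='amqp://rabbitmq:5672//')  # placeholder URL

# The broker is only contacted on use, which is where an error would surface:
# celery_app.send_task('some.task')  # hypothetical task name
```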
Also, just noticed that after rabbit's been down for 5 mins or so, masu starts to half-work again. I see this in the logs:
```
OSError: failed to resolve broker hostname
10.131.0.1 - - [04/Dec/2018:18:33:36 +0000] "GET /api/v1/status/ HTTP/1.1" 500 291 "-" "kube-probe/1.11+"
10.131.0.1 - - [04/Dec/2018:18:33:42 +0000] "GET /api/v1/status/?liveness HTTP/1.1" 200 15 "-" "kube-probe/1.11+"
10.131.0.1 - - [04/Dec/2018:18:33:52 +0000] "GET /api/v1/status/?liveness HTTP/1.1" 200 15 "-" "kube-probe/1.11+"
```
And hitting the endpoint now shows 'Application not available' and the masu logs show this:
```
[2018-12-04 18:35:36,358] ERROR in app: Exception on /api/v1/status/ [GET]
Traceback (most recent call last):
  File "/opt/app-root/lib/python3.6/site-packages/amqp/transport.py", line 125, in _connect
    host, port, family, socket.SOCK_STREAM, SOL_TCP)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/app-root/lib/python3.6/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/app-root/lib/python3.6/site-packages/flask/app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/app-root/lib/python3.6/site-packages/flask/app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/opt/app-root/lib/python3.6/site-packages/flask/_compat.py", line 35, in reraise
    raise value
  File "/opt/app-root/lib/python3.6/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/app-root/lib/python3.6/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/opt/app-root/src/masu/api/status.py", line 52, in get_status
    'celery_status': app_status.celery_status,
  File "/opt/app-root/src/masu/api/status.py", line 85, in celery_status
    conn.heartbeat_check()
  File "/opt/app-root/lib/python3.6/site-packages/kombu/connection.py", line 290, in heartbeat_check
    return self.transport.heartbeat_check(self.connection, rate=rate)
  File "/opt/app-root/lib/python3.6/site-packages/kombu/connection.py", line 802, in connection
    self._connection = self._establish_connection()
  File "/opt/app-root/lib/python3.6/site-packages/kombu/connection.py", line 757, in _establish_connection
    conn = self.transport.establish_connection()
  File "/opt/app-root/lib/python3.6/site-packages/kombu/transport/pyamqp.py", line 130, in establish_connection
    conn.connect()
  File "/opt/app-root/lib/python3.6/site-packages/amqp/connection.py", line 302, in connect
    self.transport.connect()
  File "/opt/app-root/lib/python3.6/site-packages/amqp/transport.py", line 79, in connect
    self._connect(self.host, self.port, self.connect_timeout)
  File "/opt/app-root/lib/python3.6/site-packages/amqp/transport.py", line 136, in _connect
    "failed to resolve broker hostname"))
OSError: failed to resolve broker hostname
```
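For what it's worth, the traceback points at `celery_status` in `masu/api/status.py` calling `conn.heartbeat_check()` with nothing catching the broker error, so it bubbles up as a 500. A minimal sketch of what a graceful version could look like (the route layout and helper names here are assumptions based on the traceback, not masu's actual code):
```
import kombu
from flask import Flask, jsonify, request

app = Flask(__name__)
BROKER_URL = 'amqp://rabbitmq:5672//'  # placeholder broker URL

def celery_status():
    """Probe the broker, returning an error payload instead of raising."""
    try:
        with kombu.Connection(BROKER_URL, connect_timeout=1) as conn:
            conn.heartbeat_check()  # the call that raises in the traceback above
    except OSError as err:
        # Covers socket.gaierror ("failed to resolve broker hostname") and
        # connection-refused errors, both OSError subclasses.
        return {'Error': str(err)}
    return {'Status': 'broker reachable'}

@app.route('/api/v1/status/')
def get_status():
    if 'liveness' in request.args:
        # Broker-free path; this matches the probe log above, where
        # ?liveness keeps returning 200 while the plain status call 500s.
        return jsonify({'alive': True})
    return jsonify({'celery_status': celery_status()})
```
With something like this, taking rabbit down would turn the broker error into part of the status payload instead of an internal 500.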
Also, just a little background: I was deploying everything with the iqe oc deploy scripts. I also cheated slightly: since the scripts deploy everything, I was quickly stopping pods before they actually _started_. I don't think this has any adverse effect; just thought it was worth mentioning.
@lcouzens As a refresher from a month ago, the status endpoint should now work without rabbit being present.
Discussed with @adberglund; we're still seeing similar issues.
Do we know what is left to do here? It feels like this is just an open-ended issue, when we should really have specific bugs filed for the problems we are seeing.
User Story
As a dev, I want to ensure that Celery is configured for production so that we are doing task processing in the most efficient manner.
Impacts
Backend, Ops
Implementation Details
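One possible starting point, sketched as an assumption rather than the decided approach: a handful of Celery settings that commonly matter in production. The keys are real Celery 4.x options, but the values and broker hostname are illustrative.
```
# Hedged sketch of production-oriented Celery settings; values are examples,
# not the project's decided configuration.
from celery import Celery

app = Celery('masu', broker='amqp://rabbitmq:5672//')  # placeholder URL

app.conf.update(
    task_acks_late=True,               # re-queue a task if its worker dies mid-run
    worker_prefetch_multiplier=1,      # fair dispatch for long-running tasks
    worker_max_tasks_per_child=100,    # recycle workers to bound memory growth
    broker_connection_timeout=5,       # fail fast when RabbitMQ is unreachable
    broker_connection_max_retries=10,  # stop retrying instead of hanging forever
)
```
The connection-timeout and retry settings in particular tie back to the startup and 500 behavior discussed above.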
Acceptance Criteria