project-koku / masu

This is a READ ONLY repo. See https://github.com/project-koku/koku for the current masu implementation.
GNU Affero General Public License v3.0

Get celery worker production ready #169

Closed adberglund closed 5 years ago

adberglund commented 6 years ago

User Story

As a dev, I want to ensure that Celery is configured for production so that we are processing tasks as efficiently as possible.
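
As a starting point, here is a minimal sketch of what production-oriented settings might look like, assuming a standard Celery app module (the setting names are real Celery configuration keys; the values, app name, and broker URL are illustrative, not masu's actual config):

```python
from celery import Celery

# Illustrative Celery production settings (sketch, not masu's real config).
app = Celery('masu', broker='amqp://rabbitmq:5672//')

app.conf.update(
    task_acks_late=True,              # ack after the task finishes, so a dead worker re-delivers
    worker_prefetch_multiplier=1,     # don't let one worker hoard long-running tasks
    task_reject_on_worker_lost=True,  # requeue a task whose worker was killed
    broker_heartbeat=60,              # detect dropped broker connections
)
```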

Impacts

Backend, Ops

Implementation Details

Acceptance Criteria

lcouzens commented 5 years ago

So firstly, the good: if I stop the masu worker or deploy without the masu worker, everything works as expected. I can hit the masu endpoint and it informs me that "No running Celery workers were found.", which is correct. If I deploy without celery, masu still starts up correctly and the API endpoint is accessible.

The potential issues that might need further investigation are the following.

If I try to deploy without rabbit, masu won't start and I see errors in the logs. After talking with @adberglund, we think it should still be able to start the pod correctly. Similarly, if I stop rabbit, masu's endpoint is no longer accessible; we hit an internal 500 error. Not sure if this has any relevance, but while rabbit is down I can still hit koku's endpoint.
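
If the intent is for the pod to come up even when rabbit is absent, one hedged option (assuming the standard Celery settings apply here; this is a sketch, not masu's actual config) is to let the broker connection retry indefinitely instead of erroring out:

```python
from celery import Celery

# Sketch: keep retrying the broker connection rather than failing
# when rabbit is down. Broker URL is illustrative.
app = Celery('masu', broker='amqp://rabbitmq:5672//')
app.conf.broker_connection_retry = True        # retry lost connections
app.conf.broker_connection_max_retries = None  # None means retry forever
```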

Also, I just noticed that after rabbit's been down for 5 minutes or so, masu starts to half work again. I see this in the logs:


```
OSError: failed to resolve broker hostname
10.131.0.1 - - [04/Dec/2018:18:33:36 +0000] "GET /api/v1/status/ HTTP/1.1" 500 291 "-" "kube-probe/1.11+"
10.131.0.1 - - [04/Dec/2018:18:33:42 +0000] "GET /api/v1/status/?liveness HTTP/1.1" 200 15 "-" "kube-probe/1.11+"
10.131.0.1 - - [04/Dec/2018:18:33:52 +0000] "GET /api/v1/status/?liveness HTTP/1.1" 200 15 "-" "kube-probe/1.11+"
```

And hitting the endpoint now shows 'Application not available' and the masu logs show this:

```
[2018-12-04 18:35:36,358] ERROR in app: Exception on /api/v1/status/ [GET]
Traceback (most recent call last):
  File "/opt/app-root/lib/python3.6/site-packages/amqp/transport.py", line 125, in _connect
    host, port, family, socket.SOCK_STREAM, SOL_TCP)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/app-root/lib/python3.6/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/app-root/lib/python3.6/site-packages/flask/app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/app-root/lib/python3.6/site-packages/flask/app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/opt/app-root/lib/python3.6/site-packages/flask/_compat.py", line 35, in reraise
    raise value
  File "/opt/app-root/lib/python3.6/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/app-root/lib/python3.6/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/opt/app-root/src/masu/api/status.py", line 52, in get_status
    'celery_status': app_status.celery_status,
  File "/opt/app-root/src/masu/api/status.py", line 85, in celery_status
    conn.heartbeat_check()
  File "/opt/app-root/lib/python3.6/site-packages/kombu/connection.py", line 290, in heartbeat_check
    return self.transport.heartbeat_check(self.connection, rate=rate)
  File "/opt/app-root/lib/python3.6/site-packages/kombu/connection.py", line 802, in connection
    self._connection = self._establish_connection()
  File "/opt/app-root/lib/python3.6/site-packages/kombu/connection.py", line 757, in _establish_connection
    conn = self.transport.establish_connection()
  File "/opt/app-root/lib/python3.6/site-packages/kombu/transport/pyamqp.py", line 130, in establish_connection
    conn.connect()
  File "/opt/app-root/lib/python3.6/site-packages/amqp/connection.py", line 302, in connect
    self.transport.connect()
  File "/opt/app-root/lib/python3.6/site-packages/amqp/transport.py", line 79, in connect
    self._connect(self.host, self.port, self.connect_timeout)
  File "/opt/app-root/lib/python3.6/site-packages/amqp/transport.py", line 136, in _connect
    "failed to resolve broker hostname"))
OSError: failed to resolve broker hostname
```
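
The unhandled failure is visible in the trace: `celery_status` in `masu/api/status.py` calls `conn.heartbeat_check()`, and the resulting `OSError` propagates up through the Flask view as a 500. A minimal defensive sketch, assuming a kombu `Connection` like the one in the trace (the function shape and return value are illustrative, not masu's actual code):

```python
from kombu import Connection


def celery_status(broker_url):
    """Report broker reachability instead of letting an OSError
    from kombu turn the status view into a 500."""
    try:
        # Short connect_timeout so a dead broker can't hang the probe.
        with Connection(broker_url, connect_timeout=1) as conn:
            conn.ensure_connection(max_retries=1)
        return {'broker': 'connected'}
    except OSError as err:  # includes socket.gaierror ("failed to resolve broker hostname")
        return {'broker': 'unavailable: {}'.format(err)}
```

With something like this in place, `/api/v1/status/` could return 200 with a degraded `celery_status` field while rabbit is down, matching what the `?liveness` variant already does.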

Also, just a little background: I was deploying everything with the iqe oc deploy scripts. I also cheated slightly; since it deploys everything, I was quickly stopping pods before they actually _started_. I don't think this has any adverse effect, just thought it was worth mentioning.

adberglund commented 5 years ago

@lcouzens As a refresher from a month ago, the status endpoint should now work without rabbit being present.

lcouzens commented 5 years ago

Discussed with @adberglund; still seeing similar issues.

chambridge commented 5 years ago

Do we know what is left to do here? Feels like it's just an open issue; we should really file specific bugs for anything we are seeing.