project-koku / masu

This is a READ ONLY repo. See https://github.com/project-koku/koku for the current masu implementation.
GNU Affero General Public License v3.0

Get celery worker production ready #169

Closed adberglund closed 5 years ago

adberglund commented 6 years ago

User Story

As a dev, I want to ensure that Celery is configured for production so that we are processing tasks as efficiently as possible.
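
As a starting point, here is a minimal sketch of what production-oriented settings might look like, assuming a standard Celery app module (the setting names are real Celery configuration keys; the values, app name, and broker URL are illustrative, not masu's actual config):

```python
from celery import Celery

# Illustrative Celery production settings (sketch, not masu's real config).
app = Celery('masu', broker='amqp://rabbitmq:5672//')

app.conf.update(
    task_acks_late=True,              # ack after the task finishes, so a dead worker re-delivers
    worker_prefetch_multiplier=1,     # don't let one worker hoard long-running tasks
    task_reject_on_worker_lost=True,  # requeue a task whose worker was killed
    broker_heartbeat=60,              # detect dropped broker connections
)
```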

Impacts

Backend, Ops

Implementation Details

Acceptance Criteria

lcouzens commented 5 years ago

So firstly, the good: if I stop the masu worker or deploy without the masu worker, everything works as expected. I can hit the masu endpoint and it informs me that "No running Celery workers were found.", which is correct. If I deploy without celery, masu still starts up correctly and the API endpoint is accessible.

The potential issues that might need further investigation are the following.

If I try to deploy without rabbit, masu won't start and I see errors in the logs. After talking with @adberglund, we think it should still be able to start the pod correctly. Similarly, if I stop rabbit, masu's endpoint is no longer accessible; we hit an internal 500 error. Not sure if this has any relevance, but while rabbit is down I can still hit koku's endpoint.
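
If the intent is for the pod to come up even when rabbit is absent, one hedged option (assuming the standard Celery settings apply here; this is a sketch, not masu's actual config) is to let the broker connection retry indefinitely instead of erroring out:

```python
from celery import Celery

# Sketch: keep retrying the broker connection rather than failing
# when rabbit is down. Broker URL is illustrative.
app = Celery('masu', broker='amqp://rabbitmq:5672//')
app.conf.broker_connection_retry = True        # retry lost connections
app.conf.broker_connection_max_retries = None  # None means retry forever
```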

Also, I just noticed that after rabbit's been down for 5 minutes or so, masu starts to half work again. I see this in the logs:


```
OSError: failed to resolve broker hostname
10.131.0.1 - - [04/Dec/2018:18:33:36 +0000] "GET /api/v1/status/ HTTP/1.1" 500 291 "-" "kube-probe/1.11+"
10.131.0.1 - - [04/Dec/2018:18:33:42 +0000] "GET /api/v1/status/?liveness HTTP/1.1" 200 15 "-" "kube-probe/1.11+"
10.131.0.1 - - [04/Dec/2018:18:33:52 +0000] "GET /api/v1/status/?liveness HTTP/1.1" 200 15 "-" "kube-probe/1.11+"
```

And hitting the endpoint now shows 'Application not available' and the masu logs show this:

```
[2018-12-04 18:35:36,358] ERROR in app: Exception on /api/v1/status/ [GET]
Traceback (most recent call last):
  File "/opt/app-root/lib/python3.6/site-packages/amqp/transport.py", line 125, in _connect
    host, port, family, socket.SOCK_STREAM, SOL_TCP)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/app-root/lib/python3.6/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/app-root/lib/python3.6/site-packages/flask/app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/app-root/lib/python3.6/site-packages/flask/app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/opt/app-root/lib/python3.6/site-packages/flask/_compat.py", line 35, in reraise
    raise value
  File "/opt/app-root/lib/python3.6/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/app-root/lib/python3.6/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/opt/app-root/src/masu/api/status.py", line 52, in get_status
    'celery_status': app_status.celery_status,
  File "/opt/app-root/src/masu/api/status.py", line 85, in celery_status
    conn.heartbeat_check()
  File "/opt/app-root/lib/python3.6/site-packages/kombu/connection.py", line 290, in heartbeat_check
    return self.transport.heartbeat_check(self.connection, rate=rate)
  File "/opt/app-root/lib/python3.6/site-packages/kombu/connection.py", line 802, in connection
    self._connection = self._establish_connection()
  File "/opt/app-root/lib/python3.6/site-packages/kombu/connection.py", line 757, in _establish_connection
    conn = self.transport.establish_connection()
  File "/opt/app-root/lib/python3.6/site-packages/kombu/transport/pyamqp.py", line 130, in establish_connection
    conn.connect()
  File "/opt/app-root/lib/python3.6/site-packages/amqp/connection.py", line 302, in connect
    self.transport.connect()
  File "/opt/app-root/lib/python3.6/site-packages/amqp/transport.py", line 79, in connect
    self._connect(self.host, self.port, self.connect_timeout)
  File "/opt/app-root/lib/python3.6/site-packages/amqp/transport.py", line 136, in _connect
    "failed to resolve broker hostname"))
OSError: failed to resolve broker hostname
```
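
The unhandled failure is visible in the trace: `celery_status` in `masu/api/status.py` calls `conn.heartbeat_check()`, and the resulting `OSError` propagates up through the Flask view as a 500. A minimal defensive sketch, assuming a kombu `Connection` like the one in the trace (the function shape and return value are illustrative, not masu's actual code):

```python
from kombu import Connection


def celery_status(broker_url):
    """Report broker reachability instead of letting an OSError
    from kombu turn the status view into a 500."""
    try:
        # Short connect_timeout so a dead broker can't hang the probe.
        with Connection(broker_url, connect_timeout=1) as conn:
            conn.ensure_connection(max_retries=1)
        return {'broker': 'connected'}
    except OSError as err:  # includes socket.gaierror ("failed to resolve broker hostname")
        return {'broker': 'unavailable: {}'.format(err)}
```

With something like this in place, `/api/v1/status/` could return 200 with a degraded `celery_status` field while rabbit is down, matching what the `?liveness` variant already does.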

Also, just a little background: I was deploying everything with the iqe oc deploy scripts. I also cheated slightly; since it deploys everything, I was quickly stopping pods before they actually _started_. I don't think this has any adverse effect, just thought it was worth mentioning.

adberglund commented 5 years ago

@lcouzens As a refresher from a month ago, the status endpoint should now work without rabbit being present.

lcouzens commented 5 years ago

Discussed with @adberglund; still seeing similar issues.

chambridge commented 5 years ago

Do we know what is left to do here? Feels like it's just an open issue; we should really file specific bugs for anything we are seeing.