teamhephy / controller

Hephy Workflow Controller (API)
https://teamhephy.com
MIT License

(app::deploy): list index out of range while deis pull #84

Closed edisonwang closed 5 years ago

edisonwang commented 5 years ago

I've been hitting this issue since yesterday and couldn't find the cause; logs are attached below. I use self-hosted GitLab CI to automatically build and deploy to a Deis cluster, and everything worked fine until this happened. It now happens across all my Deis apps (I tried 3, and all have the same problem). I tried removing this app and creating a new one (the log below is from the freshly created app). As the log shows, it sometimes succeeds, but most of the time it just errors out.

Error:

$deis pull -a xxxxx-staging gitlab.xxxxx.com:4567/edison/xxxxx:1fbd2888
Creating build... Error: Unknown Error (400): {"detail":"(app::deploy): list index out of range"}
$deis releases -a xxxxx-staging
=== chromahub-staging Releases
v14 2018-11-08T22:38:17Z    edison deployed 5ec34b8 which failed
v13 2018-11-08T22:29:21Z    gitlab-ci deployed 4158375 which failed
v12 2018-11-08T12:48:12Z    gitlab-ci deployed gitlab.edisonnotes.com:4567/edison/xxxxx:81f13522
v11 2018-11-08T11:22:12Z    gitlab-ci deployed e897978 which failed
v10 2018-11-08T10:53:11Z    edison deployed 046364a which failed
v9  2018-11-08T10:51:45Z    gitlab-ci deployed b4ca123 which failed
v8  2018-11-08T10:14:48Z    edison deployed gitlab.edisonnotes.com:4567/edison/xxxxx:70a84faa
v7  2018-11-08T10:07:27Z    edison changed DATABASE_URL
v6  2018-11-08T09:45:31Z    edison deployed gitlab.edisonnotes.com:4567/edison/xxxxx:67639b16
v5  2018-11-08T09:45:18Z    edison added registry info username, password
v4  2018-11-08T09:29:12Z    gitlab-ci deployed 85d18f4 which failed
v3  2018-11-08T09:28:36Z    edison added DJANGO_AWS_SECRET_ACCESS_KEY, STRIPE_TEST_PUBLIC_KEY, PORT, MAILGUN_API_KEY, STRIPE_LIVE_SECRET_KEY, DJANGO_ACCOUNT_ALLOW_REGISTRATION, DJANGO_SECURE_SSL_REDIRECT, STRIPE_LIVE_MODE, DJANGO_SECRET_KEY, REDIS_URL, STRIPE_TEST_SECRET_KEY, DJANGO_ALLOWED_HOSTS, MAILGUN_DOMAIN, DJANGO_AWS_STORAGE_BUCKET_NAME, DJANGO_SERVER_EMAIL, STRIPE_LIVE_PUBLIC_KEY, DJANGO_SETTINGS_MODULE, DATABASE_URL, DJANGO_ADMIN_URL, DJANGO_AWS_ACCESS_KEY_ID, SENTRY_DSN, WEB_CONCURRENCY
v2  2018-11-08T09:27:43Z    gitlab-ci deployed 833a474 which failed
v1  2018-11-08T09:24:32Z    edison created initial release

Controller log:

INFO [xxxxx-staging]: build xxxx-staging-4158375 created
INFO:api.models.app:[xxxxx-staging]: build xxx-staging-4158375 created
INFO [xxxxx-staging]: gitlab-ci deployed gitlab.xxxxx.com:4567/edison/xxxxx:1fbd2888
INFO:api.models.app:[xxxx-staging]: gitlab-ci deployed gitlab.xxxxx.com:4567/edison/chromacon_artist_hub:1fbd2888
INFO [xxxxx-staging]: gitlab.xxxxx:4567/edison/xxxxx:1fbd2888 exists in the target registry. Using image for release 13 of app xxxxx-staging
INFO:api.models.app:[xxxxx-staging]: gitlab.xxxxx.com:4567/edison/xxxxx:1fbd2888 exists in the target registry. Using image for release 13 of app xxxxx-staging
INFO [xxxxx-staging]: adding 5s on to the original 120s timeout to account for the initial delay specified in the liveness / readiness probe
INFO:scheduler:[xxxxx-staging]: adding 5s on to the original 120s timeout to account for the initial delay specified in the liveness / readiness probe
INFO [xxxxx-staging]: This deployments overall timeout is 125s - batch timeout is 125s and there are 1 batches to deploy with a total of 1 pods
INFO:scheduler:[xxxxx-staging]: This deployments overall timeout is 125s - batch timeout is 125s and there are 1 batches to deploy with a total of 1 pods
ERROR [xxxxx-staging]: (app::deploy): list index out of range
ERROR:api.models.app:[xxxxx-staging]: (app::deploy): list index out of range
ERROR:root:(app::deploy): list index out of range
Traceback (most recent call last):
  File "/app/api/models/app.py", line 543, in deploy
    async_run(tasks)
  File "/app/api/utils.py", line 169, in async_run
    raise error
  File "/usr/lib/python3.5/asyncio/tasks.py", line 241, in _step
    result = coro.throw(exc)
  File "/app/api/utils.py", line 181, in async_task
    await loop.run_in_executor(None, params)
  File "/usr/lib/python3.5/asyncio/futures.py", line 361, in __iter__
    yield self  # This tells Task to wait for completion.
  File "/usr/lib/python3.5/asyncio/tasks.py", line 296, in _wakeup
    future.result()
  File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
    raise self._exception
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 55, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/app/scheduler/__init__.py", line 270, in deploy
    namespace, name, image, entrypoint, command, **kwargs
  File "/app/scheduler/resources/deployment.py", line 138, in update
    self.wait_until_ready(namespace, name, **kwargs)
  File "/app/scheduler/resources/deployment.py", line 336, in wait_until_ready
    self._check_for_failed_events(namespace, labels=labels)
  File "/app/scheduler/resources/deployment.py", line 373, in _check_for_failed_events
    'involvedObject.name': data['items'][0]['metadata']['name'],
IndexError: list index out of range

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/api/models/build.py", line 65, in create
    self.app.deploy(new_release)
  File "/app/api/models/app.py", line 562, in deploy
    raise ServiceUnavailable(err) from e
api.exceptions.ServiceUnavailable: (app::deploy): list index out of range

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/rest_framework/views.py", line 486, in dispatch
    response = handler(request, *args, **kwargs)
  File "/app/api/views.py", line 185, in create
    return super(AppResourceViewSet, self).create(request, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/rest_framework/mixins.py", line 21, in create
    self.perform_create(serializer)
  File "/app/api/viewsets.py", line 21, in perform_create
    self.post_save(obj)
  File "/app/api/views.py", line 268, in post_save
    self.release = build.create(self.request.user)
  File "/app/api/models/build.py", line 81, in create
    raise DeisException(str(e)) from e
api.exceptions.DeisException: (app::deploy): list index out of range
10.40.0.88 "POST /v2/apps/xxxxx-staging/builds/ HTTP/1.1" 400 51 "Deis Client v2.18.0"

Cryptophobia commented 5 years ago

Hi @edisonwang, it looks like the failure is here in the controller API code:

    def _check_for_failed_events(self, namespace, labels):
        """
        Request for new ReplicaSet of Deployment and search for failed events involved by that RS
        Raises: KubeException when RS have events with FailedCreate reason
        """
        response = self.rs.get(namespace, labels=labels)
        data = response.json()
        fields = {
            'involvedObject.kind': 'ReplicaSet',
            'involvedObject.name': data['items'][0]['metadata']['name'],
            'involvedObject.namespace': namespace,
            'involvedObject.uid': data['items'][0]['metadata']['uid'],
        }

I noticed the docstring says "Raises: KubeException when RS have events with FailedCreate reason". Do your ReplicaSets have any events with a FailedCreate reason? That could be why your deployments are failing with the above exceptions.
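For reference, the IndexError in the traceback means data['items'] was an empty list when it was indexed, i.e. the ReplicaSet lookup returned no items at that moment. Below is a minimal sketch of a guard that would surface a clearer message. It is not the controller's actual code: the function name first_replicaset_fields and the stand-in KubeException class are hypothetical, and the reasons mentioned in the error text are only guesses.

```python
class KubeException(Exception):
    """Stand-in for the scheduler's exception type (hypothetical in this sketch)."""


def first_replicaset_fields(data, namespace):
    """Build the event field selector from a ReplicaSet list response.

    `data` is assumed to be the decoded JSON of a ReplicaSet list,
    i.e. a dict with an 'items' list, as in the snippet above.
    """
    items = data.get('items') or []
    if not items:
        # Fail with a descriptive error instead of a bare IndexError.
        raise KubeException(
            'no ReplicaSet found in namespace {}: it may not have been '
            'created yet, or the label selector matched nothing'.format(namespace)
        )
    metadata = items[0]['metadata']
    return {
        'involvedObject.kind': 'ReplicaSet',
        'involvedObject.name': metadata['name'],
        'involvedObject.namespace': namespace,
        'involvedObject.uid': metadata['uid'],
    }
```

With a guard like this, an empty result would be reported as a missing ReplicaSet rather than as "list index out of range".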

Cryptophobia commented 5 years ago

@edisonwang Can you check your ReplicaSets for any of the common deployment failures? Here is a list of possible causes:

https://kukulinski.com/10-most-common-reasons-kubernetes-deployments-fail-part-2/

Could it be number (6) Resource Quotas or (7) Insufficient Cluster Resources?

The best way to check is to run kubectl describe rs ... and look at the events on the ReplicaSets while the errors are happening.
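As a rough sketch of what to look for in those events, assuming the event list has been exported as JSON (for example with kubectl get events -o json) and loaded into a dict, something like the following could pick out the failing ReplicaSet events. The function name failed_replicaset_events is hypothetical; the keys it reads (items, involvedObject, reason, message) are standard fields of Kubernetes Event objects.

```python
def failed_replicaset_events(event_list, reason='FailedCreate'):
    """Return (replicaset_name, message) pairs for events with the given reason.

    `event_list` is assumed to be a decoded Kubernetes event list,
    i.e. a dict with an 'items' list of Event objects.
    """
    failures = []
    for event in event_list.get('items', []):
        involved = event.get('involvedObject', {})
        # Keep only events attached to a ReplicaSet with the requested reason.
        if involved.get('kind') == 'ReplicaSet' and event.get('reason') == reason:
            failures.append((involved.get('name'), event.get('message')))
    return failures
```

The message on each matching event usually states the underlying cause, such as an exceeded quota.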

edisonwang commented 5 years ago

Hi, thanks for the answer. The problem went away after a restart, but I then reconfigured Hephy to use an S3 backend and lost the database, so I ended up reinstalling the whole thing. I also suspect it's a cluster failure or something events-related, not necessarily a Deis issue, but the error message is rather confusing. I'll keep an eye out and get back with more logs if it happens again.

Cryptophobia commented 5 years ago

Alright, sounds good! From the exception that is thrown, it looks like it could be resource limits or some other event raised on the ReplicaSet. I'm going to close this issue for now; feel free to reopen it if it recurs.