mozilla / addons

☂ Umbrella repository for Mozilla Addons ✨

More graceful git-extraction failure recovery #1880

Open wagnerand opened 2 years ago

wagnerand commented 2 years ago

We have seen git-extraction fail for several reasons. For at least some of them, we should try automated recovery, and if that fails (or the extraction still doesn't succeed on the next attempt), alert a human.

One example is the following error:

OSError("failed to lock file '/mnt/efs/addons.mozilla.org/git-storage/97/0697/2720697/addon/.git/refs/heads/listed.lock' for writing: ")

This could potentially be recovered by:

AddonGitRepository(2720697).delete()
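
A minimal sketch of the "try automated recovery, then alert a human" flow for this lock-file case. Everything here is hypothetical and not taken from the olympia codebase: `extract`, `delete_repo` and `alert` are placeholder callables (in practice `delete_repo` might wrap `AddonGitRepository(addon_pk).delete()`), and matching on the error message is only an assumption about how a recoverable failure could be recognised.

```python
LOCK_ERROR_MARKER = "failed to lock file"  # assumed marker, taken from the error text above


def extract_with_recovery(addon_pk, extract, delete_repo, alert, max_attempts=2):
    """Run extract(addon_pk); on a lock error, delete the repository and retry.

    Returns True if an attempt succeeded. Calls alert(addon_pk, exc) and
    returns False if the error is not recognised as recoverable or if the
    last attempt also failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            extract(addon_pk)
            return True
        except OSError as exc:
            recoverable = LOCK_ERROR_MARKER in str(exc)
            if not recoverable or attempt == max_attempts:
                # Automated recovery is not possible or did not help.
                alert(addon_pk, exc)
                return False
            # Hypothetical recovery step: drop the repository so the next
            # attempt starts from a clean state.
            delete_repo(addon_pk)
    return False
```

Passing the three steps in as callables keeps the retry policy separate from the git and alerting code, so the same wrapper could be reused if other recoverable errors are identified later.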

Another example:

"Uncaught exception:
  File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 468, in trace_task
    I, R, state, retval = on_error(task_request, exc, uuid)
  File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 379, in on_error
    R = I.handle_error_state(
  File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 178, in handle_error_state
    return {
  File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 225, in handle_failure
    task.backend.mark_as_failure(
  File "/usr/local/lib/python3.9/site-packages/celery/backends/base.py", line 220, in mark_as_failure
    self._call_task_errbacks(request, exc, traceback)
  File "/usr/local/lib/python3.9/site-packages/celery/backends/base.py", line 243, in _call_task_errbacks
    errback(request, exc, traceback)
  File "/usr/local/lib/python3.9/site-packages/celery/canvas.py", line 168, in __call__
    return self.type(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/sentry_sdk/integrations/celery.py", line 200, in _inner
    reraise(*exc_info)
  File "/usr/local/lib/python3.9/site-packages/sentry_sdk/_compat.py", line 54, in reraise
    raise value
  File "/usr/local/lib/python3.9/site-packages/sentry_sdk/integrations/celery.py", line 195, in _inner
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 735, in __protected_call__
    return orig(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/celery/app/task.py", line 392, in __call__
    return self.run(*args, **kwargs)
  File "/data/olympia/src/olympia/amo/decorators.py", line 121, in wrapper
    return f(*args, **kw)
  File "/data/olympia/src/olympia/git/tasks.py", line 90, in on_extraction_error
    remove_git_extraction_entry(addon_pk)
  File "/usr/local/lib/python3.9/site-packages/celery/local.py", line 188, in __call__
    return self._get_current_object()(*a, **kw)
  File "/usr/local/lib/python3.9/site-packages/sentry_sdk/integrations/celery.py", line 200, in _inner
    reraise(*exc_info)
  File "/usr/local/lib/python3.9/site-packages/sentry_sdk/_compat.py", line 54, in reraise
    raise value
  File "/usr/local/lib/python3.9/site-packages/sentry_sdk/integrations/celery.py", line 195, in _inner
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 735, in __protected_call__
    return orig(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/celery/app/task.py", line 392, in __call__
    return self.run(*args, **kwargs)
  File "/data/olympia/src/olympia/amo/decorators.py", line 121, in wrapper
    return f(*args, **kw)
  File "/data/olympia/src/olympia/git/tasks.py", line 24, in remove_git_extraction_entry
    GitExtractionEntry.objects.filter(addon_id=addon_pk, in_progress=True).delete()
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 746, in delete
    deleted, _rows_count = collector.delete()
  File "/usr/local/lib/python3.9/site-packages/django/db/models/deletion.py", line 400, in delete
    with transaction.atomic(using=self.using, savepoint=False):
  File "/usr/local/lib/python3.9/site-packages/django/db/transaction.py", line 207, in __enter__
    connection.set_autocommit(False, force_begin_transaction_with_broken_autocommit=True)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/base/base.py", line 415, in set_autocommit
    self._set_autocommit(autocommit)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/mysql/base.py", line 272, in _set_autocommit
    self.connection.autocommit(autocommit)
  File "/usr/local/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/mysql/base.py", line 272, in _set_autocommit
    self.connection.autocommit(autocommit)
  File "/usr/local/lib/python3.9/site-packages/MySQLdb/connections.py", line 239, in autocommit
    _mysql.connection.autocommit(self, on)
<class 'django.db.utils.InterfaceError'>
InterfaceError(0, '')
"

Sentry issue: https://sentry.io/organizations/mozilla/issues/3190120152/?project=6310819&query=is%3Aunresolved (likely caused by https://sentry.io/organizations/mozilla/issues/3190120144/?project=6310819)

Our understanding is that the extraction itself might have succeeded, but in any case the task took so long that the server reached its threshold for open database connections and closed the connection. I am not sure what we could try here without human intervention, and I am open to any suggestions.
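
One possible direction, sketched under assumptions rather than tested against olympia: retry the cleanup once after discarding the stale connection. The helper below is hypothetical and deliberately framework-agnostic. In a Django setting, `reset_connection` might be `django.db.close_old_connections` and `is_connection_error` might check for `django.db.utils.InterfaceError`, but whether that is sufficient for this failure is an open question.

```python
def run_with_fresh_connection(operation, reset_connection, is_connection_error, retries=1):
    """Call operation(); if it fails with a connection error, reset and retry.

    All three callables are supplied by the caller (hypothetical names).
    Errors that are not connection errors, or that persist after the last
    retry, are re-raised unchanged so they can still reach a human.
    """
    for attempt in range(retries + 1):
        try:
            return operation()
        except Exception as exc:
            if not is_connection_error(exc) or attempt == retries:
                raise
            # Drop the dead connection so the next attempt opens a new one.
            reset_connection()
```

Retrying is only safe because the failing cleanup, deleting the in-progress `GitExtractionEntry` rows for the add-on, has the same end result if it runs twice. A non-idempotent operation would need more care.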

Other failures might not be as easy to recover from automatically, or might not be recoverable at all, in which case we should alert a human.

┆Issue is synchronized with this Jira Task

KevinMind commented 4 months ago

Old Jira Ticket: https://mozilla-hub.atlassian.net/browse/ADDSRV-96

wagnerand commented 1 month ago

@diox @willdurand will we still need git-extraction once code-manager has been decommissioned? Related tickets: