dccurtis closed this issue 5 years ago
Looking into this to see if the exception paths are contributing to #315
Couple of options to address this:

1. Add exception handling on all database accessors to gracefully handle the case when the provider is removed while the masu celery tasks are running.
2. Add an `in progress` flag on the provider table. Update koku to block provider removals when `in progress` is True.
3. Change the `Orchestrator` to be some sort of an event loop (i.e. asyncio) which can use a Postgres advisory lock or something similar to lock the provider entry while celery tasks are in progress, so that koku cannot remove the provider until all async operations have finished. (A rough sketch of the advisory-lock idea is below.)

Moving this issue back to the backlog for now since I'd like to have more of a justification before tackling this.
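For reference, the advisory-lock idea in option 3 could look roughly like the following. This is a minimal sketch, not koku's actual code; the direct psycopg2 connection and the assumption that the lock key is a numeric provider id are mine.

```python
import psycopg2  # sketch assumes a raw Postgres connection rather than the Django ORM

def with_provider_lock(conn, provider_id, work):
    """Hold a Postgres advisory lock keyed on the provider while work runs.

    A delete in koku would take the same lock before removing the provider
    row, so it blocks until in-flight masu processing has finished.
    """
    with conn.cursor() as cur:
        # pg_advisory_lock blocks until the lock becomes available.
        cur.execute("SELECT pg_advisory_lock(%s)", (provider_id,))
        try:
            work()
        finally:
            cur.execute("SELECT pg_advisory_unlock(%s)", (provider_id,))
```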
Provider CRUD testing is hitting this a lot and preventing some tests from being run as smoke tests. I found an easy way to block removing providers while processing is underway by looking at the processing statistics; a sketch of the check is below.
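One way that check could be written, assuming a stats table that records per-report start and completion timestamps (the model and field names here are hypothetical, not koku's actual schema):

```python
from django.db import models

class ReportProcessingStats(models.Model):
    """Hypothetical stand-in for the stats table masu writes while processing."""
    provider_uuid = models.UUIDField()
    processing_started_datetime = models.DateTimeField(null=True)
    processing_completed_datetime = models.DateTimeField(null=True)

def provider_processing_in_progress(provider_uuid):
    """True if any report for this provider has started but not finished processing.

    The provider-delete endpoint could refuse (e.g. return 409) while this holds.
    """
    return ReportProcessingStats.objects.filter(
        provider_uuid=provider_uuid,
        processing_started_datetime__isnull=False,
        processing_completed_datetime__isnull=True,
    ).exists()
```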
My solution is to wrap transactions in a savepoint. When an `IntegrityError` occurs, we catch the exception, roll back to the savepoint, and continue processing any other valid transactions.
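In Django, a nested `transaction.atomic()` block creates a savepoint, so the pattern looks roughly like this (a sketch; `records` and the per-record `save()` are placeholders for the real processing steps):

```python
from django.db import IntegrityError, transaction

def process_records(records):
    """Insert each record under its own savepoint so one failure doesn't poison the rest."""
    with transaction.atomic():  # outer transaction
        for record in records:
            try:
                with transaction.atomic():  # nested atomic == savepoint
                    record.save()
            except IntegrityError:
                # The provider's rows disappeared mid-task; roll back just
                # this savepoint and keep going with the remaining records.
                continue
```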
When a provider is deleted, all of its data becomes invalid. An in-progress task to process new report data will inevitably fail at some point when it performs a transaction against data that no longer exists. That's expected. Where it fails isn't really a concern: the entire task is invalid the moment the provider is deleted, so the point of failure isn't an indication of a problem with that portion of the task.
The main thing in this particular race condition that needs to be addressed is ensuring that when the task fails, it fails gracefully. Catching the `IntegrityError` and ensuring a safe rollback path for the transaction meets that requirement.
The second part of solving this problem will need more involved work: the provider deletion code in koku needs to trigger a scan of Celery for running or queued tasks that involve the deleted provider, and kill those tasks.
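A rough sketch of that scan using Celery's inspection API; the assumption that the provider UUID shows up in the task arguments is mine, not something established in this issue:

```python
from celery import current_app

def kill_tasks_for_provider(provider_uuid):
    """Revoke any active or queued Celery tasks that reference the provider."""
    inspect = current_app.control.inspect()
    # active() = currently executing; reserved() = fetched by a worker but not yet running.
    snapshots = [inspect.active() or {}, inspect.reserved() or {}]
    for snapshot in snapshots:
        for worker, tasks in snapshot.items():
            for task in tasks:
                if str(provider_uuid) in str(task.get("args", "")):
                    # terminate=True also kills a task that is already running.
                    current_app.control.revoke(task["id"], terminate=True)
```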
There is a race condition between the masu async worker and provider deletion in koku. Since both services operate on the same database and there is no locking mechanism between the two, it is relatively easy to hit a situation where masu is in the middle of the download/process pipeline and someone deletes the provider in koku.
This has manifested itself with a few different backtraces in the masu worker log.
One of these was hit by waiting for the masu download polling to start and then removing the provider as soon as I saw the activity in the worker log.
Up to this point the impact of this is limited to worker log angst. When it happens, the service continues to operate as normal with the remaining accounts.