dccurtis closed this issue 5 years ago
Looking into this to see if the exception paths are contributing to #315
Couple of options to address this:

1. Add exception handling on all database accessors to gracefully handle the case when the provider is removed while the masu celery tasks are running.
2. Add an `in progress` flag on the provider table. Update koku to block provider removals when `in progress` is True.
3. Change the `Orchestrator` to be some sort of an event loop (i.e. asyncio) which can use a Postgres advisory lock or something similar to lock the provider entry while celery tasks are in progress, so that koku cannot remove the provider until all async operations have finished. (A rough sketch of the advisory-lock idea is below.)

Moving this issue back to the backlog for now since I'd like to have more of a justification before tackling this.
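For reference, the advisory-lock idea in option 3 could look roughly like the following. This is a minimal sketch, not koku's actual code; the direct psycopg2 connection and the assumption that the lock key is a numeric provider id are mine.

```python
import psycopg2  # sketch assumes a raw Postgres connection rather than the Django ORM

def with_provider_lock(conn, provider_id, work):
    """Hold a Postgres advisory lock keyed on the provider while work runs.

    A delete in koku would take the same lock before removing the provider
    row, so it blocks until in-flight masu processing has finished.
    """
    with conn.cursor() as cur:
        # pg_advisory_lock blocks until the lock becomes available.
        cur.execute("SELECT pg_advisory_lock(%s)", (provider_id,))
        try:
            work()
        finally:
            cur.execute("SELECT pg_advisory_unlock(%s)", (provider_id,))
```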
Provider CRUD testing is hitting this a lot and preventing some tests from being run as smoke tests. I found an easy way to block removing providers while processing is underway by looking at the processing statistics; a sketch of the check is below.
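One way that check could be written, assuming a stats table that records per-report start and completion timestamps (the model and field names here are hypothetical, not koku's actual schema):

```python
from django.db import models

class ReportProcessingStats(models.Model):
    """Hypothetical stand-in for the stats table masu writes while processing."""
    provider_uuid = models.UUIDField()
    processing_started_datetime = models.DateTimeField(null=True)
    processing_completed_datetime = models.DateTimeField(null=True)

def provider_processing_in_progress(provider_uuid):
    """True if any report for this provider has started but not finished processing.

    The provider-delete endpoint could refuse (e.g. return 409) while this holds.
    """
    return ReportProcessingStats.objects.filter(
        provider_uuid=provider_uuid,
        processing_started_datetime__isnull=False,
        processing_completed_datetime__isnull=True,
    ).exists()
```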
My solution is to wrap transactions in a savepoint. When an `IntegrityError` occurs, we catch the exception, roll back to the savepoint, and continue processing any other valid transactions.
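In Django, a nested `transaction.atomic()` block creates a savepoint, so the pattern looks roughly like this (a sketch; `records` and the per-record `save()` are placeholders for the real processing steps):

```python
from django.db import IntegrityError, transaction

def process_records(records):
    """Insert each record under its own savepoint so one failure doesn't poison the rest."""
    with transaction.atomic():  # outer transaction
        for record in records:
            try:
                with transaction.atomic():  # nested atomic == savepoint
                    record.save()
            except IntegrityError:
                # The provider's rows disappeared mid-task; roll back just
                # this savepoint and keep going with the remaining records.
                continue
```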
When a provider is deleted, all of its data becomes invalid. An in-progress task to process new report data will inevitably fail at some point when it performs a transaction against data that no longer exists. That's expected. Where it fails isn't really a concern: the entire task is invalid the moment the provider is deleted, so the point of failure isn't an indication of a problem with that portion of the task.
The main thing in this particular race condition that needs to be addressed is ensuring that when the task fails, it fails gracefully. Catching the `IntegrityError` and ensuring a safe rollback path for the transaction meets that requirement.
The second part of solving this problem will need more involved work: the provider deletion code in koku needs to trigger a scan of Celery for running or queued tasks that involve the deleted provider, and kill those tasks.
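A rough sketch of that scan using Celery's inspection API; the assumption that the provider UUID shows up in the task arguments is mine, not something established in this issue:

```python
from celery import current_app

def kill_tasks_for_provider(provider_uuid):
    """Revoke any active or queued Celery tasks that reference the provider."""
    inspect = current_app.control.inspect()
    # active() = currently executing; reserved() = fetched by a worker but not yet running.
    snapshots = [inspect.active() or {}, inspect.reserved() or {}]
    for snapshot in snapshots:
        for worker, tasks in snapshot.items():
            for task in tasks:
                if str(provider_uuid) in str(task.get("args", "")):
                    # terminate=True also kills a task that is already running.
                    current_app.control.revoke(task["id"], terminate=True)
```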
There is a race condition between the masu async worker and provider deletion in koku. Since both services operate on the same database and there is no locking mechanism between the two, it is relatively easy to hit a situation where masu is in the middle of the download/process pipeline and someone deletes the provider in koku.
This has manifested itself with a few different backtraces in the masu worker log.
One of these was hit by waiting for the masu download polling to start and then removing the provider as soon as I saw the activity in the worker log.
Up to this point the impact of this is limited to worker log angst. When it happens, the service continues to operate as normal with the remaining accounts.