project-koku / masu

This is a READ ONLY repo. See https://github.com/project-koku/koku for the current masu implementation.

Investigate potential for race condition created by polling and SNS #69

Closed. chambridge closed this issue 6 years ago.

chambridge commented 6 years ago

Review the potential for a race condition created by polling and SNS both trying to trigger download and processing, so we avoid downloading and processing the same file twice.

Is https://github.com/project-koku/masu/issues/68 enough?


Associated with: https://github.com/project-koku/masu/issues/7
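
To make the concern concrete, here is a minimal sketch (not masu's actual code: the queue, the bookkeeping set, and all helper names are hypothetical) of how a polling path and an SNS path can both queue work for the same report file when the check-then-enqueue step is not atomic across workers:

```python
from queue import Queue

download_queue: Queue = Queue()  # stand-in for the real async task queue
already_queued: set = set()      # stand-in for "have we queued this?" bookkeeping


def enqueue_download(report_key: str) -> None:
    # The membership check and the enqueue are separate steps. If two workers
    # (or threads) run this concurrently against shared state, both can pass
    # the check before either records the key, so the same report is queued,
    # downloaded, and processed twice.
    if report_key not in already_queued:
        already_queued.add(report_key)
        download_queue.put(report_key)


def poll_for_new_reports(report_keys: list) -> None:
    """Polling path: periodically queue any report keys discovered in S3."""
    for key in report_keys:
        enqueue_download(key)


def handle_sns_notification(message: dict) -> None:
    """SNS path: the same S3 change arrives here as well and queues the same key."""
    key = message["Records"][0]["s3"]["object"]["key"]
    enqueue_download(key)
```
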

dccurtis commented 6 years ago

Background

During the development of SNS handling I came across a situation where we were hitting a database integrity error. This was because AWS would notify us of each file that changed (I was getting 3 notifications per event).

Conceptually, this is exactly the same situation as a polling task and an SNS notification colliding, since it would result in queueing two download/process requests for the same file.

For the SNS implementation, the situation was mitigated by handling only top-level manifest file changes. At the end of the day, all we need to know is which S3 bucket changed.
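
A minimal sketch of that kind of filtering, assuming the SNS message body wraps a standard S3 event notification; the manifest-name check and the function name are illustrative, not masu's exact rule:

```python
import json


def buckets_with_manifest_changes(sns_message_body: str) -> set:
    """Return the S3 buckets whose top-level manifest changed, ignoring the
    per-file notifications that arrive for every other object in the report."""
    s3_event = json.loads(sns_message_body)
    buckets = set()
    for record in s3_event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Only react to the top-level manifest; every other object change is
        # redundant because the manifest tells us which bucket/report changed.
        if key.endswith("Manifest.json") and key.count("/") <= 1:
            buckets.add(bucket)
    return buckets
```
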

One other factor has changed since the initial integrity error sightings: Masu now works asynchronously, which is a significant change from how things were working during the initial SNS development.

Fast forward to today. I wanted to see if I could reproduce the same integrity errors now that Masu is working asynchronously.

Findings

  1. I was able to reproduce the problem once. The failure was not a hard failure, in the sense that I did not have to clear the database to recover.
  2. I tried as many torture-test situations as I could, such as not deleting underlying files each iteration, deleting files each iteration, populating new S3 bucket contents in rapid succession, and populating new S3 bucket contents while processing was still underway. The outcome of all of this is that I could not reproduce the problem again.

In summary, there is a timing window that will lead to a database integrity error. With the async changes the situation is considerably better. My artificial setup is far more extreme than the production system would ever be (I was not filtering events, and the S3 bucket updates were happening at intervals that AWS simply doesn't do today).

Next Steps

I'm going to open an issue on the integrity error so we can keep an eye on it. If it turns out to be more of a problem down the road, we can prioritize its fix. To add to that priority calculation, SNS is also completely optional and not required for the prototype. So my guess is that the earliest we would address this is in the post-prototype timeframe.

adberglund commented 6 years ago

A quick explanation of what I see as the race condition problem, using products as an example.

For every line item in a cost usage report, we pull the referenced product.

When a batch of line items has been processed, we commit the products and all other relevant tables to the database.

The race condition being seen here could come about from the following scenario:

Task 1 starts processing a file. It sees new products that are not yet in the database and puts them on the session.

Task 2 starts processing a file. It sees the same new products that are not yet in the database and puts them on the session.

Task 1 commits to the database.

Task 2 attempts to commit to the database, but it tries to insert the products that were just inserted by task 1, and we get the integrity error in the traceback.
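
Here is a small, self-contained sketch of that sequence, using SQLite and SQLAlchemy in place of the real models; the Product table and its unique sku column are stand-ins, not masu's actual schema:

```python
import os
import tempfile

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.exc import IntegrityError
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Product(Base):
    """Stand-in for the real product table; sku must be unique."""
    __tablename__ = "product"
    id = Column(Integer, primary_key=True)
    sku = Column(String, unique=True, nullable=False)


tmpdir = tempfile.TemporaryDirectory()
engine = create_engine(f"sqlite:///{os.path.join(tmpdir.name, 'race_demo.db')}")
Base.metadata.create_all(engine)

task1 = Session(engine)  # task 1, processing one report file
task2 = Session(engine)  # task 2, processing a file that references the same product

# Both tasks look up the product, see that it is not in the database yet,
# and stage an insert on their own session.
for session in (task1, task2):
    if session.query(Product).filter_by(sku="AmazonEC2").first() is None:
        session.add(Product(sku="AmazonEC2"))

task1.commit()  # task 1 wins the race

try:
    task2.commit()  # task 2 inserts the same sku and violates the unique constraint
except IntegrityError:
    print("task 2 hit the integrity error described above")
    task2.rollback()
```

Running this prints the message in the except block: the second commit raises IntegrityError, mirroring the traceback described above.
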

dccurtis commented 6 years ago

@chambridge I think this one can be closed.