pypi / warehouse

The Python Package Index
https://pypi.org
Apache License 2.0

Prevent Race Conditions by Utilizing Transaction Isolation #405

Closed dstufft closed 9 years ago

dstufft commented 9 years ago

We need to ensure that we don't get non-repeatable reads or phantom reads on some of the API endpoints. In particular, for anything that Bandersnatch is going to use that has a serial number, we need to make sure that nothing changes between the query that gets the serial number and the query (or queries) that get the actual data.

dstufft commented 9 years ago

Ok, so I've been experimenting with this and I've come across a few problems.

These can be worked around. The first case can be handled by doing:

from sqlalchemy import event

# `engine` is the application's existing SQLAlchemy Engine.
engine = engine.execution_options(isolation_level="SERIALIZABLE")

@event.listens_for(engine, "begin")
def _set_read_only(conn):
    # Runs at the start of every transaction, hence the extra round trip.
    conn.execute("SET TRANSACTION READ ONLY DEFERRABLE")

This has the downside that it requires an extra query in order to actually enable the read-only option. I've verified that this works to prevent writes in the transaction; however, I'm unsure whether the DEFERRABLE attribute will take effect in this scenario. If it doesn't, it's not the end of the world, but it does mean PostgreSQL cannot optimize away maintaining locks while the transaction is open.
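One way to check whether the DEFERRABLE attribute actually takes effect (a sketch, reusing the engine configured above): PostgreSQL exposes the current transaction's modes as settings that can be read back from inside the transaction.

with engine.connect() as conn:
    # The "begin" listener fires on the implicit BEGIN, so the SET has
    # already run by the time these execute. Both should print "on".
    print(conn.execute("SHOW transaction_read_only").scalar())
    print(conn.execute("SHOW transaction_deferrable").scalar())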

In the second case, this can be worked around by having a read-only decorator that does something like:

def read_only(view):
    @functools.wraps(view)
    def wrapped(context, request):
        # Stash the original session
        original_session = request.db

        # We need to create a new session for this request, this session must
        # be read-only.
        session = _create_session(request, read_only=True)

        # Now we want to merge the context object (if we have one) with the
        # new session.
        if (context is not None and
                isinstance(context, ModelBase) and
                _Session.object_session(context) is original_session):
            context = session.merge(context)

        # Now that we've gotten a new session we want to assign it to
        # request.db so that the view will use it instead of the original
        # session.
        request.db = session

        return view(context, request)

    return wrapped
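For illustration, applying it would look something like this (a sketch; the view name is hypothetical, and JournalEntry is the application's journal model):

from sqlalchemy import func

@read_only
def simple_detail(project, request):  # hypothetical view
    # request.db is now the read-only SERIALIZABLE DEFERRABLE session the
    # decorator created, and `project` has been merged into it.
    serial = (
        request.db.query(func.max(JournalEntry.id))
                  .filter(JournalEntry.name == project.name)
                  .scalar()
    )
    return {"serial": serial}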

This works as far as I can tell; however, I'm a bit concerned about this solution to the problem:

Thinking about this more, I'm wondering if maybe we just want to use SERIALIZABLE for PyPI always. This isolation level makes PyPI function as if the queries were executed serially rather than concurrently. The major downside is that it makes it more likely that, of two concurrent transactions, one will fail, and we'll need code to handle that fact and retry the transaction. The major upside is that we eliminate any class of bug that stems from non-repeatable reads and phantom reads. These bugs are not normally a big deal; however, they already caused problems on PyPI legacy, which is what led me to implement the ability to set a particular transaction to SERIALIZABLE READ ONLY DEFERRABLE.

The biggest downside to SERIALIZABLE is the fact that your application needs to be prepared to retry transactions. Luckily, since we're using pyramid_tm, we get this "sort of" for free: it can be configured with a set number of attempts (defaulting to just 1), so that if the transaction fails it'll simply re-execute the view and try to commit the transaction again. This can cause problems if the view has side effects that aren't tied into the transaction system, but ideally we tie everything we can into the transaction system.
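For example, the attempt count can be raised at configuration time (a sketch; it assumes pyramid_tm's settings-based configuration, where "tm.attempts" is the relevant key):

from pyramid.config import Configurator

def main(global_config, **settings):
    # On a transient failure (e.g. a serialization error) pyramid_tm will
    # re-invoke the view and retry the commit, up to this many attempts.
    settings.setdefault("tm.attempts", "3")
    config = Configurator(settings=settings)
    config.include("pyramid_tm")
    return config.make_wsgi_app()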

In addition, I believe the risks themselves should be fairly small, because PyPI's read:write ratio is heavily skewed towards reads. In particular, on an example day in February there were only 560 total new releases on PyPI in the entire day. Obviously that's not the entirety of the write operations on PyPI, but in general I think the chances of conflict are fairly small, because it's unlikely that two transactions will be trying to modify the same row at the same time.

I'm not the greatest expert at databases here though; there might be some downside to SERIALIZABLE that I'm not aware of.

dstufft commented 9 years ago

To be clear, the fundamental problem we have here is that we have views that (essentially) look like this:

# This part is actually outside of the view, and we can't easily change it in a view-specific way.
project = sqla_session.query(Project).filter(Project.name == name).one()
# This part *may* be outside of the view, depending on which view it is.
release = sqla_session.query(Release).filter((Release.project == project) & (Release.version == version)).one()
# This part is absolutely inside of our view.
serial = (
    sqla_session.query(func.max(JournalEntry.id))
                .filter(JournalEntry.name == project.name)  # journals are keyed by project name
                .scalar()
)

In between any of these three queries the state of the database can change, so we can get inconsistent results. This is most noticeable with the serial value: if something happens between fetching it and running the other queries, we can get a serial number that doesn't match everything else, and we have systems which depend on that serial matching the current state of the data. Changing the order doesn't help; all that does is mean that instead of getting a newer serial than we expected, we get an older one.

dstufft commented 9 years ago

Ok, I now have two different solutions:

Solution 1: Read Only Route Predicate

This is the solution that is implemented in #437. It essentially boils down to being able to mark a particular route as "read only" (a sketch of the mechanism follows the lists below).

Upsides:

- Only the routes that actually need a consistent snapshot pay for it; everything else keeps the default READ COMMITTED behaviour.
- The marked transactions run SERIALIZABLE READ ONLY DEFERRABLE, so PostgreSQL can avoid maintaining the locks a normal SERIALIZABLE transaction needs, and a read-only deferrable transaction can never hit a serialization failure.

Downsides:

- Each read-only request needs a second session created for it, with the context object merged into that session.
- Every route that needs consistency has to be explicitly marked, so it's easy to miss one and silently keep the race.
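For a rough idea of the mechanism, here is a sketch (not the actual #437 code; Pyramid's add_route_predicate API is real, but the predicate class and route here are illustrative):

from pyramid.config import Configurator

class ReadOnlyPredicate:
    """Marker predicate: the session factory can inspect
    request.matched_route and hand back the READ ONLY DEFERRABLE
    session instead of the default one."""

    def __init__(self, val, config):
        self.val = val

    def text(self):
        return "read_only = {!r}".format(self.val)

    phash = text

    def __call__(self, info, request):
        # Purely a marker; it never causes the route not to match.
        return True

config = Configurator()
config.add_route_predicate("read_only", ReadOnlyPredicate)
config.add_route("legacy.api.simple.index", "/simple/", read_only=True)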

Solution 2: SERIALIZABLE all the things

This is the solution implemented in #436. It essentially boils down to using the SERIALIZABLE isolation level across the board (a configuration sketch follows the lists below).

Upsides:

- Every transaction sees a consistent snapshot of the database, which eliminates the whole class of non-repeatable-read and phantom-read bugs everywhere, not just on specially marked routes.
- No per-route bookkeeping; it's impossible to forget to mark a view.

Downsides:

- Concurrent transactions that touch the same rows can fail with a serialization error, so the application (via pyramid_tm) has to be prepared to retry.
- Views with side effects that aren't tied into the transaction system can misbehave when a retry re-executes them.
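For comparison, solution 2 is mostly a one-line change where the engine is created (a sketch; the database URL is a placeholder):

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://localhost/warehouse",  # placeholder URL
    isolation_level="SERIALIZABLE",      # every transaction, every view
)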

cjw296 commented 9 years ago

Question about the "fundamental problem" you describe in the three parts: how come these won't all be in the same transaction?

dstufft commented 9 years ago

They will be in the same transaction, but the default isolation level in PostgreSQL is READ COMMITTED. That means that each individual query gets its own consistent snapshot of the database, which includes everything that's been committed plus everything that's changed in this particular transaction. So if another transaction changes the state of the database between the execution of these queries and commits, then our transaction will see those changes, since it's READ COMMITTED.

One of the solutions here is to switch the default isolation level to SERIALIZABLE, which is the strictest isolation mode. It essentially means that the database takes a snapshot at the beginning of the transaction and won't update what data you can see inside that transaction when other transactions commit. This means that if two concurrent transactions touch the same rows, one of them is going to raise an error.
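To make that failure mode concrete, here's a sketch (assuming psycopg2 2.8+ for the errors module, and a scratch table created with CREATE TABLE counters (id int PRIMARY KEY, value int)):

import psycopg2

a = psycopg2.connect("dbname=scratch")
b = psycopg2.connect("dbname=scratch")
for conn in (a, b):
    conn.set_session(isolation_level="SERIALIZABLE")

with a.cursor() as cur:
    cur.execute("UPDATE counters SET value = value + 1 WHERE id = 1")

with b.cursor() as cur:
    # b's snapshot is taken here, before a commits.
    cur.execute("SELECT value FROM counters WHERE id = 1")

a.commit()

try:
    with b.cursor() as cur:
        # The row b is about to update changed after b's snapshot:
        # "could not serialize access due to concurrent update".
        cur.execute("UPDATE counters SET value = value + 1 WHERE id = 1")
    b.commit()
except psycopg2.errors.SerializationFailure:
    b.rollback()  # SQLSTATE 40001; the application must retry

With pyramid_tm configured as above, that retry happens by re-running the view.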

dstufft commented 9 years ago

Talked this over with @r1chardj0n3s, we've decided to go with #436 since that'll ensure every transaction has a consistent snapshot of the database and we'll consider any view that can't handle pyramid_tm doing a retry as a bug.

cjw296 commented 9 years ago

Would be fascinated to know under what conditions READ COMMITTED won't be enough...

dstufft commented 9 years ago

We have a data model where we essentially have one table that we use as an append-only log of all of the meaningful actions that have been taken against the database, and we use the ID column of that table as a sort of "this is the version of the data we are currently at". We have mirrors that replicate the PyPI data, and they use that ID number so they know, given the list of changes from the audit log, which changes they've applied and which they haven't.

So the basic problem is this:

Transaction 1             | Transaction 2             | audit_log version
--------------------------|---------------------------|------------------
SELECT ... FROM packages  |                           | 1
                          | INSERT INTO audit_log ... | 2
SELECT ... FROM audit_log |                           | 2

So you can see, we do two queries in transaction 1: in the first we select some data from the database, and in the second we get what "version" the data is at by selecting from the audit_log. However, in between these two queries something has INSERTed an extra row into the audit_log, so while the audit_log version was 1 when we selected the data, we'll see it as 2 when we select the version.

This causes problems: if we select from the audit_log after selecting the data, the mirroring system thinks it has processed audit log entries which it actually hasn't. But if we select from the audit log first, the mirroring system will think it hasn't processed some audit log entries even though it actually has.
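To illustrate why, here's a hypothetical sketch of the mirror's bookkeeping (the function and attribute names are illustrative, not Bandersnatch's actual API):

def sync(pypi, state):
    # Ask for everything that happened after the last serial we recorded.
    for project, serial in pypi.changelog_since(state.last_serial):
        data = pypi.project_data(project)
        # If `data` was generated at a newer serial than `serial`, recording
        # `serial` as done will skip the newer changes on the next sync; if
        # it was generated at an older one, work gets repeated. Either way,
        # the bookkeeping only works if the data and serial are consistent.
        state.record(project, serial, data)
    state.last_serial = pypi.current_serial()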

cjw296 commented 9 years ago

Did you have a look at SELECT FOR UPDATE? Either way, for another set of very valuable eyes on this, I'd suggest asking on the sqlalchemy mailing list; @zzzeek will almost certainly have words of wisdom :-)
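For reference, SELECT ... FOR UPDATE is exposed in SQLAlchemy as with_for_update() (a sketch; it row-locks the selected rows until commit, which serializes writers, but it doesn't give the rest of the transaction a consistent snapshot, and it can't be used in a READ ONLY transaction):

project = (
    session.query(Project)
           .filter(Project.name == name)
           .with_for_update()
           .one()
)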