Random Tablet Selection causes "time inconsistent" data reads

arthurschreiber commented 1 year ago

vtgate performs random tablet selection (with preference for local cell tablets) when deciding which tablet to use for serving a query. On larger MySQL setups, where there might be a high variance in replication lag between replica nodes, this behaviour can lead to data inconsistencies that is hard to understand and work around.

Here's an example:

Let's suppose we have 2 tables, check_suites and check_runs.
A check_suites row has many associated check_runs rows.
A row in check_runs can only be created after creating the parent check_suites row.
Data is never deleted from check_suites.

Based on this description, an application that runs against a "regular" MySQL primary/replica setup, where connections are load balanced using e.g. haproxy can rely on the fact that it can read from check_runs, and every row will have a reference to a check_suites row that is guaranteed to exist.

The application will have a "time consistent" view of the data, where a read query will always see data at the same point of a previous query, or a later point in time.

When sending requests through vtgate, that expectation is broken if queries are distributed across more than one replica. 😔

The first query that is reading from check_runs might hit replica A, which has almost no replication lag. The second query, reading from check_suites, might hit replica B, which has over a second of replication lag. This can lead to the second query not being able to find any matching row in the check_suites table, which in turn can lead to errors or other weird behaviour in our application.

The application no longer has a "time consistent" view of the data, instead read queries see data at newer and older points in time.

deepthi commented 1 year ago

Have you explored using transactional reads? We do support that.

arthurschreiber commented 1 year ago

I'm not sure this would solve the problem we're facing in a satisfactory way. I might be wrong on this, so here's what I think the problems with transaction reads would be.

First, it's unclear what parts of our codebase need to be wrapped in transactions, and with a codebase as large as ours, it'd be a herculean effort to figure that out.

We could wrap every web request that comes in with transaction, but that will lead to gravely reducing the multiplexing done on the vttablet. We'd hit the vttablet connection limits that we've set very quickly, as connections would be "locked" for the duration of that web request. This would be made even worse with scatter or other multi shard queries.

It's also hard for me to estimate what the change in semantics would mean for our application, and which isolation level would be the correct one to use.

Overall, it feels like a much heavier solution to this problem than what we would require. 🤔

vitessio / vitess

Random Tablet Selection causes "time inconsistent" data reads #11971