harshit-gangal commented 4 years ago

Background

There is a common use case from our users to have the ability to make sure that reads happen on a replica that received updates up to a specific point i.e. execute a query after it has applied a certain value and never sees an old value thereafter. This is a special case of the causal consistency model.

Feature

User executes DML on primary and then reads the same consistent data from any one of the replicas without overloading primary with these kinds of use cases.

Solution

MySQL has an option to return GTIDs for each update in the OK packet. Vitess can return the same GTID back to the client. The client then can send this GTID from any session and Vitess will store this information in session. When the client sends the read queries, wait for gtid (feature of mysql) will be executed along with the query on a replica.

Usage

Client needs to set Capability flag _CLIENT_SESSIONTRACK when connecting to VTGate via mysql protocol. This will enable VTGate to send the tracking information back to the client.
Client also needs to issue _SET SESSION_TRACK_GTIDS = OWNGTID to indicate VTGate and underlying mysql to return GTID in the OK packet. This system variable tracks the last DML and DDL commit GTID.
When the client issues _SET @@read_after_write_gtid = \, @@read_after_write_timeout = \_. VTGate will store this information in the session.
Following, the client sends read queries, VTGate will tag a replica to this session and issue _WAIT_FOR_EXECUTED_GTID_SET(gtidset[, timeout]) on the VTTablet along with the query.
Client needs to issue # 3 to refresh the session with a new GTID and replica or an empty string to reset it.

1 and 2 to be executed on a primary connection. 3, 4 and 5 to be executed on a replica connection.

Internals

Vitess will not just pass on the GTID from mysql. It needs to add information about which shard and keyspace this GTID belongs to. This will be called VTGTID and hence will not be usable outside of scope of Vitess.

Replica selection is based on random selection from available healthy replica tablets i.e. those who's replication lag is less than defined unhealthy_threshold on vttablet.

Limitation

If the wait_for_executed_gtid_set times out, should the query be sent to primary or it still just get executed on that replica? ~Some users would want the query to be executed on primary and some would need the query still to be executed on replica. As this timeout would happen mostly when the system is under duress, it would be better to still continue to execute the query on the replica itself and return the state data to avoid any effect of executing the query on primary.~ Initially, Vitess will fail the query on timeout returning an appropriate error code and error message. Later we can expose a client defined policy for handling timeouts.

Task Breakdown

More tasks to be added.

[x] Read/Write OK Packet with session tracking information. #6783
[x] Add SET management for SESSION_TRACK_GTIDS, read_after_write_gtid and read_after_write_timeout #6871
[ ] Add VTGATE logic so that a connection gets "sticky" replica connections per shard/keyspace (lost replica?)
[ ] Add plan support for WAIT_FOR_EXECUTED_GTID_SET both on vtgate and if necessary to the tablet as well (timeouts)
[ ] Add VTTablet flag for enabling CLIENT_SESSION_TRACK
[ ] Document the feature

References

MySQL worklog items: #W6128, and #W6972
MySQL server test file
ProxySQL blog post

Open question:

Connection pools Today, vttablet keeps two connection pools for the user - one for normal connections and one for connetions that have the found_rows connection setting enabled. The question is - should we now have four different pools? One for normal connections, one for found_rows enables connections, one for session_track_gtids enabled connections and for both both? GTID tracking is required only for DML or DDL statements, so it is needed to be enabled for TxEngine pool connections. Every client does not need this feature so we will put this behind a flag on vttablet.

Call for feedback

We're looking for community's feedback on the above suggestions/flow. Thank you for taking the time to read and respond!

systay commented 4 years ago

Future Work

Causal consistency - More Options

Option 1:

Add this information as a query hint:

SELECT /*vt+ READ_AFTER_WRITE_GTID=<GTID>, READ_AFTER_WRITE_TIMEOUT=<timeout_in_seconds> */ col1, col2 …

In this case, replica will not be tagged to a session. Each query will be required to provide this information to achieve causal consistency.

Option 2:

Use WAIT_FOR_EXECUTED_GTID_SET(gtid_set[, timeout])

SELECT WAIT_FOR_EXECUTED_GTID_SET(gtid_set[, timeout])

Client will send the VTGTID received in the OK packet. This will be executed by VTGate and the replica will be tagged to the session. Any query on this session will use the tagged replica.

Vitess managed Read My Writes

Client can indicate Vitess to manage Read My Writes (RMW) on their behalf by executing

USE @rmw

GTIDs will be stored and updated in the Vitess session.
SELECT queries outside of the transaction will go for RMW through tagged replica, other queries will go to primary.

Tracking GTIDs via Healthcheck

VTGate would be able to pick the replica based on the information held about the last GTID executed on each of them.

dweitzman commented 4 years ago

As this timeout would happen mostly when the system is under duress, it would be better to still continue to execute the query on the replica itself and return the state data to avoid any effect of executing the query on master.

Seems like another useful option would be to return some kind of read timeout error instead of potentially overloading master or returning stale data.

A specific scenario I have in mind is a world where vgtids from vstreams might be used by pipelines to run queries with joins that need data at least as recent as the observed vstream rows. Those sorts of pipelines wouldn't want stale data and may prefer to keep their traffic on rdonly, just backing off temporarily when there's a retryable error because data was stale.

pH14 commented 4 years ago

This is very exciting work!

A big +1 to everything @dweitzman suggested. I could imagine us making heavy use of this feature as well, specifically in the scenario of tailing the VStream and reading at-least-as-recent rows for each update that comes through.

If the wait_for_executed_gtid_set times out, should the query be sent to master or it still just get executed on that replica?

At least speaking for our apps, we'd virtually always want to time out and retry later if the desired data is unavailable.

Today, vttablet keeps two connection pools for the user - one for normal connections and one for connetions that have the found_rows connection setting enabled. The question is - should we now have four different pools? One for normal connections, one for found_rows enables connections, one for session_track_gtids enabled connections and for both both?

Hm, I'm not super familiar with what MySQL is doing under the hood to make this happen, but could it be always enabled (or disabled)? Then usage of the feature can be toggled at the VTGate level. Not sure if there's particular downside with that approach.

dbussink commented 4 years ago

3. When the client issues _SET @@vitess_metadata.read_my_write = , @@vitess_metadata.read_my_write_timeout = _. VTGate will store this information in the session.

Is it possible to have automatic fallback to the primary with a very short timeout? At least for us the kind of pattern we're interested in being able to build would be:

Run query with a very short timeout replica check timeout (more like 100ms) against Vitess.
In case it fails, fall back to the primary
Our application itself can handle the logic to not overload the primary here, because after say 5 seconds since the last GTID write, we always default to the replicas anyway and accept stale reads in that case.

I'm not entirely sure this can be built yet with the proposed setup? Maybe an additional READ_MY_WRITE_FALLBACK option could indicate what to do when the timeout expires? To read from a replica or from the primary? I think with that we can build the above mentioned strategy.

shlomi-noach commented 4 years ago

If the wait_for_executed_gtid_set times out, should the query be sent to master or it still just get executed on that replica?

There is a third option, which is the ideal answer for the user: pick another replica that does have that GTID. How smart is vitess oging to be in choosing which replica to read from? If there's three REPLICA tablets, will it rotate round-robin? Least recently used? Last used? Least lagging? Is there a heuristic that will give best wait latency?

systay commented 4 years ago

@dbussink wrote:

Is it possible to have automatic fallback to the primary with a very short timeout? At least for us the kind of pattern we're interested in being able to build would be:

Run query with a very short timeout replica check timeout (more like 100ms) against Vitess.

In case it fails, fall back to the primary

In these use cases, we are still thinking that the user has to tell us if she wants to hit the primary or the replica, and we would not automatically choose it. Would it be possible to fail here and allow the application to switch to the primary and re-read? We have ideas for future work where vtgate could chose for the user depending on the type of query, but we wanted to save that for future work and not bake it into this first version.

dbussink commented 4 years ago

In these use cases, we are still thinking that the user has to tell us if she wants to hit the primary or the replica, and we would not automatically choose it.

@systay As long as it's clear in the query somehow why it failed and that it was specifically due to a lack of a GTID, it would be totally workable yeah.

There is a third option, which is the ideal answer for the user: pick another replica that does have that GTID.

This would be great. If Vitess already tracks per replica at which GTID it is, doesn't this become relatively simply? If Vitess doesn't already have that information / it is hard to add, this becomes much more complex of course to build.

shlomi-noach commented 4 years ago

If Vitess already tracks per replica at which GTID it is, doesn't this become relatively simply?

@dbussink Right now it does not... As you suggest this will be more complex and in the longer run.

demmer commented 4 years ago

Meta-comment: this is exciting and I'm glad the community is looking into this. I have some thoughts both on the specifics and some edge cases we've thought of over the years at Slack...

demmer commented 4 years ago

Today, vttablet keeps two connection pools for the user - one for normal connections and one for connetions that have the found_rows connection setting enabled. The question is - should we now have four different pools? One for normal connections, one for found_rows enables connections, one for session_track_gtids enabled connections and for both both?

IIRC there are actually three separate pools used for application queries -- the two you mentioned above for reads, plus a separate tx pool for writes. It seems to me that we could always have the primary vttablet set the CLIENT_SESSION_TRACK capability in the tx pool so that for steps 1 & 2 you don't need to do anything special, right?

demmer commented 4 years ago

Vitess will not just pass on the GTID from mysql. It needs to add information about which shard and keyspace this GTID belongs to. This will be called VTGTID and hence will not be usable outside of scope of Vitess.

This actually gets at one of the thornier problems in this whole area:

Transactions may span multiple shards, potentially spanning multiple keyspaces. And in each there may have been multiple primaries in the past. So this VTGTID would need to contain all the GTID sets from every primary affected by the transaction.
The topology could change (split / merge / table migration / etc) in between the write and the read, which would invalidate any routing information embedded in the VGTID.
This GTID set is potentially really large, so we should consider the performance overhead of passing it around everywhere.

FWIW I think there's still plenty of value in this feature even if we don't handle these corner cases, but it's something to consider.

dweitzman commented 4 years ago

The topology could change (split / merge / table migration / etc) in between the write and the read, which would invalidate any routing information embedded in the VGTID.

I think at some point a shard migration ledger was added in the shard metadata table with a history of what GTIDs were split/merged into which new GTIDs. The information from that shard split ledger might be enough correctly handle shard splits and merges, or at least answer the question of whether a particular gtid is a pre-split gtid or a gtid that's never been seen before.

shlomi-noach commented 4 years ago

Does it make sense to have some hiccup at the time of resharding cut-over such that more load is put on the masters?

zmagg commented 4 years ago

This is super exciting!

Can the timeout error be different for: WAIT_FOR_EXECUTED_GTID_SET(gtid_set[, timeout]) than it would be for a timeout for other reasons? Having that disambiguation will be useful for client retry logic.

+1 to some other comments:

It's critical that whatever we choose works for resharding + reparenting cases.
The GTID sets are gigantic. Since we're using a prefix that's unique to Vitess anyway (vGTID), could we contract what we expect the client to pass around? Is just the very latest GTID sufficient information?

e.g. GTID set on one of our shards pulled today:

08507406-cde2-11ea-9509-0604580efd2b:1-13830428, 0b6ae794-265a-11e9-ab7f-0a4ea4782e92:1-93, 176407eb-702c-11e9-adf5-0649e1c07ece:1-161569636, 3303b178-8d67-11e9-bd0a-12732d230266:1-45830006, 3716f10d-265a-11e9-bc4a-0e58623c081c:1-15628, 697596b3-265a-11e9-8130-1279d9f84bd6:1-4169188, 72a33a8d-702c-11e9-bccd-128e89b8f56c:1-112539512, 78012db4-702c-11e9-a8e2-0ec7ad937616:1-236827946, b5b1af5f-8b94-11ea-8830-128fcd088aef:1-76276630, c5e75ad3-7df7-11e9-98f9-0ac460790b74:1-130093732, d33cba0e-cde1-11ea-b453-0e424a76630d:1-29851090, d3aba06c-cde1-11ea-ac8b-12b47adb170d:1-151695467, d93745d3-2659-11e9-83e5-06355879e0a8:1-76187084

Would it be possible for the client to just pass in shard/keyspace/d93745d3-2659-11e9-83e5-06355879e0a8:1-76187084 and have Vitess handle hydrating the rest of the GTID set?

zmagg commented 4 years ago

Do you have any plans for how the vtgate will choose the replica it sends the query to? Will it be different than the existing replica choosing logic and will it take into consideration replication lag on the tablets in the shard?

zmagg commented 4 years ago

If the wait_for_executed_gtid_set times out, should the query be sent to master or it still just get executed on that replica?

It would be awesome if we could configure and choose this on a per query level, as a query comment, then we could change this behavior live depending on how the system is performing and the query in question.

shlomi-noach commented 4 years ago

The GTID sets are gigantic.

@zmagg the returned GTID_OWN is only the GTID entry that was generated, hence it's just the master's UUID+number, e.g. d93745d3-2659-11e9-83e5-06355879e0a8:76187084. The client will then provide vitess with this single-valued GTID set, and there is no need to involve the entire bloated executed_gtid.

Disclaimer: I'm not the one to develop the feature, so I'm just proposing my thoughts here.

harshit-gangal commented 4 years ago

I have updated more details in Internals and Limitation Section. Also, how we will enable CLIENT_SESSION_TRACK at connection level with MySQL.

harshit-gangal commented 4 years ago

This actually gets at one of the thornier problems in this whole area:

Transactions may span multiple shards, potentially spanning multiple keyspaces. And in each there may have been multiple primaries in the past. So this VTGTID would need to contain all the GTID sets from every primary affected by the transaction.

Yes, VTGID will contain GTIDs from all shards on which the transaction was commited.

The topology could change (split / merge / table migration / etc) in between the write and the read, which would invalidate any routing information embedded in the VGTID.

Yes, As a first version we are thinking of failing the query if the shard tagged to a GTID is not available. We can iterate over it with a better implementation.

This GTID set is potentially really large, so we should consider the performance overhead of passing it around everywhere.

VTGate will collate all the GTIDs and generate VTGTID. This will be sent to client and vice-versa on doing casual consistency read. Vttablet only gets one GTID to be send. The current interaction will be between client and vtgate. This makes writes and reads not dependent on same session to be used which seems like a common behaviour to have a write pool and a read pool.

harshit-gangal commented 4 years ago

@dbussink

Run query with a very short timeout replica check timeout (more like 100ms) against Vitess.

We will make READ_MY_WRITE_TIMEOUT to accept float so 0.1 will also be acceptable. Keeping the units as seconds as what MySQL interprets.

systay commented 4 years ago

Quick update: instead of "read my writes", we are calling this "read after write". The system variables used will be updated to reflect this

dbussink commented 4 years ago

We will make READ_MY_WRITE_TIMEOUT to accept float so 0.1 will also be acceptable. Keeping the units as seconds as what MySQL interprets.

Thanks! Fwiw, I did also have to fix this in MySQL itself for WAIT_FOR_EXECUTED_GTID_SET: https://github.com/mysql/mysql-server/pull/230 :smile:.

matt-ullmer commented 5 months ago

What's the status of this RFC?

vitessio / vitess

RFC: Read After Write #6843

Background

Feature

Solution

Usage

Internals

Limitation

Task Breakdown

References

Call for feedback

Future Work

Causal consistency - More Options

Option 1:

Option 2:

Vitess managed Read My Writes

Tracking GTIDs via Healthcheck