github / gh-ost

GitHub's Online Schema-migration Tool for MySQL
MIT License
12.42k stars 1.26k forks source link

Add `--transaction-isolation` flag #1441

Open timvaillancourt opened 3 months ago

timvaillancourt commented 3 months ago

A Pull Request should be associated with an Issue.

Related issue: https://github.com/github/gh-ost/issues/1262

Further notes in https://github.com/github/gh-ost/blob/master/.github/CONTRIBUTING.md Thank you! We are open to PRs, but please understand if for technical reasons we are unable to accept each and any PR

Description

This PR resolves https://github.com/github/gh-ost/issues/1262 by adding a --transaction-isolation flag that supports both REPEATABLE-READ (default - what GitHub tests) and READ-COMMITTED

In case this PR introduced Go code changes:

arthurschreiber commented 1 month ago

Do we really need this flag? I'm absolutely sure that mysql replication uses READ_COMMITTED when applying RBR changes from the binlog. As gh-ost does not support statement based replication, there's no point in using REPEATABLE_READ for the changelog applier and we should be able to use READ_COMMITTED always.

For the table copy part, I don't see a reason how READ_COMMITTED would have any negative side-effects either. Right now, REPEATABLE_READ might copy an "old" version of the row data, but the changelog applier will fix that up afterwards.

With READ_COMMITTED, we'll always read the "latest" version of the data, so in theory there could be less changes that need to be applied by the changelog applier, but I don't see any negative sides to this. 🤔

timvaillancourt commented 1 month ago

@arthurschreiber I pondered using READ_COMMITTED 100% in an earlier thread when setting transaction isolation was introduced, but there was some concerns around using a new isolation level. Allowing users that previously were using READ_COMMITTED before the enforcement was introduced seemed like the easiest way forward

But for the most part I agree, REPEATABLE_READ isolation is usually not required and can actually introduce stale results as you mention. There is one spot where a snapshot read might have a benefit, however: the calculation of the min/max chunk ranges. A snapshot isolation guarantees both the min and max query are operating on the same data - which sounds like a good thing but I'm not 100% sure