MyonKeminta commented 3 years ago

As a solution to the schema version check issue (https://github.com/tikv/sig-transaction/issues/51), we added a max_commit_ts limit to async commit's prewrite requests. When the calculated min_commit_ts exceeds max_commit_ts, a CommitTsTooLarge error is returned. We need to find a proper way to handle the CommitTsTooLarge error. Otherwise, when the load is high, the failure rate of async commit might be significant.

Solution 1:

When TiDB receives CommitTsTooLarge error, check the schema version again.

If the schema version has not changed, update max_commit_ts and retry (there is no need to retry keys that were already prewritten successfully).
Otherwise, if the transaction is amended, update the primary lock first to update the secondary list, then continue prewriting the remaining keys. Note that the primary must be updated before all keys are prewritten to guarantee consistency.
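
A minimal sketch of the decision in Solution 1, assuming hypothetical names throughout (this is not TiDB code). The steps are plain strings so the ordering is easy to inspect. The branch for a changed schema that could not be amended is not covered above, so aborting there is only an assumption.

```python
def plan_after_commit_ts_too_large(schema_changed, amended, remaining_keys):
    """Return the ordered steps TiDB would take under Solution 1.

    All names are hypothetical; remaining_keys are the keys that have
    not been prewritten successfully yet.
    """
    prewrites = [f"prewrite {key}" for key in remaining_keys]
    if not schema_changed:
        # Same schema version: raise the limit and retry only the keys
        # that were not prewritten successfully.
        return ["update max_commit_ts"] + prewrites
    if amended:
        # The primary lock carries the secondary list, so it is updated
        # before any remaining key is prewritten.
        return ["update primary lock"] + prewrites
    # Assumption: the discussion does not cover this case; abort is a guess.
    return ["abort"]
```

For example, `plan_after_commit_ts_too_large(True, True, ["k2"])` puts the primary lock update ahead of the prewrite of `k2`.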

Solution 2:

In solution 1, if the load is high enough, the retry is still likely to fail. Another choice is to fall back to a non-async-commit transaction when a CommitTsTooLarge error occurs. This might be more complicated to implement than solution 1. If we always rewrite the primary lock to a non-async-commit lock first when falling back, the implementation might be easier. We should confirm the correctness first before adopting this approach.

MyonKeminta commented 3 years ago

cc @coocood @sticnarf @cfzjywxk

coocood commented 3 years ago

Solution 2 looks safer.

sticnarf commented 3 years ago

Implementation details

TiKV

Auto fallback in prewrite

When TiKV receives an async-commit or 1PC prewrite request, it calculates min_commit_ts key by key and checks it against max_commit_ts. So within a single prewrite request, some keys may pass the commit_ts constraint check while others may not.

After a CommitTsTooLarge error is encountered, subsequent mutations will be prewritten the normal 2PC way. Prior async-commit mutations are not amended, to keep the implementation simple. 1PC mutations will be written as normal locks in this case. The returned prewrite response should set min_commit_ts to 0 to indicate a fallback.

So, after CommitTsTooLarge happens, all mutations should be successfully written as locks. But we don't guarantee that all of them satisfy use_async_commit = false. As long as one of the locks has use_async_commit = false, we know that this transaction falls back from async commit.
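
The fallback behaviour can be sketched roughly as follows. This is an illustrative model, not TiKV code: `Lock`, `prewrite`, and `is_fallback` are hypothetical names, and the per-key min_commit_ts is passed in rather than calculated.

```python
from dataclasses import dataclass


@dataclass
class Lock:
    key: str
    use_async_commit: bool


def prewrite(mutations, max_commit_ts):
    """mutations: list of (key, min_commit_ts calculated for that key).

    Returns the written locks and the min_commit_ts for the response,
    where 0 signals a fallback.
    """
    locks = []
    fallen_back = False
    min_commit_ts = 0
    for key, key_min_commit_ts in mutations:
        if not fallen_back and key_min_commit_ts > max_commit_ts:
            # CommitTsTooLarge: this key and all later ones use normal 2PC.
            fallen_back = True
        if fallen_back:
            locks.append(Lock(key, use_async_commit=False))
        else:
            # Earlier async-commit locks are left as they are.
            locks.append(Lock(key, use_async_commit=True))
            min_commit_ts = max(min_commit_ts, key_min_commit_ts)
    return locks, (0 if fallen_back else min_commit_ts)


def is_fallback(locks):
    # One lock with use_async_commit = false is enough to tell.
    return any(not lock.use_async_commit for lock in locks)
```

With `max_commit_ts=20` and per-key values 10, 30, 12, the first lock stays an async-commit lock while the second and third become normal locks, and the response carries 0.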

Roll back the primary lock

In order to resolve locks of a fallback transaction, we need a mechanism to roll back the primary lock. By default, CheckTxnStatus does not roll back an async-commit lock. We can add a new flag async_commit_fallback to indicate that the transaction has fallen back to normal 2PC, so it is safe to roll back the lock.
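
As a rough sketch of how the proposed flag changes the outcome, assuming a hypothetical dictionary representation of the primary lock and assuming that a rollback still requires the lock's TTL to have expired (the discussion above does not state this):

```python
def check_txn_status(primary, async_commit_fallback=False):
    """primary: None if no lock exists, else a dict with the keys
    'use_async_commit' and 'ttl_expired' (hypothetical representation).
    """
    if primary is None:
        return "no_lock"
    if primary["use_async_commit"] and not async_commit_fallback:
        # Default behaviour: an async-commit lock is never rolled back here.
        return "locked"
    # Either a normal 2PC lock, or the caller asserted a fallback.
    # Assumption: rollback is only allowed once the TTL has expired.
    return "rolled_back" if primary["ttl_expired"] else "locked"
```

The same expired async-commit primary yields `"locked"` by default and `"rolled_back"` once `async_commit_fallback=True` is passed.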

TiDB

Commit procedure

When TiDB prewrites using the async-commit or 1PC way and receives 0 as min_commit_ts, it knows that a fallback has happened. It then commits the transaction the normal 2PC way: commit the primary lock first, return success to the user, and commit the secondary locks asynchronously.
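
The order of operations on the TiDB side might look like the sketch below. `commit_steps` is a hypothetical helper that only lists the steps as strings; it performs no real commit.

```python
def commit_steps(prewrite_min_commit_ts, primary, secondaries):
    """List the commit steps after an async-commit or 1PC prewrite.

    prewrite_min_commit_ts is the min_commit_ts from the prewrite
    response; 0 means TiKV fell back to normal 2PC.
    """
    if prewrite_min_commit_ts != 0:
        return ["finish as async commit or 1PC"]
    steps = [f"commit primary {primary}", "return success to user"]
    # Secondary locks are committed in the background afterwards.
    steps += [f"commit secondary {key} (async)" for key in secondaries]
    return steps
```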

Lock resolving

First, TiDB queries the primary lock for the transaction status using CheckTxnStatus as usual. If the primary lock has the use_async_commit flag, it checks all secondary locks. This operation returns the information of each lock, or writes a rollback record if the lock does not exist.

If any rollback is written, the transaction is bound to fail. It has nothing to do with fallback.

If any returned lock's use_async_commit is false, it means it's a fallback transaction. We cannot resolve locks using the async-commit way. Then, we can set the new flag async_commit_fallback and do CheckTxnStatus again. This operation will roll back the primary lock if the primary lock still exists. And the following procedures are the same as resolving a normal 2PC transaction lock.
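
The resolving decision described above can be sketched as a single function. The names and the string results are hypothetical, and each secondary is modelled either as the string `"rollback"` or as a dict carrying `use_async_commit`.

```python
def resolve_plan(primary_use_async_commit, secondaries):
    """secondaries: results of checking the secondary locks, each either
    "rollback" (a rollback record was written) or a dict with the key
    'use_async_commit'.
    """
    if not primary_use_async_commit:
        return "resolve as normal 2PC"
    if any(s == "rollback" for s in secondaries):
        # Bound to fail, regardless of any fallback.
        return "roll back transaction"
    if any(not s["use_async_commit"] for s in secondaries):
        # Fallback transaction: redo CheckTxnStatus with the new flag.
        return "CheckTxnStatus(async_commit_fallback), then resolve as normal 2PC"
    return "resolve as async commit"
```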

cc @nrc @youjiali1995 @MyonKeminta

MyonKeminta commented 3 years ago

> If any min_commit_ts is 0, it means it's a fallback transaction.

I think it's better to use the use_async_commit field, which already seems to be included in the returned message of check_secondary_locks. I'm kind of afraid that min_commit_ts in secondary locks may not always be the flag of whether async commit is used in the future.

sticnarf commented 3 years ago

> If any min_commit_ts is 0, it means it's a fallback transaction.
>
> I think it's better to use the use_async_commit field, which already seems to be included in the returned message of check_secondary_locks. I'm kind of afraid that min_commit_ts in secondary locks may not always be the flag of whether async commit is used in the future.

Fixed.

nrc commented 3 years ago

@sticnarf can we close this issue now?

sticnarf commented 3 years ago

> @sticnarf can we close this issue now?

Ah, yes. We can close it.

tikv / sig-transaction

Handle CommitTsTooLarge error in an efficient way #64
