pingcap / tidb

TiDB - the open-source, cloud-native, distributed SQL database designed for modern applications.
https://pingcap.com
Apache License 2.0
37.2k stars 5.84k forks source link

TiDB may return a write conflict error even when the tranaction has been committed. #34066

Closed zyguan closed 2 years ago

zyguan commented 2 years ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Not a minimal one, the issue occured when we run jepsen append test.

2. What did you expect to see? (Required)

The test passed normally.

3. What did you see instead (Required)

An anomoly was detected. It seems some transactions were actually committed however tidb returned write conflict (and notified users that they can try again later) rather than ok or underemined.

2022-04-07 20:00:13,245{GMT}    INFO    [jepsen test runner] jepsen.core: {:perf
 {:latency-graph {:valid? true},
  :rate-graph {:valid? true},
  :valid? true},
 :workload
 {:valid? false,
  :anomaly-types (:G-single),
  :anomalies
  {:G-single
   ("Let:\n  T1 = {:type :ok, :f :txn, :value [[:append 784 1] [:append 784 2] [:append 781 10]], :process 30, :time 31383426818, :txn-info {:txn_scope \"global\", :start_ts 432370173604266603, :commit_ts 432370173617373222, :txn_commit_mode \"async_commit\", :async_commit_fallback false, :one_pc_fallback false}, :index 20597}\n  T2 = {:type :ok, :f :txn, :value [[:r 784 [1 2 4 5 3 6]] [:r 783 [2 4 1 7 9 11 12 13 14 16]] [:append 781 12] [:r 781 [2 1 3 5 6 8 4 10 11 14 16 12]]], :process 30, :time 31430826740, :txn-info {:txn_scope \"global\", :start_ts 432370173617373576, :commit_ts 432370173630480390, :txn_commit_mode \"2pc\", :async_commit_fallback false, :one_pc_fallback false}, :index 20653}\n  T3 = {:type :ok, :f :txn, :value [[:r 781 [2 1 3 5 6 8 4 10 11 14 16 12]] [:append 780 5] [:r 777 [2 3 5 6 8 9 7 10 11 12 4 14 1 13 15]] [:append 777 16]], :process 3, :time 34983032310, :txn-info {:txn_scope \"global\", :start_ts 432370173630480536, :commit_ts 432370174561091587, :txn_commit_mode \"2pc\", :async_commit_fallback false, :one_pc_fallback false}, :index 20791}\n  T4 = {:type :ok, :f :txn, :value [[:append 781 4] [:r 777 [2 3 5 6 8 9 7 10 11 12 4 14 1 13 15 16]] [:append 782 4] [:append 777 13]], :process 24, :time 35067883431, :txn-info {:txn_scope \"global\", :start_ts 432370174574460936, :commit_ts 432370174574460956, :txn_commit_mode \"async_commit\", :async_commit_fallback false, :one_pc_fallback false}, :index 20803}\n\nThen:\n  - T1 < T2, because T1 completed at index 20597, 0.000 seconds before the invocation of T2, at index 20599.\n  - T2 < T3, because T3 observed T2's append of 12 to key 781.\n  - T3 < T4, because T4 observed T3's append of 16 to key 777.\n  - However, T4 < T1, because T1 appended 10 after T4 appended 4 to 781: a contradiction!")}},
 :valid? false}

Analysis invalid! (ノಥ益ಥ)ノ ┻━┻

The full log can be accessed here

4. What is your TiDB version? (Required)

master

sticnarf commented 2 years ago

https://github.com/tikv/tikv/issues/11148 discussed about optimizing optimistic deadlocks. But the final solution https://github.com/tikv/client-go/pull/367 introduced this bug. Before the optimization, the lock needs to be resolved first, so it will not report a false write conflict.

The idea of that optimization is that a transaction must be a write conflict if it encounters a lock whose start TS is larger. But we overlooked one case when the transaction has been already committed. This could happen when the response of the first prewrite is lost, but the retried prewrite encounters the newer lock. In this case, the user might receive a write conflict error even if the transaction is actually committed.

TiDB 5.0.6, 5.4.0 and 6.0.0 are affected. The unreleased 5.2 release should be blocked.