pingcap / tiflow

This repo maintains DM (a data migration platform) and TiCDC (change data capture for TiDB)
Apache License 2.0

CDC cloud: ERROR ["execute DMLs failed"] Write conflict #2248

Closed Tammyxia closed 2 years ago

Tammyxia commented 3 years ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.
  2. What did you expect to see?
  3. What did you see instead? CDC log:

     [2021/07/07 10:25:33.713 +00:00] [ERROR] [feed_state_manager.go:239] ["processor report an error"] [changefeedID=27f6e0be-db56-4fd0-bdd3-dea18aa9fcac] [captureID=efe4cd44-1841-4cb6-b23e-ac732da3a04f] [error="{\"addr\":\"db-ticdc-1.db-ticdc-peer.tidb458.svc:8301\",\"code\":\"CDC:ErrProcessorUnknown\",\"message\":\"[CDC:ErrMySQLTxnError]Error 9007: Write conflict, txnStartTS=426155312664543240, conflictStartTS=426155307225055233, conflictCommitTS=426155312690757666, key=? [try again later]\"}"]
     [2021/07/07 10:26:07.310 +00:00] [ERROR] [mysql.go:1045] ["execute DMLs failed"] [err="[CDC:ErrMySQLTxnError]Error 9007: Write conflict, txnStartTS=426155323281375233, conflictStartTS=426155323124088833, conflictCommitTS=426155324434808833, key=? [try again later]"]
     [2021/07/07 10:26:23.560 +00:00] [ERROR] [mysql.go:1045] ["execute DMLs failed"] [err="[CDC:ErrMySQLTxnError]sql: database is closed"]

  4. Versions of the cluster

    • Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

      5.0.2
    • TiCDC version (execute cdc version):

       ["Welcome to Change Data Capture (CDC)"] [release-version=v5.0.0-dev] [git-hash=9300b05ceec2c4811198416a5d21e9f4910deddd] [git-branch=cloud-cdc-5.0] [utc-build-time="2021-06-26 14:16:57"] [go-version="go version go1.16.3 linux/amd64"] [failpoint-build=false]

CharlesCheung96 commented 2 years ago

There are two possible reasons for the issue:

  1. There is a bug in the transaction conflict detection of the MySQL sink that lets different workers execute conflicting transactions concurrently. However, this code has not been modified recently and the error has not reappeared in recent tests, so it is probably not the root cause.

  2. There is a bug in the scheduler module during scale-in or scale-out that lets different captures write to the same downstream at the same time. The scheduler code has been deprecated in 5.0, and it is not clear whether the new scheduler module has this problem, so we are closing this issue for now.
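To make the first hypothesis concrete, here is a minimal sketch of key-based conflict detection. It is not the actual TiCDC implementation: the `Txn`, `Detector`, `Assign`, and `Reset` names are hypothetical, and the sketch only assumes that each transaction can be reduced to the set of row keys it modifies. Transactions that share a key go to the same worker, so they run in order. A transaction whose keys are already owned by two different workers cannot be assigned safely until those workers have flushed.

```go
package main

import "fmt"

// Txn is a hypothetical stand-in for a downstream transaction:
// an ID plus the keys (for example, encoded primary or unique
// key values) of the rows it modifies.
type Txn struct {
	ID   int
	Keys []string
}

// Detector remembers which worker currently owns each key.
type Detector struct {
	workers int
	owner   map[string]int
	next    int // next worker to use for a transaction with no owned keys
}

func NewDetector(workers int) *Detector {
	return &Detector{workers: workers, owner: make(map[string]int)}
}

// Assign returns the worker that must run t. It returns (-1, false)
// when t touches keys owned by more than one worker; the caller is
// then expected to flush all workers, call Reset, and retry.
func (d *Detector) Assign(t Txn) (int, bool) {
	worker := -1
	for _, k := range t.Keys {
		w, ok := d.owner[k]
		if !ok {
			continue
		}
		if worker != -1 && worker != w {
			return -1, false
		}
		worker = w
	}
	if worker == -1 {
		// No key is owned yet: pick a worker round-robin.
		worker = d.next
		d.next = (d.next + 1) % d.workers
	}
	for _, k := range t.Keys {
		d.owner[k] = worker
	}
	return worker, true
}

// Reset forgets all key ownership, to be called after a flush.
func (d *Detector) Reset() { d.owner = make(map[string]int) }

func main() {
	d := NewDetector(2)
	txns := []Txn{
		{ID: 1, Keys: []string{"a"}},
		{ID: 2, Keys: []string{"b"}},
		{ID: 3, Keys: []string{"a"}},
		{ID: 4, Keys: []string{"a", "b"}},
	}
	for _, t := range txns {
		w, ok := d.Assign(t)
		fmt.Printf("txn %d -> worker %d (assignable=%v)\n", t.ID, w, ok)
	}
}
```

If a detector like this mapped two transactions with a shared key to different workers, both could commit against the same row concurrently, which is the kind of bug that would surface as Error 9007 downstream.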

Feel free to reopen it if this issue happens again.

maxshuang commented 2 years ago

An on-call case on TiCDC v6.1.0 has hit this problem.

[2022/09/23 07:28:36.822 +00:00] [WARN] [mysql.go:612] ["execute DMLs with error, retry later"] [error="
[CDC:ErrMySQLTxnError]MySQL txn error: Error 9007: Write conflict, txnStartTS=436186150766379010, conflictStartTS=436186150727057454, conflictCommitTS=436186150766379021, 
key={tableID=172, indexID=4, indexValues={1, 85709, 15, 13, }} primary={tableID=161, indexID=1, indexValues={346943, 1850395608943165440, 729430118, 1, 1598468504, }} [try again later]"] 
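When comparing such entries, it can help to turn the TSO values in the message into wall-clock times. The sketch below assumes the usual TiDB TSO layout (physical milliseconds in the high bits, an 18-bit logical counter in the low bits). `TSOToTime` and `ParseConflict` are hypothetical helper names, not part of TiCDC.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"time"
)

// physicalShiftBits is the width of the logical counter in a TSO
// (assumed layout: physical milliseconds << 18 | logical).
const physicalShiftBits = 18

// tsField matches the three timestamp fields of a "Write conflict" message.
var tsField = regexp.MustCompile(`(txnStartTS|conflictStartTS|conflictCommitTS)=(\d+)`)

// TSOToTime converts a TSO to the UTC wall-clock time of its physical part.
func TSOToTime(ts uint64) time.Time {
	return time.UnixMilli(int64(ts >> physicalShiftBits)).UTC()
}

// ParseConflict extracts the timestamp fields from an Error 9007
// message and returns them as wall-clock times keyed by field name.
func ParseConflict(msg string) map[string]time.Time {
	out := make(map[string]time.Time)
	for _, m := range tsField.FindAllStringSubmatch(msg, -1) {
		ts, err := strconv.ParseUint(m[2], 10, 64)
		if err != nil {
			continue // skip values that do not fit in uint64
		}
		out[m[1]] = TSOToTime(ts)
	}
	return out
}

func main() {
	msg := "Error 9007: Write conflict, txnStartTS=436186150766379010, " +
		"conflictStartTS=436186150727057454, conflictCommitTS=436186150766379021"
	times := ParseConflict(msg)
	for _, name := range []string{"conflictStartTS", "txnStartTS", "conflictCommitTS"} {
		fmt.Printf("%-17s %s\n", name, times[name].Format(time.RFC3339Nano))
	}
}
```

In this entry the conflicting transaction started before the failing one (smaller `conflictStartTS`) and committed after the failing one had started (larger `conflictCommitTS`), so the two transactions overlapped on the same key.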

This needs more investigation. https://internal.pingcap.net/jira/browse/ONCALL-5443

maxshuang commented 2 years ago

Tracked in https://github.com/pingcap/tiflow/issues/7227

nongfushanquan commented 2 years ago

/remove affects-6.1