pingcap / tiflow

This repo maintains DM (a data migration platform) and TiCDC (change data capture for TiDB)

syncer task is paused after connection recovery #11384

Open giant-panda666 opened 3 weeks ago

giant-panda666 commented 3 weeks ago

What did you do?

The connection from dm-worker to the downstream TiDB was interrupted, and the error was [error="[code=10006:class=database:scope=not-set:level=high], Message: execute statement failed: begin, RawCause: invalid connection"]. After the network failure was fixed, the paused task was resumed. But the insert statement had already been executed while the checkpoint was not updated, and dm-worker did not enter auto-safe mode, so the insert statement was executed again. Sadly, the syncing task was paused again because of a duplicate primary key.

What did you expect to see?

After the network failure was fixed, the paused task should be resumed and it should be able to handle the duplicate key error.

What did you see instead?

The syncing task was paused again.

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

v7.5.1

Upstream MySQL/MariaDB server version:

MySQL 5.7

Downstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

v7.5.1

How did you deploy DM: tiup or manually?

TiDB Operator

Other interesting information (system version, hardware config, etc):


current status of DM cluster (execute query-status <task-name> in dmctl)

(paste current status of DM cluster here)

lance6716 commented 3 weeks ago

Can you provide the log showing that "dm-worker did not enter auto-safe mode"?

giant-panda666 commented 3 weeks ago

worker.log

Well, there are many errors in the log, but I can recall the whole process: it ran from 2024/07/05 13:02 to 2024/07/05 13:19. At 2024/07/05 13:02, the k8s node hosting the TiDB server that this worker was connected to started draining. At 2024/07/05 13:19 the node was recovered and accepted pod scheduling again. I found the task was paused, so I ran the resume-task command to resume it.

giant-panda666 commented 2 weeks ago

From the source code:

  1. Execute SQL in the DML worker: https://github.com/pingcap/tiflow/blob/1252979421fc83ffa2a1548d981e505f7fc0b909/dm/syncer/dml_worker.go#L259
  2. Execute SQL in dbconn: https://github.com/pingcap/tiflow/blob/1252979421fc83ffa2a1548d981e505f7fc0b909/dm/syncer/dbconn/db.go#L220
  3. Execute SQL with ignore-error handling: https://github.com/pingcap/tiflow/blob/1252979421fc83ffa2a1548d981e505f7fc0b909/dm/syncer/dbconn/db.go#L172

So if the apply actually succeeded but the connection became a bad connection, the statement will be retried because the error is classified as retryable. However, the retry will fail if the applied statement is an insert statement (duplicate primary key).
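
To make the failure mode concrete, here is a minimal self-contained sketch. It is not DM's actual code; the fake downstream, error values, and retry loop are invented for illustration. The server applies the INSERT, the connection drops before the client sees the result, and the "retryable error" classification drives a retry that re-executes the INSERT and hits a duplicate key:

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errInvalidConn  = errors.New("invalid connection")          // the client only sees a broken connection
	errDuplicateKey = errors.New("Error 1062: Duplicate entry") // the re-run INSERT collides with the primary key
)

// fakeDownstream stands in for a TiDB that applied the first INSERT but lost
// the connection before the client received the acknowledgement.
type fakeDownstream struct {
	applied map[string]bool
	tried   bool
}

func (d *fakeDownstream) exec(sql string) error {
	if !d.tried {
		d.tried = true
		d.applied[sql] = true // the statement really took effect ...
		return errInvalidConn // ... but the connection died before the result was delivered
	}
	if d.applied[sql] {
		return errDuplicateKey // re-running the same INSERT violates the primary key
	}
	d.applied[sql] = true
	return nil
}

// isRetryable mimics classifying connection errors as retryable.
func isRetryable(err error) bool { return errors.Is(err, errInvalidConn) }

func main() {
	db := &fakeDownstream{applied: map[string]bool{}}
	stmt := "INSERT INTO t (id) VALUES (1)"

	var err error
	for attempt := 1; attempt <= 2; attempt++ {
		if err = db.exec(stmt); err == nil || !isRetryable(err) {
			break
		}
		fmt.Printf("attempt %d: retryable error %q, retrying\n", attempt, err)
	}
	fmt.Println("final error:", err) // duplicate-key error, so the task pauses
}
```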

lance6716 commented 2 weeks ago

I need the full log to understand what happened, especially for the period from resuming the task until the "Duplicate entry" error at around 2024/07/05 13:02:48.251 +00:00.

When the task resumes, the safe mode exit point should enable safe mode to cover the partially replicated binlog events.

https://github.com/pingcap/tiflow/blob/707238a34a8aa4268cd9998be7d07348c6187a70/dm/syncer/safe_mode.go#L94
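
For context, here is a minimal sketch of the idea behind safe mode. It is not DM's actual implementation, only an illustration that in safe mode the generated DML is idempotent (for example, an INSERT is written as a REPLACE), so replaying binlog events that may already have been applied cannot cause duplicate-key errors:

```go
package main

import (
	"fmt"
	"strings"
)

// toSafeDML rewrites an INSERT into a REPLACE so that re-executing it after a
// partial apply overwrites the existing row instead of failing on the key.
func toSafeDML(sql string) string {
	if strings.HasPrefix(strings.ToUpper(sql), "INSERT ") {
		return "REPLACE " + sql[len("INSERT "):]
	}
	return sql
}

func main() {
	fmt.Println(toSafeDML("INSERT INTO t (id, v) VALUES (1, 'a')"))
	// prints: REPLACE INTO t (id, v) VALUES (1, 'a')
}
```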

giant-panda666 commented 2 weeks ago

That is the full log. Because there were so many log lines, I had changed the log level to error (it seems not to take effect in go-mysql). Maybe my statement above, "After network failure was fixed, the paused task is resumed.", is wrong (I'm not so familiar with the source code...); it should be that the connection was broken and then dbConn retried, and the retry caused the duplicate key error.

lance6716 commented 2 weeks ago

> it should be that the connection was broken and then dbConn retried, and the retry caused the duplicate key error.

Oh I understand, will check it later.

lance6716 commented 2 weeks ago

worker.log

From the log file you provided, the error is "invalid connection", not "bad connection"

And DM will not retry sending the query on "invalid connection", so I still need the full log to locate the problem.

giant-panda666 commented 2 weeks ago

There are no extra logs; the log level is error and that's all of them. I have changed safe-mode to true, and the problem can no longer be reproduced.
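
For readers who want the same workaround: safe mode can be enabled for the whole task in the DM task configuration's syncers section. The fragment below is a sketch; the config name and source-id are placeholders, and only safe-mode is the relevant setting:

```yaml
# Fragment of a DM task configuration (illustrative; values other than safe-mode are placeholders).
syncers:
  global:
    safe-mode: true   # always generate idempotent DML, at some cost to write throughput

mysql-instances:
  - source-id: "mysql-replica-01"
    syncer-config-name: "global"
```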