Open giant-panda666 opened 3 weeks ago
Can you provide the log to show "and dm-worker isn't enter auto-safe mode"?
well, there are many errors in log. But i can remember what happened in the whole process is from 2024/07/05 13:02 to 2024/07/05 13:19. At 2024/07/05 13:02, the k8s node on which this worker was connected to TiDB server was draining. At 2024/07/05 13:19 the k8s node was recovered to accept pod scheduler. I found the worker was paused, so i do resume-task command to resume the task.
from source code:
so if apply succeed in fact, but conn becomes bad connection, it will be retried because of retryable error. However, it will be failed if applied statement is insert statement.
I need the full log to understand what's happened, especially for the period of resuming the task action before meet "Duplicate entry" error around 2024/07/05 13:02:48.251 +00:00
When the task resumes, safe mode exit point should enable the safe mode to cover the partially replicated binlog events.
That's full log, because there are so many log, so i have changed log level to error(it seems not work in go-mysql.). maybe "After network failure was fixed, the paused task is resumed. " as above is wrong(i'm not so familiar with source code...), it should be connection is broken and then dbConn retry during which causing duplicated key.
it should be connection is broken and then dbConn retry during which causing duplicated key.
Oh I understand, will check it later.
From the log file you provided, the error is "invalid connection", not "bad connection"
And DM will not retry sending the query for "invalid connection". So I still need full log to locate the problem.
no extra logs, log level is error and that's all. I have changed safe-mode to true, the problem will not be produced any more.
What did you do?
the connection from dm-worker to downstream TiDB is interrupted and the error is
[error="[code=10006:class=database:scope=not-set:level=high], Message: execute statement failed: begin, RawCause: invalid connection"]
. After network failure was fixed, the paused task is resumed. But the insert statement has been exected while the checkpoint is not updated, and dm-worker isn't enter auto-safe mode, so the insert statement is executed again. Sadly the syncing task is paused again because of duplicated primary key.What did you expect to see?
after network failure was fixed, the paused task should be resumed and it can handle duplicated key error.
What did you see instead?
the syncing task paused again.
Versions of the cluster
DM version (run
dmctl -V
ordm-worker -V
ordm-master -V
):Upstream MySQL/MariaDB server version:
Downstream TiDB cluster version (execute
SELECT tidb_version();
in a MySQL client):How did you deploy DM: tiup or manually?
Other interesting information (system version, hardware config, etc):
current status of DM cluster (execute
query-status <task-name>
in dmctl)