pingcap / tiflow

This repo maintains DM (a data migration platform) and TiCDC (change data capture for TiDB)
Apache License 2.0
425 stars 283 forks source link

relay-reader may fail to read binlog events after we kill dm-worker and restart it #5234

Open lichunzhu opened 2 years ago

lichunzhu commented 2 years ago

What did you do?

Run a dm-worker and randomly kill it and restart dm-worker.

What did you expect to see?

dm-worker can handle binlog events correctly.

What did you see instead?

https://ci2.pingcap.net/job/dm_ghpr_integration_test/5086/display/redirect dm-worker local-reader reports an error and can't continue working.

[2022-04-21T06:46:07.388Z] [2022/04/21 14:45:51.936 +08:00] [DEBUG] [syncer.go:2034] ["receive binlog event"] [task=test] [unit="binlog replication"] [header="{\"Timestamp\":1650523490,\"EventType\":33,\"ServerID\":1,\"EventSize\":79,\"LogPos\":170583,\"Flags\":0}"]
[2022-04-21T06:46:07.388Z] [2022/04/21 14:45:51.936 +08:00] [ERROR] [streamer_controller.go:277] ["meet error when get binlog event"] [task=test] [unit="binlog replication"] [error="[code=11038:class=functional:scope=internal:level=high], Message: parse relay log file dm-integration-test-go1180-5086-r6cls-xnttk-bin.000001 from offset 4 in dir relay-dir/b9d62051-c13d-11ec-acfe-2672ab805f07.000001: parse relay log file dm-integration-test-go1180-5086-r6cls-xnttk-bin.000001 from offset 173945 in dir relay-dir/b9d62051-c13d-11ec-acfe-2672ab805f07.000001: parse relay log file relay-dir/b9d62051-c13d-11ec-acfe-2672ab805f07.000001/dm-integration-test-go1180-5086-r6cls-xnttk-bin.000001, RawCause: Header &replication.EventHeader{Timestamp:0x6260fd75, EventType:0x1e, ServerID:0x1, EventSize:0x38, LogPos:0x2a779, Flags:0x0}, Data \"\\xf5\\x05\\x00\\x00\\x00\\x00\\x01\\x00\\x02\\x00\\x05\\xff\\x00#\\x00\\x00\\x00#\\x00\\x00\\x00#\\x00\\x00\\x00#\\x00\\x00\\x00#\\x00\\x00\\x00\", Err: table id 1525: invalid table id, no corresponding table map event"] [errorVerbose="[code=11038:class=functional:scope=internal:level=high], Message: parse relay log file dm-integration-test-go1180-5086-r6cls-xnttk-bin.000001 from offset 4 in dir relay-dir/b9d62051-c13d-11ec-acfe-2672ab805f07.000001: parse relay log file dm-integration-test-go1180-5086-r6cls-xnttk-bin.000001 from offset 173945 in dir relay-dir/b9d62051-c13d-11ec-acfe-2672ab805f07.000001: parse relay log file relay-dir/b9d62051-c13d-11ec-acfe-2672ab805f07.000001/dm-integration-test-go1180-5086-r6cls-xnttk-bin.000001, RawCause: Header &replication.EventHeader{Timestamp:0x6260fd75, EventType:0x1e, ServerID:0x1, EventSize:0x38, LogPos:0x2a779, Flags:0x0}, Data \"\\xf5\\x05\\x00\\x00\\x00\\x00\\x01\\x00\\x02\\x00\\x05\\xff\\x00#\\x00\\x00\\x00#\\x00\\x00\\x00#\\x00\\x00\\x00#\\x00\\x00\\x00#\\x00\\x00\\x00\", Err: table id 1525: invalid table id, no corresponding table map event\ngithub.com/pingcap/tiflow/dm/pkg/terror.(*Error).Delegate\n\t/home/jenkins/agent/workspace/dm_ghpr_integration_test/go/src/github.com/pingcap/tiflow/dm/pkg/terror/terror.go:311\ngithub.com/pingcap/tiflow/dm/relay.(*BinlogReader).parseFile\n\t/home/jenkins/agent/workspace/dm_ghpr_integration_test/go/src/github.com/pingcap/tiflow/dm/relay/local_reader.go:656\ngithub.com/pingcap/tiflow/dm/relay.(*BinlogReader).parseFileAsPossible\n\t/home/jenkins/agent/workspace/dm_ghpr_integration_test/go/src/github.com/pingcap/tiflow/dm/relay/local_reader.go:487\ngithub.com/pingcap/tiflow/dm/relay.(*BinlogReader).parseDirAsPossible\n\t/home/jenkins/agent/workspace/dm_ghpr_integration_test/go/src/github.com/pingcap/tiflow/dm/relay/local_reader.go:427\ngithub.com/pingcap/tiflow/dm/relay.(*BinlogReader).parseRelay\n\t/home/jenkins/agent/workspace/dm_ghpr_integration_test/go/src/github.com/pingcap/tiflow/dm/relay/local_reader.go:317\ngithub.com/pingcap/tiflow/dm/relay.(*BinlogReader).StartSyncByGTID.func1\n\t/home/jenkins/agent/workspace/dm_ghpr_integration_test/go/src/github.com/pingcap/tiflow/dm/relay/local_reader.go:286\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1571"]

Versions of the cluster

master

current status of DM cluster (execute query-status <task-name> in dmctl)

(paste current status of DM cluster here)
lance6716 commented 2 years ago

seems there're multiple relay writers

[2022-04-21T06:45:59.162Z] [2022/04/21 14:45:09.643 +08:00] [DEBUG] [relay.go:662] ["writing binlog event"] [component="relay log"] [header="{\"Timestamp\":1650523509,\"EventType\":19,\"ServerID\":1,\"EventSize\":58,\"LogPos\":173889,\"Flags\":0}"]
[2022-04-21T06:45:59.162Z] [2022/04/21 14:45:09.643 +08:00] [DEBUG] [relay.go:662] ["writing binlog event"] [component="relay log"] [header="{\"Timestamp\":1650523509,\"EventType\":19,\"ServerID\":1,\"EventSize\":58,\"LogPos\":173889,\"Flags\":0}"]

so the relay log is corrupted

niubell commented 2 years ago

/assign lichunzhu

lichunzhu commented 2 years ago

After the check, it's caused by our current test mechanism. We use check_point_offline to make sure dm-worker exits. However, check_process_exit is more accurate because dm-worker.test needs some to exit after its port is offline. Some tests are revised and check_process_exit shell is added to avoid this kind of problem.