pingcap / dm


relay: refactor dm-worker relay logic #2214

Open lichunzhu opened 3 years ago

lichunzhu commented 3 years ago

User story

  1. When relay log is disabled, each subtask of the dm-worker pulls binlogs from the same source. This puts more pressure on the upstream server, and the average performance of each source drops when the number of connections is large.
  2. When relay log is disabled, if the upstream purges its binlogs while DM synchronization is lagging behind, synchronization fails and the user can only recover by redoing the full task.
  3. When relay log is enabled, DM synchronization latency under light load is higher than without relay log, which makes some users reluctant to enable it (see Testing synchronization link latency and CPU usage).
  4. In earlier versions of DM, enabling relay log makes the CPU consumption of dm-worker significantly higher than without relay.
  5. Relay log can only be read by the local dm-worker. If the source is transferred to another worker, the relay log has to be pulled from the upstream MySQL again; it cannot be fetched from the dm-worker that already has it.

Target

New relay module design

relay writer

In the existing relay design, the nature of binlog means that the relay writer itself writes sequentially. After comparing it with the MySQL and TiDB Binlog designs, I see no major problem with the existing implementation, so the update can focus on optimizing the code structure and streamlining the process.

The relay log structure and directory are not changed, so there is no compatibility problem.

relay reader

This proposal focuses on the reader. We propose the following changes to the DM relay module.

  1. The relay reader code moves into the relay module. The relay module is responsible for scheduling and management, and implements a unified interface for reading binlog either from the local binlog files or from another dm-worker over gRPC.
  2. When downstream consumption is faster than upstream fetching, the unified relay module spares the reader from repeatedly checking the size of the local file on disk to find out whether a new binlog event has arrived. Preliminary testing suggests this should bring the latency with relay enabled down to roughly the same level as with relay disabled.
  3. (To be discussed) The relay module caches the most recent section of binlog that has been read from upstream and is about to be written. If a reader requests this section, the events are served directly from memory. This mainly addresses the extra time spent writing binlog to disk and reading it back when downstream consumption is faster than upstream pulling.
    • Cache size design, peak -> trough switching
    • Forward and backward switching, using the relay log file switch as the switching point
    • Quickly locate the event according to GTID/pos when switching

Subtasks

Phase 1 - relay reader refactoring