pingcap / tiflow

This repo maintains DM (a data migration platform) and TiCDC (change data capture for TiDB)
Apache License 2.0

Implement a native binary codec for high throughput scenario. #1613

Open sunxiaoguang opened 3 years ago

sunxiaoguang commented 3 years ago

Feature Request

Describe the feature you'd like: The TiCDC Open Protocol is a JSON format that is easy to parse and human readable. However, a binary serialization format would take less space and use fewer resources to parse. This is especially important for a distributed database like TiDB that supports high write throughput.

Describe alternatives you've considered: Using protobuf to define the codec format would be easier to implement. However, the possibility of serializing data in a more compact form and parsing it in a streaming fashion makes it worth the cost of maintaining such code for a protocol like TiCDC that is unlikely to change much.

Teachability, Documentation, Adoption, Migration Strategy: Here is a possible encoding format that is compact and supports streaming reads and skipping certain values at will. In addition, this format can evolve in the future without breaking backward compatibility: an old parser can simply skip unknown bits and keep parsing whatever fields it understands.
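To illustrate the forward-compatibility property, here is a minimal sketch of a tag-length-value layout with varint tags and lengths. This is a hypothetical illustration, not the actual craft wire format: the point is only that a reader which does not recognize a tag can skip its payload by length and keep going.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// appendField encodes one field as: varint tag, varint length, raw bytes.
func appendField(buf []byte, tag uint64, value []byte) []byte {
	buf = binary.AppendUvarint(buf, tag)
	buf = binary.AppendUvarint(buf, uint64(len(value)))
	return append(buf, value...)
}

// readKnown scans the buffer, collecting values for tags the reader knows
// about and silently skipping everything else -- this is what lets an old
// parser handle messages produced by a newer writer.
func readKnown(buf []byte, known map[uint64]bool) map[uint64][]byte {
	out := make(map[uint64][]byte)
	for len(buf) > 0 {
		tag, n := binary.Uvarint(buf)
		buf = buf[n:]
		size, n := binary.Uvarint(buf)
		buf = buf[n:]
		if known[tag] {
			out[tag] = buf[:size]
		}
		buf = buf[size:] // unknown tag: skip its payload and keep parsing
	}
	return out
}

func main() {
	var buf []byte
	buf = appendField(buf, 1, []byte("commit-ts"))
	buf = appendField(buf, 2, []byte("row-data"))
	buf = appendField(buf, 99, []byte("added-in-a-future-version"))

	// A reader that only knows tags 1 and 2 still parses the message.
	got := readKnown(buf, map[uint64]bool{1: true, 2: true})
	fmt.Println(len(got), string(got[1])) // 2 commit-ts
}
```

Because every field carries its own length, no schema lookup is needed to skip a value, which is also what enables streaming reads.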

Some tests demonstrate that the same events serialized in this binary format can be smaller than the JSON format even with zlib compression:

| case | craft size | json size | protobuf 1 size | protobuf 2 size | craft compressed | json compressed | protobuf 1 compressed | protobuf 2 compressed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| case 0 | 180 | 390 (116%)+ | 197 (9%)+ | 199 (10%)+ | 136 | 190 (39%)+ | 157 (15%)+ | 150 (10%)+ |
| case 1 | 722 | 1576 (118%)+ | 820 (13%)+ | 810 (12%)+ | 256 | 305 (19%)+ | 266 (3%)+ | 262 (2%)+ |
sunxiaoguang commented 3 years ago

There is a draft version of a document describing this protocol.

sunxiaoguang commented 3 years ago

The initial version of the Java parser in the TiBigData project is ready for review here.

sunxiaoguang commented 3 years ago

Changes to a table like the one below can be handled in Flink with the Kafka connector using the new TiCDC native binary format.

DDL on TiDB

CREATE TABLE test_flink (
  c1  tinyint,
  c2  smallint,
  c3  mediumint,
  c4  int,
  c5  bigint,
  c6  char(10),
  c7  varchar(20),
  c8  tinytext,
  c9  mediumtext,
  c10 text,
  c11 longtext,
  c12 binary(20),
  c13 varbinary(20),
  c14 tinyblob,
  c15 mediumblob,
  c16 blob,
  c17 longblob,
  c18 float,
  c19 double,
  c20 decimal(6, 3),
  c21 date,
  c22 time,
  c23 datetime,
  c24 timestamp,
  c25 year,
  c26 boolean,
  c27 json,
  c28 enum ('1','2','3'),
  c29 set ('a','b','c'),
  PRIMARY KEY(c1),
  UNIQUE KEY(c2)
);

DDL on Flink

CREATE TABLE test_flink (
  c1  tinyint,
  c2  smallint,
  c3  int,
  c4  int,
  c5  bigint,
  c6  string, 
  c7  string,
  c8  string,
  c9  string,
  c10 string,
  c11 string,
  c12 bytes,
  c13 bytes,
  c14 bytes,
  c15 bytes,
  c16 bytes,
  c17 bytes,
  c18 float,
  c19 double,
  c20 decimal(6, 3),
  c21 date,
  c22 time,
  c23 timestamp(6),
  c24 timestamp(6),
  c25 int,
  c26 boolean,
  c27 string,
  c28 int,
  c29 int
) WITH (
  'connector' = 'kafka',
  'topic' = 'cdc-test',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'testGroup',
  'format' = 'ticdc-craft',
  'ticdc-craft.schema.include' = 'test',
  'ticdc-craft.table.include' = 'test_flink'
);
