sunxiaoguang opened 3 years ago
The initial version of the Java parser in the TiBigData project is ready for review here.
Changes to a table like this can be handled in Flink with the Kafka connector using the new TiCDC native binary format.
DDL on TiDB
CREATE TABLE test_flink (
c1 tinyint,
c2 smallint,
c3 mediumint,
c4 int,
c5 bigint,
c6 char(10),
c7 varchar(20),
c8 tinytext,
c9 mediumtext,
c10 text,
c11 longtext,
c12 binary(20),
c13 varbinary(20),
c14 tinyblob,
c15 mediumblob,
c16 blob,
c17 longblob,
c18 float,
c19 double,
c20 decimal(6, 3),
c21 date,
c22 time,
c23 datetime,
c24 timestamp,
c25 year,
c26 boolean,
c27 json,
c28 enum ('1','2','3'),
c29 set ('a','b','c'),
PRIMARY KEY(c1),
UNIQUE KEY(c2)
);
DDL on Flink
CREATE TABLE test_flink (
c1 tinyint,
c2 smallint,
c3 int,
c4 int,
c5 bigint,
c6 string,
c7 string,
c8 string,
c9 string,
c10 string,
c11 string,
c12 bytes,
c13 bytes,
c14 bytes,
c15 bytes,
c16 bytes,
c17 bytes,
c18 float,
c19 double,
c20 decimal(6, 3),
c21 date,
c22 time,
c23 timestamp(6),
c24 timestamp(6),
c25 int,
c26 boolean,
c27 string,
c28 int,
c29 int
) WITH (
'connector' = 'kafka',
'topic' = 'cdc-test',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'testGroup',
'format' = 'ticdc-craft',
'ticdc-craft.schema.include' = 'test',
'ticdc-craft.table.include' = 'test_flink'
);
Feature Request
Describe the feature you'd like: The TiCDC Open Protocol is a JSON format that is easy to parse and human readable. However, a binary serialization format would take less space and use fewer resources to parse. This is especially important for a distributed database like TiDB that supports high write throughput.
Describe alternatives you've considered: Using protobuf to define the codec format would be easier to implement. However, the possibility of serializing data in a more compact format and parsing it in a streaming way makes it worth the cost of maintaining such code for a protocol like TiCDC that is unlikely to change much.
Teachability, Documentation, Adoption, Migration Strategy: Here is a possible encoding format that is compact and supports streaming reads and skipping certain values at will. In addition, this format can evolve in the future without breaking backward compatibility: an old parser can simply skip unknown bits and keep parsing whatever fields it understands.
string/bytes array layout
- n bytes: array of element sizes, format: uvarint array
- n bytes: elements, format: bits
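As an illustration, the string/bytes array layout (all sizes first, then the concatenated payloads) could be written like this. This is only a sketch: the class name `StringArrayEncoder` is mine, and it assumes the uvarint here is a protobuf-style unsigned LEB128, not necessarily the actual TiBigData code.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class StringArrayEncoder {
    // Protobuf-style unsigned LEB128 varint (assumed uvarint encoding).
    static void putUvarint(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    // Layout: uvarint array of element sizes, then the concatenated bytes.
    public static byte[] encode(String[] elements) {
        ByteArrayOutputStream sizes = new ByteArrayOutputStream();
        ByteArrayOutputStream payload = new ByteArrayOutputStream();
        for (String s : elements) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            putUvarint(sizes, bytes.length);
            payload.write(bytes, 0, bytes.length);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] head = sizes.toByteArray();
        byte[] body = payload.toByteArray();
        out.write(head, 0, head.length);
        out.write(body, 0, body.length);
        return out.toByteArray();
    }
}
```

Keeping the sizes together in front means a reader can compute the offset of the i-th element (and skip the rest) without touching the payload bytes.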
varint/uvarint array layout
- n bytes: elements, format: varint/uvarint
delta varint/uvarint array layout
- n bytes: base number
- n bytes: offsets, format: varint/uvarint
string/bytes layout
- n bytes: varint length
- n bytes: payload
float layout: standard protobuf float
double layout: standard protobuf double
varint layout: standard protobuf varint
uvarint layout: standard protobuf uvarint
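Since varint/uvarint underpin every layout above, here is a minimal codec sketch. It follows the standard protobuf conventions referenced in the layout (LEB128 for uvarint, ZigZag for signed varint); the `Varint` class itself is illustrative, not the parser's actual API.

```java
import java.nio.ByteBuffer;

public class Varint {
    // Unsigned LEB128: 7 value bits per byte, high bit = continuation.
    public static void putUvarint(ByteBuffer buf, long v) {
        while ((v & ~0x7FL) != 0) {
            buf.put((byte) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        buf.put((byte) v);
    }

    public static long getUvarint(ByteBuffer buf) {
        long v = 0;
        int shift = 0;
        byte b;
        do {
            b = buf.get();
            v |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return v;
    }

    // Signed varint via ZigZag, so small negative values such as the
    // partition id sentinel -1 still encode to a single byte.
    public static void putVarint(ByteBuffer buf, long v) {
        putUvarint(buf, (v << 1) ^ (v >> 63));
    }

    public static long getVarint(ByteBuffer buf) {
        long u = getUvarint(buf);
        return (u >>> 1) ^ -(u & 1);
    }
}
```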
Message layout
- 2 bytes: version
- 2 bytes: number of pairs
- n bytes: keys
- n bytes: values
- n bytes: size tables
Keys layout
- n bytes: array of commit ts, format: delta uvarint array
- n bytes: array of type, format: uvarint array
- n bytes: array of row id, format: uvarint array
- n bytes: array of partition id, format: varint array (-1 means the field is not set)
- n bytes: array of schema, format: string array
- n bytes: array of table, format: string array
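The commit ts array is where the delta encoding pays off: timestamps within a batch are close together, so the offsets stay tiny. A decoding sketch, with one assumption called out explicitly: I read each offset as relative to the previously decoded value (the layout says only "base number" plus "offsets", so cumulative deltas are my interpretation, not confirmed by the spec).

```java
import java.nio.ByteBuffer;

public class CommitTsDecoder {
    static long getUvarint(ByteBuffer buf) {
        long v = 0;
        int shift = 0;
        byte b;
        do {
            b = buf.get();
            v |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return v;
    }

    // Assumed semantics: a base value followed by one offset per element,
    // each offset added to the previously decoded value.
    public static long[] decode(ByteBuffer buf, int count) {
        long[] ts = new long[count];
        long prev = getUvarint(buf); // base number
        for (int i = 0; i < count; i++) {
            prev += getUvarint(buf);
            ts[i] = prev;
        }
        return ts;
    }
}
```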
Row changed layout
- n bytes: multiple column groups
Column group layout
- 1 byte: column group type (1: New Values, 2: Old Values, 3: Delete Values)
- n bytes: number of columns, format: uvarint
- n bytes: array of name, format: string array
- n bytes: array of type, format: uvarint array
- n bytes: array of flag, format: uvarint array
- n bytes: array of value, format: nullable bytes array
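Reading the fixed prefix of a column group then looks like this (sketch only; the class and method names are hypothetical). The type byte tells the consumer whether the four arrays that follow describe new, old, or deleted values, and the uvarint column count sizes all four arrays.

```java
import java.nio.ByteBuffer;

public class ColumnGroupHeader {
    public static final int NEW_VALUES = 1;
    public static final int OLD_VALUES = 2;
    public static final int DELETE_VALUES = 3;

    static long getUvarint(ByteBuffer buf) {
        long v = 0;
        int shift = 0;
        byte b;
        do {
            b = buf.get();
            v |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return v;
    }

    // Read the fixed prefix of a column group: one type byte followed by
    // a uvarint column count. Returns {groupType, columnCount}; the
    // name/type/flag/value arrays follow in the buffer.
    public static long[] readHeader(ByteBuffer buf) {
        int groupType = buf.get() & 0xFF;
        long columns = getUvarint(buf);
        return new long[] {groupType, columns};
    }
}
```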
DDL layout
- n bytes: type, format: uvarint
- n bytes: query, format: string
Size tables layout
- n bytes: table storing sizes of serialized keys
- n bytes: table storing sizes of values
- n bytes: tables storing sizes of serialized column groups
- n bytes: size of serialized size tables, format: reversed uvarint
Size table layout
- n bytes: number of elements, format: uvarint
- n bytes: repeated elements, format: uvarint
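The trailing "reversed uvarint" is what lets a reader locate the size tables without a forward scan: the size of the serialized size tables sits at the very end of the message, with its bytes stored back to front. One plausible reading, under the assumption that the LEB128 bytes are simply written in reverse order so that scanning backwards from the final byte reproduces a normal forward decode:

```java
public class ReversedUvarint {
    // Read a uvarint whose bytes were written back-to-front at the end of
    // the buffer. Returns {value, byteCount}; the size tables would then
    // start at buf.length - byteCount - value.
    public static long[] read(byte[] buf) {
        long v = 0;
        int shift = 0;
        int i = buf.length - 1;
        while (true) {
            byte b = buf[i--];
            v |= (long) (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break;
            }
        }
        return new long[] {v, buf.length - 1 - i};
    }
}
```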
Some tests demonstrate that the same events serialized in this binary format can be smaller than the JSON format with zlib compression.