pingcap / tiflow

This repo maintains DM (a data migration platform) and TiCDC (change data capture for TiDB).
Apache License 2.0

Distinguish data source when merging sharded tables with no shard key #3340

Closed: yufan022 closed this issue 1 year ago

yufan022 commented 3 years ago

Feature Request

Is your feature request related to a problem? Please describe:

Hi, we are trying to use DM to merge the sharded tables of AWS Aurora into a TiDB cluster managed by TiDB Operator.

But our scenario is a little special: our sharded tables don't have a shard key; we use the table name to split the data.

So if we merge the source tables order_btcusdt, order_ethusdt, ... into the target table order_collect, we can't tell which orders are from btcusdt and which are from ethusdt, because there is no shard column.

Can DM extract the table name as an extended column of the target table? Or is there a better solution?

Thanks.

Describe the feature you'd like:

Merge sharded tables without partition keys by extracting the table name into a column of the target table.

Describe alternatives you've considered:

Teachability, Documentation, Adoption, Migration Strategy:

lance6716 commented 3 years ago

Hi, thanks for your suggestion. This column will also enable many other features, so it needs careful design. We'll update our thoughts in this issue, and you're welcome to discuss with us 😄

yufan022 commented 3 years ago

> Hi, thanks for your suggestion. This column will also enable many other features, so it needs careful design. We'll update our thoughts in this issue, and you're welcome to discuss with us 😄

Great. Is there a plan to support this feature? We will try to implement it internally first. Could you give us some help and suggestions?

lance6716 commented 3 years ago

Hi, currently we don't have enough manpower for this feature, so we'd be really happy to work together on implementing it.

Here are our suggestions:

  1. For general purposes, the (source-id, upstream-database, upstream-table) tuple contains enough information to distinguish one row among many shard tables. In your case you may only want (upstream-table); maybe we can add an item in the task configuration to let the user specify the format of the added column's values.
  2. It's better to provide a separate function to generate the value of the added column. We may need it for other purposes.
  3. There are many internal stages during binlog replication. Attaching this column should happen at a very late stage, for example when generating the DML SQL: https://github.com/pingcap/dm/blob/8e3fc124a2df9d76e88146facb8a824671b06f8e/syncer/syncer.go#L2269 (you can find genUpdateSQLs and genDeleteSQLs in the lines below it). This will keep the added column from conflicting with other features. A minimal sketch follows this list.
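
A minimal Go sketch of suggestions 2 and 3, assuming a hypothetical helper API rather than DM's actual internals: the column value comes from a standalone, reusable function, and it is attached to the row only at the late, DML-generation stage.

package main

import (
	"fmt"
	"regexp"
)

// extractColumnValue returns the first capture group of re applied to name,
// or "" when the pattern does not match. Keeping this as a separate function
// mirrors suggestion 2: the value generation is reusable for other purposes.
func extractColumnValue(re *regexp.Regexp, name string) string {
	m := re.FindStringSubmatch(name)
	if len(m) < 2 {
		return ""
	}
	return m[1]
}

// appendExtractedColumn attaches the value as the last column of a row, as
// would happen at a very late stage such as DML generation (suggestion 3).
func appendExtractedColumn(row []interface{}, value string) []interface{} {
	return append(row, value)
}

func main() {
	re := regexp.MustCompile(`order_([0-9a-zA-Z]+)`)
	row := []interface{}{1001, "filled"}
	row = appendExtractedColumn(row, extractColumnValue(re, "order_btcusdt"))
	fmt.Println(row) // [1001 filled btcusdt]
}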

If you are willing to contribute this feature to the DM repo, we maintainers can check bugs in the implementation, keep the logic consistent with other features as the code grows, and give concrete, detailed explanations of the internal stages. We really look forward to your contribution.

lance6716 commented 2 years ago

I'll list some scenarios for this feature:

So we plan to add a configuration item so that the source-id (MySQL instance ID), database name, and table name can be appended to records when migrating. The requirement is

PTAL if it's OK @yufan022

yufan022 commented 2 years ago

> For shard merging replication, if the upstream tables use an AUTO_INCREMENT primary key, conflicts are highly likely after merging, because we can't tell records (with auto-increment PKs) of one shard from another.

The common practice in this scenario is to create the downstream table structure in advance, drop the original primary key constraint, and use the extended column together with the original primary key as a composite primary key.

Scenario example:

# upstream
CREATE TABLE `order_xxx` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  ...
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

# downstream
CREATE TABLE `collection_order` (
  `id` bigint(20) unsigned NOT NULL,
  ...
  `s_table` varchar(64) NOT NULL,
  PRIMARY KEY (`id`,`s_table`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

> In order to support flexible usage, such as querying on a specified "database name" or "database name + table name", each piece of appended information must have its own exclusive column. This also satisfies the first normal form (1NF).

Extracting the source / schema / table into dedicated columns of downstream tables makes querying and adding indexes more flexible, and the meaning of each column is clearer. Although this costs extra storage space and indexes, we think it is worth it. Users can configure which columns to extract, up to 3 columns at present.
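
As an illustration of that flexibility, here is a minimal Go sketch querying the merged downstream table by origin shard, assuming a hypothetical DSN and that the extracted table name is stored in s_table (as in the schema above):

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL-protocol driver, works with TiDB
)

func main() {
	// Hypothetical DSN pointing at the downstream TiDB cluster.
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:4000)/core")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Because the upstream table name lives in its own column, filtering the
	// merged table by origin shard is a plain (and indexable) predicate.
	rows, err := db.Query(
		"SELECT id FROM collection_order WHERE s_table = ?", "order_btcusdt")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id uint64
		if err := rows.Scan(&id); err != nil {
			log.Fatal(err)
		}
		fmt.Println(id)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}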

> A user may need only part of the table name, for example "us_west" out of "us_west_auth_log", so the configuration should provide a way to extract the needed information.

In the current design, because we are not sure how many complex scenarios there are, we use regular expressions to adapt to more of them.

Task configuration example:

routes:
  order-route:
    schema-pattern: "core"
    table-pattern: "order_*"
    target-schema: "core"
    target-table: "collection_order"
    extract-table:
      table-regexp: "order_([0-9a-zA-Z]+)" # result: order_a -> a
      target-column: "symbol"
    extract-schema:
      schema-regexp: "([0-9a-zA-Z]+)" # result: core -> core
      target-column: "schema"
    extract-source:
      source-regexp: "mysql-replica-(.*)" # result: mysql-replica-01 -> 01
      target-column: "source_s"
  trade-route: # extraction is not enabled; this route must keep working as before
    schema-pattern: "core"
    table-pattern: "trade_*"
    target-schema: "core"
    target-table: "collection_trade"
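
A quick Go sketch, under the assumption that the first capture group of each regexp becomes the value written to target-column, checking that the patterns above produce the commented results:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Patterns copied from the route config above.
	cases := []struct{ pattern, input string }{
		{`order_([0-9a-zA-Z]+)`, "order_btcusdt"},  // symbol   -> btcusdt
		{`([0-9a-zA-Z]+)`, "core"},                 // schema   -> core
		{`mysql-replica-(.*)`, "mysql-replica-01"}, // source_s -> 01
	}
	for _, c := range cases {
		m := regexp.MustCompile(c.pattern).FindStringSubmatch(c.input)
		fmt.Printf("%s -> %s\n", c.input, m[1])
	}
}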

@lance6716 Thanks for the help. If anything is unreasonable, you're welcome to discuss it with us.

lichunzhu commented 1 year ago

Closed by https://github.com/pingcap/tiflow/pull/3262