pingcap / tidb

TiDB is an open-source, cloud-native, distributed, MySQL-compatible database for elastic scale and real-time analytics. Try AI-powered Chat2Query free at https://www.pingcap.com/tidb-serverless/
https://pingcap.com
Apache License 2.0

Lightning Dedup Introduces Fail-Fast Mechanism #40743

Open dsdashun opened 1 year ago

dsdashun commented 1 year ago

Enhancement

Currently, when Lightning de-duplicates, it scans the table records and handles ALL of the duplicates. This is time-consuming when many duplicated rows exist after Lightning imports KVs into TiDB (for example, because the wrong columns were chosen as unique keys, or because of this bug). In these situations, resolving ALL the duplicated records is not sensible.
If a large number of duplicated records need resolving, it indicates a problem with the data itself. Lightning should fail fast and let users check the data, which saves the de-dup time.

lance6716 commented 1 year ago

I remember Lightning has a max-error configuration. Maybe we can treat it as a limit for deduplication.

dsdashun commented 1 year ago

Currently, max-error only accepts an integer, and that integer applies only to the number of type errors. The limit on conflict errors is set to MaxInt64 by default. Maybe we can change the max-error format to support different kinds of errors, like this:

[lightning.max-error]
type = 1000
conflict = 10000
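
To make the format change concrete, here is a minimal Go sketch of how an integer-or-table max-error value might be interpreted. It is not Lightning's actual code: the MaxError struct, the ParseMaxError function, and the default of 0 for type errors are all hypothetical, and it assumes the TOML decoder hands back an int64 for a bare integer and a map[string]any for a table.

```go
package main

import (
	"fmt"
	"math"
)

// MaxError holds one limit per error kind. The shape follows the proposal
// in this thread; it is a hypothetical type, not Lightning's real config.
type MaxError struct {
	Type     int64
	Conflict int64
}

// ParseMaxError interprets an already-decoded `max-error` value.
// A bare integer keeps the legacy meaning (limit on type errors only);
// a table may set `type` and `conflict` independently.
func ParseMaxError(v any) (MaxError, error) {
	// Assumed defaults: no type errors tolerated, conflicts unlimited
	// (MaxInt64, as described in the thread).
	cfg := MaxError{Type: 0, Conflict: math.MaxInt64}
	switch val := v.(type) {
	case int64:
		if val < 0 {
			return cfg, fmt.Errorf("max-error must be non-negative, got %d", val)
		}
		cfg.Type = val
	case map[string]any:
		for key, raw := range val {
			n, ok := raw.(int64)
			if !ok || n < 0 {
				return cfg, fmt.Errorf("max-error.%s must be a non-negative integer", key)
			}
			switch key {
			case "type":
				cfg.Type = n
			case "conflict":
				cfg.Conflict = n
			default:
				return cfg, fmt.Errorf("unknown max-error kind %q", key)
			}
		}
	default:
		return cfg, fmt.Errorf("max-error must be an integer or a table, got %T", v)
	}
	return cfg, nil
}
```

Accepting both forms keeps existing configs with a plain integer working, while the table form lets the conflict limit be set separately.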

Also, we need a mechanism that stops the de-dup process and reports an error if the conflicting data exceeds X% of the total records.
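
As a rough illustration of such a fail-fast check, the Go sketch below counts conflicts as they are found and returns an error once either an absolute limit or a fraction of the total rows is exceeded. Every name here (ConflictGuard, Record, ErrTooManyConflicts) is hypothetical and not taken from the Lightning code base, and it assumes the total row count is known before de-dup starts.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrTooManyConflicts signals that de-dup should stop early.
var ErrTooManyConflicts = errors.New("too many conflicting records")

// ConflictGuard tracks conflicts seen so far against two thresholds.
// Hypothetical helper for illustration only.
type ConflictGuard struct {
	MaxConflicts int64   // absolute cap on conflicting records
	MaxRatio     float64 // cap as a fraction of TotalRows, e.g. 0.05 for 5%; 0 disables it
	TotalRows    int64   // assumed to be known up front
	seen         int64
}

// Record adds n newly found conflicts and returns an error wrapping
// ErrTooManyConflicts as soon as either threshold is exceeded.
func (g *ConflictGuard) Record(n int64) error {
	g.seen += n
	if g.seen > g.MaxConflicts {
		return fmt.Errorf("%w: %d exceeds limit %d",
			ErrTooManyConflicts, g.seen, g.MaxConflicts)
	}
	if g.MaxRatio > 0 && g.TotalRows > 0 &&
		float64(g.seen) > g.MaxRatio*float64(g.TotalRows) {
		return fmt.Errorf("%w: %d of %d rows exceeds ratio %.2f",
			ErrTooManyConflicts, g.seen, g.TotalRows, g.MaxRatio)
	}
	return nil
}
```

The de-dup loop would call Record after each batch of detected conflicts and abort on the first non-nil error, so the cost of a bad import is bounded by the threshold rather than by the number of duplicates.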

To wrap up, here's a rough design of the new mechanism for conflict errors:

okJiang commented 1 year ago

Has the document been updated?

shenli commented 1 year ago

The feature is not implemented properly and needs further work: https://github.com/pingcap/tidb/issues/42471