Open dsdashun opened 1 year ago
I remember Lightning has a `max-error` configuration; maybe we can treat it as a limit for deduplication.
Currently, `max-error` only accepts an integer and applies only to the number of type errors. The number of conflict errors is set to `MaxInt64` by default. Maybe we can change the `max-error` format to support different kinds of errors, like this:
[lightning.max-error]
type = 1000
conflict = 10000
Also, we need to introduce a mechanism so that, if conflict data exceeds X% of the total records, Lightning stops the de-dup process and reports an error.
To wrap up, here's a rough design of the new mechanism for conflict errors: if the number of conflict records exceeds `max(X, Y% * total records)`, stop the de-dup process and report an error.
Has the document been updated?
The function is not implemented properly and needs further work: https://github.com/pingcap/tidb/issues/42471
Enhancement
Currently, when Lightning performs de-dup, it scans the table records and handles ALL of them. This is time-consuming if there are many duplicated rows after importing KVs into TiDB from Lightning (for example, when the wrong columns are chosen as unique keys, or when hitting this bug). In these situations, resolving ALL the duplicated records is not sensible.
If there are a great many duplicated records to resolve, it indicates a problem with the data itself. Lightning should fail fast and let users check the data, saving the de-dup time.
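A minimal sketch of the fail-fast behavior requested above, written in Go. It assumes the scan can be modeled as a slice of per-record conflict flags; `countConflictsFailFast` and that input shape are hypothetical and do not reflect Lightning's real de-dup code path.

```go
package main

import "fmt"

// countConflictsFailFast walks the records and stops as soon as the number of
// conflicts seen exceeds limit, instead of scanning everything first.
// Each element of records is true when that record is a duplicate (hypothetical model).
func countConflictsFailFast(records []bool, limit int64) (int64, error) {
	var conflicts int64
	for _, isConflict := range records {
		if !isConflict {
			continue
		}
		conflicts++
		if conflicts > limit {
			// Abort early: the remaining records are never inspected.
			return conflicts, fmt.Errorf("conflict records exceed limit %d, aborting de-dup", limit)
		}
	}
	return conflicts, nil
}
```

The point of checking inside the loop is that the cost of a badly conflicting import is bounded by the limit rather than by the total number of duplicated rows.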