
[Sink to CloudStorage] as a solution to recover and repair data after mis-operations #10100

Open zhangjinpeng87 opened 1 year ago

zhangjinpeng87 commented 1 year ago

Is your feature request related to a problem?

Before TiCDC, TiDB users used tidb-binlog to replicate data changes from TiDB to downstream systems such as MySQL/TiDB/Kafka. As TiCDC has matured and its functionality has grown, we are going to sunset the legacy tidb-binlog project, since TiCDC is a better alternative when considering HA and its other benefits.

When a mis-operation happens, such as deleting a large range of rows from a table by mistake, traditional MySQL users rely on the MySQL binlog to recover the data without losing new valid changes. The typical steps are:

  1. Parse the binlog files and identify the mistaken operations.
  2. Generate the undo DML for those operations (a deletion's undo is an insertion, an update's undo is an update back to the old values, an insertion's undo is a deletion); a sketch of this step follows below.
  3. Execute the undo DML in the upstream database.
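To make step 2 concrete, here is a minimal sketch of undo-DML generation. It assumes each parsed change event is a dict with an `op` field (`I`/`U`/`D`), a `table` name, and `before`/`after` column maps; this event shape and the `pk()` helper are hypothetical, and real code would derive the primary key from the table schema and use parameterized statements instead of string quoting.

```python
# Minimal undo-DML generation sketch. The event layout (op/table/before/after)
# is a hypothetical representation of a parsed binlog or CSV change event.

def quote(v):
    # Naive SQL literal quoting, for illustration only; real code should use
    # parameterized statements.
    return "NULL" if v is None else "'" + str(v).replace("'", "''") + "'"

def pk(row):
    # Assume the primary key column is "id"; real code would read the schema.
    return {"id": row["id"]}

def undo_dml(event):
    table, before, after = event["table"], event.get("before"), event.get("after")
    if event["op"] == "D":   # undo a delete: re-insert the old row
        cols = ", ".join(before)
        vals = ", ".join(quote(v) for v in before.values())
        return f"INSERT INTO {table} ({cols}) VALUES ({vals});"
    if event["op"] == "I":   # undo an insert: delete the new row by primary key
        cond = " AND ".join(f"{k} = {quote(v)}" for k, v in pk(after).items())
        return f"DELETE FROM {table} WHERE {cond};"
    if event["op"] == "U":   # undo an update: restore the old values
        sets = ", ".join(f"{k} = {quote(v)}" for k, v in before.items())
        cond = " AND ".join(f"{k} = {quote(v)}" for k, v in pk(after).items())
        return f"UPDATE {table} SET {sets} WHERE {cond};"
```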

TiDB users can achieve a similar goal with tidb-binlog + reparo (https://docs.pingcap.com/tidb/stable/tidb-binlog-reparo). But with TiCDC, there is no end-to-end tool/solution that covers this scenario.

Describe the feature you'd like

TiDB users who adopt TiCDC as a tidb-binlog alternative sink the whole change stream (data changes and schema changes) to external storage such as S3 or a distributed file system in CSV format. Each CSV row carries commit-ts information indicating when the change happened upstream and which transaction it belongs to. When a mistaken operation occurs, users can use these CSV files to recover and repair data in the upstream cluster.
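As an illustration, here is a sketch of reading such CSV output and selecting the changes inside a commit-ts window. It assumes TiCDC's CSV layout with include-commit-ts enabled, where each row starts with the operation flag (I/U/D), table name, schema name, and commit-ts before the column values; verify the exact field order against your TiCDC version.

```python
# Sketch: filter TiCDC CSV change rows by a commit-ts range. Assumes each row is
#   op(I/U/D), table, schema, commit-ts, col1, col2, ...
import csv

def changes_in_range(path, start_ts, end_ts):
    with open(path, newline="") as f:
        for op, table, schema, commit_ts, *cols in csv.reader(f):
            if start_ts <= int(commit_ts) <= end_ts:
                yield {"op": op, "schema": schema, "table": table,
                       "commit_ts": int(commit_ts), "cols": cols}
```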

Having discussed this with one TiDB user, their data repair journey is as follows:

  1. TiCDC can record all data changes over time, including the old and new values, so it is possible to retrieve all data changes with specified filters/conditions (the filters are highly business-dependent).
  2. Import these time-ordered data changes into a big data system such as Spark, where the user can run Spark SQL to query/filter the changes they want (see the PySpark sketch after this list).
  3. Apply/undo all filtered changes (using the old values) against the upstream TiDB/database system.
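A PySpark sketch of step 2, under the same assumed CSV layout as above; the bucket path, the trailing data columns (`id`, `val`), and the commit-ts window are hypothetical placeholders:

```python
# Sketch: load TiCDC CSV output into Spark and filter the suspect changes with
# Spark SQL. Assumes every row has exactly the six columns named below.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ticdc-repair").getOrCreate()

df = (spark.read
      .option("inferSchema", "true")
      .csv("s3://my-bucket/ticdc/changes/*.csv")   # placeholder path
      .toDF("op", "table_name", "schema_name", "commit_ts", "id", "val"))

df.createOrReplaceTempView("changes")

# Find the mistaken deletions inside the incident window.
mistakes = spark.sql("""
    SELECT * FROM changes
    WHERE op = 'D'
      AND schema_name = 'app' AND table_name = 'orders'
      AND commit_ts BETWEEN 445000000000000000 AND 445000100000000000
""")
mistakes.show()
```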

For step 1, after https://github.com/pingcap/tiflow/issues/10167 TiCDC can output all data changes, including old values, to external storage. For step 2, I think we can probably make tidb2dw consume the output of step 1 and ingest time-ranged data changes into a specified table in a big data system, so that the user can perform the subsequent operations easily.

Describe alternatives you've considered

Make the sink-to-external-storage feature support a SQL format. But this way we would duplicate much of the work already done for the existing CSV format, and we could not reuse the work on the redo log and tidb2dw.

Teachability, Documentation, Adoption, Migration Strategy

No response

zhangjinpeng87 commented 1 year ago

I found that the TiCDC redo log doesn't reuse the sink-to-external-storage formats, which is odd, because both the external storage sink and the redo log write data changes and schema changes to files.

We can sink CSV/Canal-JSON to external storage, but there is no tool that can easily parse the resulting CSV or Canal-JSON files.

- redo log writer: writes data changes and schema changes to files (external storage)
- redo log reader: parses those files and replays the changes to the downstream MySQL/TiDB

If the redo log could reuse one of the sink-to-external-storage formats (such as CSV), then the redo log reader would effectively become a convenient tool to parse/filter/replay data changes and schema changes to the downstream MySQL/TiDB. This would be a better solution for the user scenario I mentioned above; a sketch of such a replay step follows.
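To illustrate the "reader as repair tool" idea, here is a sketch that replays generated undo DML to the upstream TiDB over the MySQL protocol. The connection parameters are placeholders, and `undo_statements` is assumed to hold the undo DML for the filtered changes (e.g. produced by a generator like the one sketched earlier):

```python
# Sketch: replay generated undo DML over the MySQL protocol. Undo is usually
# applied in reverse commit-ts order; ordering is the caller's responsibility.
import pymysql

def replay(undo_statements):
    conn = pymysql.connect(host="127.0.0.1", port=4000,
                           user="root", password="", database="app")
    try:
        with conn.cursor() as cur:
            for stmt in undo_statements:
                cur.execute(stmt)
        conn.commit()
    finally:
        conn.close()
```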

zhangjinpeng87 commented 1 year ago

Recently, two key TiDB customers reported the same requirement to us.

zhangjinpeng87 commented 1 year ago

More details about this scenario: sometimes the RTO is critical, and the user wants to recover/repair the data within a few minutes (e.g. 10 minutes); it is hard to achieve this by restoring to a new cluster/database and then repairing the data there. In one user's MySQL practice, the user identified the mistaken deletions in the MySQL binlog, generated undo DMLs (INSERTs), and applied those undo DMLs to the current MySQL instance.

zhangjinpeng87 commented 11 months ago


cc @flowbehappy @BenMeadowcroft

benmeadowcroft commented 11 months ago

With respect to the repair journey (steps 2 & 3), we should also take into consideration how we would do this without an external big data system, using just TiDB. Reversing some operations on TiDB may be simpler if we can query the change data on TiDB itself and then use that change data to apply/undo the filtered changes, without introducing as much data movement between systems (and type conversions, etc.) into the picture.
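For illustration, a sketch of that TiDB-only variant: stage the captured CSV changes in a TiDB table and filter them with plain SQL. The staging schema, S3 path, and ts window are hypothetical, and IMPORT INTO requires a TiDB version that supports it (LOAD DATA is an alternative on older versions):

```python
# Sketch: stage TiCDC CSV changes inside TiDB itself and filter with plain SQL,
# avoiding an external big data system. All names and values are placeholders.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=4000, user="root", database="repair")
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS staged_changes (
            op CHAR(1), table_name VARCHAR(64), schema_name VARCHAR(64),
            commit_ts BIGINT, id BIGINT, val VARCHAR(255)
        )
    """)
    cur.execute("IMPORT INTO staged_changes FROM 's3://my-bucket/ticdc/changes/*.csv'")
    cur.execute("""
        SELECT * FROM staged_changes
        WHERE op = 'D' AND commit_ts BETWEEN %s AND %s
    """, (445000000000000000, 445000100000000000))
    for row in cur.fetchall():
        print(row)
conn.close()
```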

I think avoiding a hard requirement on a big data system for repairing erroneous transactions, or for dealing with other data issues, will lower the barrier to entry to these features for users.