Feature Request: add support for syncing table row-differences after VDiff

mcrauwel commented 10 months ago

Feature Description

Currently we use VDiff to find differences in 2 version of a table being migrated through a MoveTables workflow. The tool reports if there are differences and if there are how many. The tool also provides a sample of the mismatched rows but this is limited to 10 rows. So if there are more than 10 rows difference we'd need to use other tools (such as pt-table-checksum) to re-diff the table chunks that are different. Afterwards we can re-run the pt-table-checksum with --chunk-size=1 to find the individual rows that are different, while VDiff could have already told us...

I would like to have an option for VDiff to print/save the PK's (or the index used for diffing the rows) with mismatching rows, so these could be used for reconciling the differences between the 2 tables...

I am ok with contributing code for implementing this, however it would be good to get the Vitess team's insights to identify the correct path forward for implementing this, such as where would you store/print this info? And what format you'd store it in...

Use Case(s)

This would allow the VDiff option to be used like a combination of pt-table-checksum and pt-table-sync

mattlord commented 10 months ago

@mcrauwel we would only care about vtctldclient for new features going forward. So it would be about making this a variable in the create command. This info is then stored in the json report in the vdiff_table sidecardb table.

mcrauwel commented 10 months ago

Hey @mattlord, would you just override that maxVDiffReportSampleRows value and store potentially the entire diff in the sidecar database (can be huge), or would we rather add an option like --store-mismatching-primary-keys and create a new column in that table in the sidecar database that just stores the PK's?

mcrauwel commented 10 months ago

I would be storing something like:

{
  "key": {
    "name": "<keyname/PRIMARY>",
    "fields": [
       <index-fields>
    ]
  },
  "values": [
    {
      "source": [...], 
      "target": [...]
    },
    { ... }
  ]
}

missing rows on source or target would be null in the relevant values... in most cases though source and target would be the same in values

mattlord commented 10 months ago

A JSON column in MySQL can be up to 1GiB. And you can already specify that you only want the PK values in the diff report using the --only-pks flag: https://vitess.io/docs/18.0/reference/programs/vtctldclient/vtctldclient_vdiff/vtctldclient_vdiff_create/

So I would start with simply making that max sample rows variable.

mcrauwel commented 10 months ago

makes sense! and in the docs we can recommend using --only-pks flag when increasing the value for max_sample_rows should I add an option all or treat max_sample_rows = 0 as no limit, being all rows?

deepthi commented 10 months ago

Call the flag something more descriptive please. --max-vdiff-report-rows? I think 0 is fine to specify all, and the current value (10) can be the default.

mattlord commented 10 months ago

Sorry for the delay. vdiff is unnecessary as that's the command the flag is for. But something like --max-report-sample-rows is good. And as @deepthi said, the 0 for unlimited is good (we use that pattern in various places).

vitessio / vitess

Feature Request: add support for syncing table row-differences after VDiff #14323

Feature Description

Use Case(s)