Open kennytm opened 1 year ago
The only way they differ is if the RocksDB snapshot mechanism break down or the checksum algorithm is changed. These are not something the users really care about. That is to say, running with --checksum=1 only wasted CPU on TiKV to prove something that can never be false.
Also notice that the client implementation may make them differ. Say, some regions were omitted during backup.
(BR uses a range tree and batched style to query TiKV, while checksumming queries TiKV region by region).
@YuJuncen
Also notice that the client implementation may make them differ. Say, some regions were omitted during backup.
Well if the local checksum and remote checksum are operating on different set of ranges, it is an implementation bug of BR which again the user cannot control besides upgrading BR to a fixed version.
In the early design, checksum were used to verify:
In practice, backup checksum mismatches have rarely occurred. To my recollection, there were only a few times issues were caused by physical imports during the backup.
Therefore, we plan to gradually remove this logic and disable the checksum after the table is backed up by default. we already done it on Cloud. I think we can remove it on Premise for next release.
Currently (v8.1) BR backup will only write the schema checksum ("TXNKV_CHECKSUM_ORIGIN") when we enable backup --checksum=1
, causing restore to ignore the E2E checksum validation even if we restore --checksum=1
.
Will we change the restore checksum to consider the sum of file checksum ("TXNKV_CHECKSUM_SST") when the schema checksum is missing from the backupmeta? The two should be equivalent.
Feature Request
Is your feature request related to a problem? Please describe:
Currently BR backup performs checksum twice, once during SST generation in TiKV, and once again after a table is backed up using TiKV Coprocessor. These two checksums are compared to verify correctness.
However, these checksums are calculated from the exact same source: the online TiKV database. The actual uploaded SST file could be corrupt but this won't be detected by the checksum.
The only way they differ is if the RocksDB snapshot mechanism break down or the checksum algorithm is changed. These are not something the users really care about. That is to say, running with
--checksum=1
only wasted CPU on TiKV to prove something that can never be false.Describe the feature you'd like:
backupmeta
regardless of the--checksum
flag, we can safely change the flag's default value tofalse
.br debug checksum
command to actually read the SST files to calculate that file's CRC64XOR etc. Currently it only compares the SHA256 hash.In the documentation recommend users to run
br debug checksum
after performingbr backup
to ensure the integrity of the uploaded backup archive.Describe alternatives you've considered:
A. Change the
--checksum
flag to mean actually performingbr debug checksum
. Note thatbr debug checksum
will need to download the SST files again so it will double the I/O cost of the target cloud storage.B. Don't calculate CRC64XOR, keeping the current implementation of
br debug checksum
. Checking SHA256 should be sufficient to ensure the file content is intact.Teachability, Documentation, Adoption, Migration Strategy:
For the public API since we are only changing the default value of a flag this is fully compatible.
The
br debug
command is currently hidden andbr debug checksum
doubly so. They need to be revealed and documented.