pingcap / tidb

TiDB is an open-source, cloud-native, distributed, MySQL-Compatible database for elastic scale and real-time analytics. Try AI-powered Chat2Query free at : https://www.pingcap.com/tidb-serverless/
https://pingcap.com
Apache License 2.0
36.97k stars 5.82k forks source link

BR restore txn report ErrRestoreInvalidRange error, 'startKey > endKey, endKey 0000000000000000f7', when restore region's endKey is "" #56228

Open SonglinLife opened 3 days ago

SonglinLife commented 3 days ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. TiKV has region endKey is ""
  2. backup txn (sucess)
  3. restore txn (failed)

2. What did you expect to see? (Required)

restore txn always success.

3. What did you see instead (Required)

restore txn failed with error, report startKey > endKey, endKey was 0000000000000000f7. Due to tikv encode rule, empty byte slice will encode as 0000000000000000f7. And also I checked my tikv cluster, it did have a region, which endKey was "".

Same Issue also seen inRestore txn kv fails and reports ErrRestoreInvalidRange #52574 , although this issue was resolved by pr https://github.com/pingcap/tidb/commit/80d4dec1c07038cf8f81746158ebca5c28720def.

but I find the br function SplitKeysAndScatter in br/pkg/restore/split/client.go https://github.com/pingcap/tidb/blob/master/br/pkg/restore/split/client.go#L536-L566 also encode the lastKey without check the lastKey is empty slices. and then it call PaginateScanRegion, which throw the error.

I guess It was some issue like https://github.com/pingcap/tidb/issues/52574

report error:

[2024/09/23 13:33:05.816 +08:00] [ERROR] [main.go:38] ["br failed"] [error="startKey > endKey, startKey: 63616665fd448403ff0800000000000000ff000000002aff0000fd, endkey: 0000000000000000f7: [BR:Common:ErrInvalidRange]invalid restore range"] [errorVerbose="[BR:Common:ErrInvalidRange]invalid restore range\nstartKey > endKey, startKey: 63616665fd448403ff0800000000000000ff000000002aff0000fd, endkey: 0000000000000000f7\ngithub.com/pingcap/tidb/br/pkg/restore/split.PaginateScanRegion\n\t/workspace/source/tidb/br/pkg/restore/split/split.go:100\ngithub.com/pingcap/tidb/br/pkg/restore/split.(*pdClient).SplitKeysAndScatter.func1\n\t/workspace/source/tidb/br/pkg/restore/split/client.go:566\ngithub.com/pingcap/tidb/br/pkg/utils.WithRetryReturnLastErr\n\t/workspace/source/tidb/br/pkg/utils/retry.go:94\ngithub.com/pingcap/tidb/br/pkg/restore/split.(*pdClient).SplitKeysAndScatter\n\t/workspace/source/tidb/br/pkg/restore/split/client.go:558\ngithub.com/pingcap/tidb/br/pkg/restore/internal/snap_split.(*RegionSplitter).executeSplitByKeys\n\t/workspace/source/tidb/br/pkg/restore/internal/snap_split/split.go:89\ngithub.com/pingcap/tidb/br/pkg/restore/internal/snap_split.(*RegionSplitter).executeSplitByRanges\n\t/workspace/source/tidb/br/pkg/restore/internal/snap_split/split.go:72\ngithub.com/pingcap/tidb/br/pkg/restore/internal/snap_split.(*RegionSplitter).ExecuteSplit\n\t/workspace/source/tidb/br/pkg/restore/internal/snap_split/split.go:48\ngithub.com/pingcap/tidb/br/pkg/restore/snap_client.(*SnapClient).SplitPoints\n\t/workspace/source/tidb/br/pkg/restore/snap_client/tikv_sender.go:352\ngithub.com/pingcap/tidb/br/pkg/task.RunRestoreTxn\n\t/workspace/source/tidb/br/pkg/task/restore_txn.go:91\nmain.runRestoreTxnCommand\n\t/workspace/source/tidb/br/cmd/br/restore.go:136\nmain.newTxnRestoreCommand.func1\n\t/workspace/source/tidb/br/cmd/br/restore.go:235\ngithub.com/spf13/cobra.(*Command).execute\n\t/root/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/root/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\t/root/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\t/workspace/source/tidb/br/cmd/br/main.go:36\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"] [stack="main.main\n\t/workspace/source/tidb/br/cmd/br/main.go:38\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267"]

4. What is your TiDB version? (Required)

Only Tikv and pd (v5.0.6)

br(v8.4.0-nigthly)

lance6716 commented 3 days ago

Only Tikv and pd (v5.0.6)

Please use LTS version

lance6716 commented 3 days ago

I guess the problem is caused by

https://github.com/pingcap/tidb/blob/0e7c06cac1f2385fea2c5fd93979fb6e4d2cc522/br/pkg/task/restore_txn.go#L91

Here sortedSplitKeys is assigned from getEndKeys(ranges). End keys may not be a key to split region, especially it's "".

lance6716 commented 3 days ago

PTAL @Leavrth

SonglinLife commented 3 days ago

Please use LTS version

thanks for your suggestion, We will upgrade as soon as we can.

Here sortedSplitKeys is assigned from getEndKeys(ranges). End keys may not be a key to split region, especially it's "".

I commented out this line and things work fine for me, it successfully restored the txn data. But I am not know the potential negative impact of removing the split code. Could there be any adverse effects?

Leavrth commented 3 days ago

I commented out this line and things work fine for me, it successfully restored the txn data. But I am not know the potential negative impact of removing the split code. Could there be any adverse effects?

This line will split regions to avoid region's data size too large. So if this line is commented, the region won't be split, and the data will be restored into one region. If the restored data size is not large, you can wait until regions are split automatically.

It's better to skip the empty EndKey in the function getEndKeys

https://github.com/pingcap/tidb/blob/9785cddc788caeaf68525de4d889bd38dd1181e1/br/pkg/task/restore_raw.go#L174-L180

func getEndKeys(ranges []rtree.RangeStats) [][]byte { 
    endKeys := make([][]byte, 0, len(ranges)) 
    for _, rg := range ranges { 
                 if len(rg.EndKey) == 0 {
                    continue
                 }
        endKeys = append(endKeys, rg.EndKey) 
    } 
    return endKeys 
 } 
lance6716 commented 1 day ago

Hi @SonglinLife do you have time to fix this problem?

SonglinLife commented 21 hours ago

Really sorry for the late reply.

Yes, I do want to resolve this bug. Does it only need to ignore the last key? Recently, I dug into the BR project and read the code, and it is hard for me to understand it.

I also see some other bugs in the BR project, like it retries backup but doesn't reset the progress bar. https://github.com/pingcap/tidb/blob/119e76552731095cf39c2b17e10bb9fdc4d7c542/br/pkg/backup/client.go#L355 It really confused me at the beginning because I saw the backup progress bar reach 100% but it didn't stop the backup(it start a new round).

And I also struggle to figure out why the BR backup restarts rounds infinitely (starting 5 rounds). Can you give me some hints? I found in the BR code that it checks if there is an incomplete range. If none, then the main loop will stop.

https://github.com/pingcap/tidb/blob/119e76552731095cf39c2b17e10bb9fdc4d7c542/br/pkg/backup/client.go#L216-L221

before start backup, It get range information by listdb. https://github.com/pingcap/tidb/blob/9dff38ba98405422cb0eb15993f385efe9068b47/br/pkg/backup/client.go#L750-L755 But I use the TiKV and PD only, not with TiDB. So the first incomplete range is <"", "">. And the BR tree data structure will fill the incomplete range when TiKV backs up a region successfully. So if the first successful backup region is <a,b>, then the incomplete range will be <"", a> and <b, "">.

If TiKV have some gap between two adjacent regions, like <a, b> and <e,f>, the BR will think <c,e> as an incomplete region, and never stop the main loop.

I am a totally new user of TiDB, and it is really hard without your guys help. I do really want to improve the BR tools.

lance6716 commented 19 hours ago

Yes, I do want to resolve this bug. Does it only need to ignore the last key? Recently, I dug into the BR project and read the code, and it is hard for me to understand it.

Only ignore the empty key (zero-length key). BR wants to split regions based on the ranges boundaries, but if a range use max key as end key, we can't split on it.

For rest problem, please open separate issues for them.

SonglinLife commented 18 hours ago

thinks for your reply, before I open a new issue I will read code file to understand more detail. In practice, we force br backup txn stop at round 2 and retore txn, it work fine on a prod tikv cluster. but there must be some thing unusual. yes, lets discuss in another issue.

and I also open a new request, for this issue base on this discussion.