Open SonglinLife opened 3 days ago
Only Tikv and pd (v5.0.6)
Please use LTS version
I guess the problem is caused by
Here sortedSplitKeys is assigned from getEndKeys(ranges). End keys may not be a key to split region, especially it's "".
PTAL @Leavrth
Please use LTS version
Thanks for your suggestion. We will upgrade as soon as we can.
Here sortedSplitKeys is assigned from getEndKeys(ranges). End keys may not be a key to split region, especially it's "".
I commented out this line and things work fine for me: it successfully restored the txn data. But I don't know the potential negative impact of removing the split code. Could there be any adverse effects?
This line splits regions so that no single region's data size grows too large. If the line is commented out, the regions won't be split and the data will be restored into one region. If the restored data size is not large, you can wait until the regions are split automatically.
It's better to skip the empty EndKey in the function getEndKeys:
func getEndKeys(ranges []rtree.RangeStats) [][]byte {
	endKeys := make([][]byte, 0, len(ranges))
	for _, rg := range ranges {
		// An empty EndKey means "up to the max key", so it cannot be used as a split key.
		if len(rg.EndKey) == 0 {
			continue
		}
		endKeys = append(endKeys, rg.EndKey)
	}
	return endKeys
}
Hi @SonglinLife do you have time to fix this problem?
Really sorry for the late reply.
Yes, I do want to resolve this bug. Does it only need to ignore the last key? Recently, I dug into the BR project and read the code, and it is hard for me to understand it.
I also see some other bugs in the BR project. For example, it retries the backup but doesn't reset the progress bar: https://github.com/pingcap/tidb/blob/119e76552731095cf39c2b17e10bb9fdc4d7c542/br/pkg/backup/client.go#L355 This really confused me at the beginning, because I saw the backup progress bar reach 100% but the backup didn't stop (it started a new round).
And I also struggle to figure out why the BR backup restarts rounds infinitely (starting 5 rounds). Can you give me some hints? I found in the BR code that it checks if there is an incomplete range. If none, then the main loop will stop.
Before starting the backup, it gets the range information via listdb.
https://github.com/pingcap/tidb/blob/9dff38ba98405422cb0eb15993f385efe9068b47/br/pkg/backup/client.go#L750-L755
But I use TiKV and PD only, without TiDB, so the first incomplete range is <"", "">. The BR tree data structure fills in the incomplete range when TiKV backs up a region successfully. So if the first successfully backed-up region is <a, b>, the incomplete ranges will be <"", a> and <b, "">.
If TiKV has a gap between two adjacent regions, like <a, b> and <e, f>, BR will treat <b, e> as an incomplete range and never stop the main loop.
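The gap reasoning above can be sketched in a few lines of Go. This is only an illustration of the idea, not BR's actual range tree: the Range type and the incompleteRanges function are hypothetical names, and the sketch assumes half-open ranges where an empty End means "no upper bound" and that all finished ranges lie inside the full range.

```go
package main

import (
	"bytes"
	"sort"
)

// Range is a half-open key range [Start, End).
// Assumption for this sketch: an empty End means "no upper bound".
type Range struct {
	Start, End []byte
}

// incompleteRanges (hypothetical helper) returns the parts of full
// that are not covered by any range in done.
func incompleteRanges(full Range, done []Range) []Range {
	// Work on a sorted copy so the caller's slice is left untouched.
	sorted := append([]Range(nil), done...)
	sort.Slice(sorted, func(i, j int) bool {
		return bytes.Compare(sorted[i].Start, sorted[j].Start) < 0
	})

	var gaps []Range
	cursor := full.Start // everything before cursor is already covered
	for _, r := range sorted {
		if bytes.Compare(cursor, r.Start) < 0 {
			// Nothing covers [cursor, r.Start): report it as incomplete.
			gaps = append(gaps, Range{Start: cursor, End: r.Start})
		}
		if len(r.End) == 0 {
			// r reaches the end of the key space, so nothing is left.
			return gaps
		}
		if bytes.Compare(r.End, cursor) > 0 {
			cursor = r.End
		}
	}
	// Report the tail after the last finished range, if any.
	if len(full.End) == 0 || bytes.Compare(cursor, full.End) < 0 {
		gaps = append(gaps, Range{Start: cursor, End: full.End})
	}
	return gaps
}
```

With full set to <"", ""> and finished ranges <a, b> and <e, f>, this returns <"", a>, <b, e> and <f, "">. If no region ever covers the span between b and e, that middle entry never disappears, which matches the never-ending loop described above.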
I am a totally new user of TiDB, and it is really hard without your help. I really do want to improve the BR tools.
Yes, I do want to resolve this bug. Does it only need to ignore the last key? Recently, I dug into the BR project and read the code, and it is hard for me to understand it.
Only ignore the empty key (zero-length key). BR wants to split regions based on the ranges' boundaries, but if a range uses the max key as its end key, we can't split on it.
For the remaining problems, please open separate issues.
Thanks for your reply. Before I open a new issue, I will read the code to understand more of the details. In practice, we force the BR txn backup to stop at round 2 and then restore txn, and it works fine on a prod TiKV cluster. But there must be something unusual going on. Yes, let's discuss it in another issue.
I also opened a new request for this issue, based on this discussion.
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
2. What did you expect to see? (Required)
restore txn always succeeds.
3. What did you see instead (Required)
restore txn failed with an error reporting startKey > endKey, where endKey was 0000000000000000f7. Due to the TiKV encoding rule, an empty byte slice is encoded as 0000000000000000f7. I also checked my TiKV cluster, and it did have a region whose endKey was "".
The same issue was also seen in "Restore txn kv fails and reports ErrRestoreInvalidRange" #52574. Although that issue was resolved by PR https://github.com/pingcap/tidb/commit/80d4dec1c07038cf8f81746158ebca5c28720def, I find that the BR function SplitKeysAndScatter in br/pkg/restore/split/client.go https://github.com/pingcap/tidb/blob/master/br/pkg/restore/split/client.go#L536-L566 also encodes the lastKey without checking whether the lastKey is an empty slice, and then calls PaginateScanRegion, which throws the error. I guess it is the same kind of issue as https://github.com/pingcap/tidb/issues/52574
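To make the 0000000000000000f7 value concrete, here is a minimal sketch of the memcomparable bytes encoding as I understand it: the data is cut into 8-byte groups, each padded with zero bytes and followed by a marker byte equal to 0xFF minus the pad count. The function names encodeBytes, encodeEndKey, hexOf and invalidRange are hypothetical and written for this illustration only. They are not the functions BR or TiKV actually use, and the range check is just an assumed form of the startKey > endKey test.

```go
package main

import (
	"bytes"
	"encoding/hex"
)

const (
	encGroupSize = 8          // payload bytes per encoded group
	encMarker    = byte(0xFF) // marker of a full group; reduced by the pad count
)

// encodeBytes is a simplified re-implementation of the memcomparable
// bytes encoding (an illustration, not the library code).
func encodeBytes(data []byte) []byte {
	out := make([]byte, 0, (len(data)/encGroupSize+1)*(encGroupSize+1))
	for idx := 0; idx <= len(data); idx += encGroupSize {
		remain := len(data) - idx
		padCount := 0
		if remain >= encGroupSize {
			out = append(out, data[idx:idx+encGroupSize]...)
		} else {
			// Last group: copy what is left and pad with zero bytes.
			padCount = encGroupSize - remain
			out = append(out, data[idx:]...)
			out = append(out, make([]byte, padCount)...)
		}
		out = append(out, encMarker-byte(padCount))
	}
	return out
}

// encodeEndKey (hypothetical) leaves an empty end key unencoded so that
// it keeps its "no upper bound" meaning.
func encodeEndKey(key []byte) []byte {
	if len(key) == 0 {
		return key
	}
	return encodeBytes(key)
}

// hexOf renders a key the way it appears in the error message.
func hexOf(b []byte) string { return hex.EncodeToString(b) }

// invalidRange is an assumed form of the startKey > endKey check:
// an empty end key is treated as unbounded and never invalid.
func invalidRange(start, end []byte) bool {
	return len(end) > 0 && bytes.Compare(start, end) > 0
}
```

Under these assumptions, hexOf(encodeBytes(nil)) gives 0000000000000000f7. The encoded empty key is no longer empty, and it sorts before the encoding of any non-empty start key, so invalidRange(encodeBytes([]byte("a")), encodeBytes(nil)) is true, while the same call with encodeEndKey(nil) as the end key is false.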
report error:
4. What is your TiDB version? (Required)
Only Tikv and pd (v5.0.6)
BR (v8.4.0-nightly)