pingcap / tidb

TiDB is an open-source, cloud-native, distributed, MySQL-compatible database for elastic scale and real-time analytics. Try AI-powered Chat2Query free at https://www.pingcap.com/tidb-serverless/
https://pingcap.com
Apache License 2.0

Restore: The BR restore failed in 7 hours with panic in the checksum phase #42192

Closed Yui-Song closed 5 months ago

Yui-Song commented 1 year ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Restore the ossinsight data with the BR nightly version:

tiup br:nightly restore db --db=gharchive_dev --pd "pd-peer.release-perftest-arm64-ddl-ossinsight-001-tps-1657128-1-676:2379" \
        --storage "s3://perftest/ossinsight-2tiflash" \
        --s3.endpoint "http://172.16.6.xx:9000" \
        --send-credentials-to-tikv=true \
        --check-requirements=false --checksum-concurrency 128

2. What did you expect to see? (Required)

The BR restore should finish in 50 minutes.

3. What did you see instead (Required)

The BR restore failed after 7 hours with a panic in the checksum phase.

[ERROR] [utils.go:679] ["The component `br` version v6.7.0-alpha-nightly-20230311 is not installed; downloading from repository.\ndownload http://172.16.5.134:8987/br-v6.7.0-alpha-nightly-20230311-linux-arm64.tar.gz 65.23 MiB / 65.23 MiB 100.00% ? MiB/sStarting component `br`: /root/.tiup/components/br/v6.7.0-alpha-nightly-20230311/br restore db --db=gharchive_dev --pd pd-peer.release-perftest-arm64-ddl-ossinsight-001-tps-1657128-1-676:2379 --storage s3://perftest/ossinsight-2tiflash --s3.endpoint http://172.16.6.59:9000 --send-credentials-to-tikv=true --check-requirements=false --checksum-concurrency 128\nDetail BR log in /tmp/br.log.2023-03-13T20.55.31+0800 \n
DataBase Restore <...........................................................................> 0.00%
DataBase Restore <...........................................................................> 
....
DataBase Restore <------------------------------------------------------------------|.......> 89.75%
DataBase Restore <------------------------------------------------------------------/.......> 89.75%
DataBase Restore <-------------------------------------------------------------------------> 100.00%
panic: runtime error: invalid memory address or nil pointer dereference\n[signal SIGSEGV: segmentation violation code=0x1 addr=0x78 pc=0x233d910]\n\ngoroutine 314041 [running]:\ngithub.com/tikv/pd/client.(*client).GetRegion(0x40011902d0, {0x55a0fa8, 0x40015f7920}, {0x403635f640, 0x1b, 0x1b}, {0x400ddabd88, 0x1, 0x1b?})\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20230309025512-47cd76ae5d67/client.go:632 +0x2e0\ngithub.com/tikv/client-go/v2/util.InterceptedPDClient.GetRegion({{0x560b098?, 0x40011902d0?}}, {0x55a0fa8, 0x40015f7920}, {0x403635f640, 0x1b, 0x1b}, {0x400ddabd88, 0x1, 0x1})\n\t/root/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.7-0.20230309100832-f555fdd2c9d8/util/pd_interceptor.go:100 +0x90\ngithub.com/tikv/client-go/v2/internal/locate.(*CodecPDClient).GetRegion(0x40015a4380, {0x55a0fa8, 0x40015f7920}, {0x403635f600?, 0x4004abace0?, 0x1000000000020?}, {0x400ddabd88, 0x1, 0x1})\n\t/root/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.7-0.20230309100832-f555fdd2c9d8/internal/locate/pd_codec.go:97 +0x98\ngithub.com/tikv/client-go/v2/internal/locate.(*RegionCache).loadRegion(0x40002c6000, 0x4001974510, {0x403635f600, 0x13, 0x1b}, 0x0)\n\t/root/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.7-0.20230309100832-f555fdd2c9d8/internal/locate/region_cache.go:1502 +0x2bc\ngithub.com/tikv/client-go/v2/internal/locate.(*RegionCache).findRegionByKey(0x40002c6000, 0x4001974510, {0x403635f600, 0x13, 0x1b}, 0x18?)\n\t/root/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.7-0.20230309100832-f555fdd2c9d8/internal/locate/region_cache.go:986 +0x318\ngithub.com/tikv/client-go/v2/internal/locate.(*RegionCache).LocateKey(0x401ea531d0?, 0x403635f540?, {0x403635f600?, 0x1b?, 0x4852400?})\n\t/root/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.7-0.20230309100832-f555fdd2c9d8/internal/locate/region_cache.go:941 +0x28\ngithub.com/pingcap/tidb/store/copr.(*RegionCache).SplitKeyRangesByLocations(0x4020cbb5a0, 0x4020cbb590, 0x40015f7980, 
0xffffffffffffffff)\n\t/var/lib/docker/jenkins/workspace/build-common/go/src/github.com/pingcap/br/store/copr/region_cache.go:135 +0x128\ngithub.com/pingcap/tidb/store/copr.(*RegionCache).SplitKeyRangesByBuckets(0x4001214000?, 0x1a9bf2c?, 0x7db58c0?)\n\t/var/lib/docker/jenkins/workspace/build-common/go/src/github.com/pingcap/br/store/copr/region_cache.go:185 +0x20\ngithub.com/pingcap/tidb/store/copr.buildCopTasks(0x4020cbb590, 0x40015f7980, 0x4004abb688)\n\t/var/lib/docker/jenkins/workspace/build-common/go/src/github.com/pingcap/br/store/copr/coprocessor.go:327 +0xf0\ngithub.com/pingcap/tidb/store/copr.(*CopClient).BuildCopIterator.func3({0x4021dfc6e0, 0x1, 0x1}, {0x0, 0x0, 0x0})\n\t/var/lib/docker/jenkins/workspace/build-common/go/src/github.com/pingcap/br/store/copr/coprocessor.go:147 +0xf4\ngithub.com/pingcap/tidb/kv.(*KeyRanges).ForEachPartitionWithErr(0x4006500540, 0x4004abb660)\n\t/var/lib/docker/jenkins/workspace/build-common/go/src/github.com/pingcap/br/kv/kv.go:455 +0xc8\ngithub.com/pingcap/tidb/store/copr.(*CopClient).BuildCopIterator(0x4001c25bc0, {0x55a0fa8?, 0x40015f7920}, 0x400cb2af20, 0x401cc09db8, 0x401cc09dd0?)\n\t/var/lib/docker/jenkins/workspace/build-common/go/src/github.com/pingcap/br/store/copr/coprocessor.go:161 +0x2c0\ngithub.com/pingcap/tidb/store/copr.(*CopClient).Send(0x36b3800?, {0x55a0f00, 0x40044e9400}, 0x400cb2af20, {0x45564a0?, 0x401cc09db8?}, 0x400af3d760)\n\t/var/lib/docker/jenkins/workspace/build-common/go/src/github.com/pingcap/br/store/copr/coprocessor.go:91 +0x1e4\ngithub.com/pingcap/tidb/distsql.Checksum({0x55a0f00, 0x40044e9400}, {0x558c420, 0x4001c25bc0}, 0x400cb2af20, {0x45564a0, 0x401cc09db8})\n\t/var/lib/docker/jenkins/workspace/build-common/go/src/github.com/pingcap/br/distsql/distsql.go:179 +0x6c\ngithub.com/pingcap/tidb/br/pkg/checksum.sendChecksumRequest({0x55a0f00, 0x40044e9400}, {0x558c420?, 0x4001c25bc0?}, 0x15?, 
0x14?)\n\t/var/lib/docker/jenkins/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/checksum/executor.go:257 +0x4c\ngithub.com/pingcap/tidb/br/pkg/checksum.(*Executor).Execute.func1()\n\t/var/lib/docker/jenkins/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/checksum/executor.go:347 +0xbc\ngithub.com/pingcap/tidb/br/pkg/utils.WithRetry({0x55a0f00, 0x40044e9400}, 0x4004abba60, {0x5588730, 0x401c53e540})\n\t/var/lib/docker/jenkins/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/utils/retry.go:52 +0x74\ngithub.com/pingcap/tidb/br/pkg/checksum.(*Executor).Execute(0x4019d96990, {0x55a0f00, 0x40044e9400}, {0x558c420, 0x4001c25bc0}, 0x4f7c0e0)\n\t/var/lib/docker/jenkins/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/checksum/executor.go:346 +0x190\ngithub.com/pingcap/tidb/br/pkg/restore.(*Client).execChecksum(0x4020663b30?, {0x55a0f00, 0x40044e9400?}, {0x400b210dc0, 0x4009e63ba0, 0x400968b7c0}, {0x558c420, 0x4001c25bc0}, 0x80, 0x1af13ac?)\n\t/var/lib/docker/jenkins/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/restore/client.go:1455 +0x564\ngithub.com/pingcap/tidb/br/pkg/restore.(*Client).GoValidateChecksum.func2.2()\n\t/var/lib/docker/jenkins/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/restore/client.go:1401 +0xec\ngithub.com/pingcap/tidb/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1()\n\t/var/lib/docker/jenkins/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/utils/worker.go:76 +0x68\ngolang.org/x/sync/errgroup.(*Group).Go.func1()\n\t/root/go/pkg/mod/golang.org/x/sync@v0.1.0/errgroup/errgroup.go:75 +0x5c\ncreated by golang.org/x/sync/errgroup.(*Group).Go\n\t/root/go/pkg/mod/golang.org/x/sync@v0.1.0/errgroup/errgroup.go:72 +0x9c"] 
[stack="github.com/pingcap/endless/testcase/perftest.runCMD\n\t/home/jenkins/agent/workspace/endless-master-build/testcase/perftest/utils.go:679\ngithub.com/pingcap/endless/testcase/perftest.glob..func1.1.1\n\t/home/jenkins/agent/workspace/endless-master-build/testcase/perftest/bench.go:506\ngithub.com/onsi/ginkgo/internal/leafnodes.(*runner).runSync\n\t/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/leafnodes/runner.go:113\ngithub.com/onsi/ginkgo/internal/leafnodes.(*runner).run\n\t/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/leafnodes/runner.go:64\ngithub.com/onsi/ginkgo/internal/leafnodes.(*ItNode).Run\n\t/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/leafnodes/it_node.go:26\ngithub.com/onsi/ginkgo/internal/spec.(*Spec).runSample\n\t/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/spec/spec.go:215\ngithub.com/onsi/ginkgo/internal/spec.(*Spec).Run\n\t/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/spec/spec.go:138\ngithub.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).runSpec\n\t/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/specrunner/spec_runner.go:200\ngithub.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).runSpecs\n\t/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/specrunner/spec_runner.go:170\ngithub.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).Run\n\t/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/specrunner/spec_runner.go:66\ngithub.com/onsi/ginkgo/internal/suite.(*Suite).Run\n\t/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/internal/suite/suite.go:79\ngithub.com/onsi/ginkgo.runSpecsWithCustomReporters\n\t/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/ginkgo_dsl.go:245\ngithub.com/onsi/ginkgo.RunSpecs\n\t/go/pkg/mod/github.com/onsi/ginkgo@v1.16.5/ginkgo_dsl.go:220\ngithub.com/pingcap/endless/testcase/perftest.TestFailover\n\t/home/jenkins/agent/workspace/endless-master-build/testcase/perftest/bench_suite_test.go:23\ntesting.tRunner\n\t/usr/local/go/src/testing/testing.go:1446"]
• Failure [23370.189 seconds]

4. What is your TiDB version? (Required)

The component `br` version v6.7.0-alpha-nightly-20230311

3pointer commented 1 year ago

This issue happened after the restore had already failed and started to exit. While the program was exiting, the checksum process got a nil PD client because the components shut down in an unexpected order.

After talking with @jebter, we think it is not critical. Leaving it as major and non-release-blocking.

Leavrth commented 1 year ago

This issue should be fixed by https://github.com/pingcap/tidb/pull/42329

BornChanger commented 5 months ago

@Yui-Song please double check if the problem is still there. If not, please close it.

Yui-Song commented 5 months ago

Since the issue has not been observed again, it will be closed.