pingcap / tidb

TiDB is an open-source, cloud-native, distributed, MySQL-compatible database for elastic scale and real-time analytics. Try AI-powered Chat2Query free at: https://www.pingcap.com/tidb-serverless/
https://pingcap.com
Apache License 2.0

TiDB Lightning failed with the error "requested pd is not leader of cluster" #38751

Open lilinghai opened 1 year ago

lilinghai commented 1 year ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

/tidb-lightning -pd-urls "tc-pd.e2e-htap-encryption-tps-1302571-1-900:2379" -tidb-host "tc-tidb.e2e-htap-encryption-tps-1302571-1-900" -tidb-port "4000" -tidb-user "root" -tidb-password "" -backend "local" -sorted-kv-dir "/tmp/sorted-kv-dir" -d "s3://nfs/tiflash/csv-tpcc-100?access-key=minioadmin&secret-access-key=minioadmin&endpoint=http%3a%2f%2fminio.pingcap.net%3a9000&force-path-style=true" -c "/lightning.yaml"

2. What did you expect to see? (Required)

The import succeeds.

3. What did you see instead (Required)

[2022/10/30 18:38:24.213 +00:00] [ERROR] [restore.go:1528] ["restore all tables data failed"] [takeTime=11m17.100654126s] [error="fetch tso from pd failed: rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"]
[2022/10/30 18:38:24.213 +00:00] [INFO] [restore.go:1171] ["everything imported, stopping periodic actions"]
[2022/10/30 18:38:24.213 +00:00] [ERROR] [restore.go:466] ["run failed"] [step=4] [error="fetch tso from pd failed: rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"]
[2022/10/30 18:38:24.213 +00:00] [ERROR] [restore.go:476] ["the whole procedure failed"] [takeTime=11m18.462109893s] [error="fetch tso from pd failed: rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"]
[2022/10/30 18:38:24.321 +00:00] [INFO] [checksum.go:459] ["service safe point keeper exited"]
[2022/10/30 18:38:24.321 +00:00] [ERROR] [main.go:103] ["tidb lightning encountered error stack info"] [error="fetch tso from pd failed: rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"] [errorVerbose="rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster\ngithub.com/tikv/pd/client.(*client).processTSORequests\n\t/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20221010134149-d50e5fe43f14/client.go:1092\ngithub.com/tikv/pd/client.(*client).handleDispatcher\n\t/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20221010134149-d50e5fe43f14/client.go:842\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\ngithub.com/tikv/pd/client.(*tsoRequest).Wait\n\t/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20221010134149-d50e5fe43f14/client.go:1308\ngithub.com/tikv/pd/client.(*client).GetTS\n\t/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20221010134149-d50e5fe43f14/client.go:1328\ngithub.com/pingcap/tidb/br/pkg/lightning/restore.(*tikvChecksumManager).Checksum\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/lightning/restore/checksum.go:315\ngithub.com/pingcap/tidb/br/pkg/lightning/restore.DoChecksum\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/lightning/restore/checksum.go:161\ngithub.com/pingcap/tidb/br/pkg/lightning/restore.(*TableRestore).postProcess\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/lightning/restore/table_restore.go:800\ngithub.com/pingcap/tidb/br/pkg/lightning/restore.(*Controller).restoreTables.func7.1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/lightning/restore/restore.go:1652\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\nfetch tso from pd failed"]

4. What is your TiDB version? (Required)

master

fubinzh commented 1 year ago

This is by design, per @niubell: the error happens in the checksum phase, and there is currently no retry mechanism during checksum. Changing this to an enhancement.

mittalrishabh commented 1 year ago

This is Rishabh; I work at Airbnb. We have seen Lightning fail during the checksum phase multiple times. I understand that checksum is an expensive operation and that retrying can be costly, but we should retry on errors like "region unavailable" or "PD can't fetch timeout". We have already filed support tickets for these issues:

  1. https://support.pingcap.com/hc/en-us/requests/1840
  2. https://support.pingcap.com/hc/en-us/requests/1832