pingcap / tidb

TiDB - the open-source, cloud-native, distributed SQL database designed for modern applications.
https://pingcap.com
Apache License 2.0
37.3k stars 5.85k forks source link

lightning failed with error “tidb lightning encountered error: fetch tso from pd failed: rpc error: code = Unknown desc = server not started” when pd rolling restart #50504

Open Lily2025 opened 10 months ago

Lily2025 commented 10 months ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

1、run lightning 2、pd rolling restart

2. What did you expect to see? (Required)

lightning can cuccess

3. What did you see instead (Required)

lightning failed with error “tidb lightning encountered error: fetch tso from pd failed: rpc error: code = Unknown desc = server not started” when pd rolling restart

Run Lightning failed. cmd start at 2024-01-16 04:03:22 cmd failed at 2024-01-16 04:10:03 stdout: Verbose debug logs will be written to /tmp/lightning.log.2024-01-15T20.03.22Z

+----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | # | CHECK ITEM | TYPE | PASSED | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 1 | Source data files size is proper | performance | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 2 | the checkpoints are valid | critical | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 3 | table schemas are valid | critical | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 4 | all importing tables on the target are empty | critical | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 5 | Cluster version check passed | critical | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 6 | Lightning has the correct storage permission | critical | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 7 | local disk resources are rich, estimate sorted data size 26.05GiB, local available is 3.399TiB | critical | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 8 | The storage space is rich, which TiKV/Tiflash is 5.396TiB/0B. The estimated storage space is 78.16GiB/0B. | performance | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 9 | TiKV stores (1, 13, 12, 14) contains more than 1000 empty regions respectively, which will greatly affect the import speed and suc | performance | false | | | cess rate | | | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 10 | Cluster region distribution is balanced | performance | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 11 | no CDC or PiTR task found | critical | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+

tidb lightning encountered error: fetch tso from pd failed: rpc error: code = Unknown desc = server not started

ha-test-lightning-tps-5970173-1-664.tar.gz

4. What is your TiDB version? (Required)

./tidb-server -V Release Version: v8.0.0-alpha Edition: Community Git Commit Hash: https://github.com/pingcap/tidb/commit/3a3237ee49ee9f5be4c5aefb1559b97586884786 Git Branch: heads/refs/tags/v8.0.0-alpha UTC Build Time: 2024-01-15 11:42:58 GoVersion: go1.21.5 Race Enabled: false Check Table Before Drop: false Store: unistore

./tidb-lightning -V Release Version: v8.0.0-alpha Git Commit Hash: https://github.com/pingcap/tidb/commit/3a3237ee49ee9f5be4c5aefb1559b97586884786 Git Branch: heads/refs/tags/v8.0.0-alpha Go Version: go1.21.5 UTC Build Time: 2024-01-15 11:39:25 Race Enabled: false

Lily2025 commented 10 months ago

/severity major /assign lance6716

Lily2025 commented 9 months ago

from @lance6716 need a function like IsLeaderChange can judge retry, which need to be provided by pd img_v3_0276_4955bf52-369b-4121-83d1-f838e53dfe8g

CabinfeverB commented 9 months ago

We have plans to introduce a retry mechanism within pd client. I'll follow up on this