pingcap / tidb

TiDB is an open-source, cloud-native, distributed, MySQL-Compatible database for elastic scale and real-time analytics. Try AI-powered Chat2Query free at : https://www.pingcap.com/tidb-serverless/
https://pingcap.com
Apache License 2.0
37.13k stars 5.83k forks source link

large reorg sql execute failed duiring upgrade from v8.3.0 to v8.4.0 #56757

Open apollodafoni opened 2 days ago

apollodafoni commented 2 days ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

tiup.yaml is as follows:

global:
  arch: amd64
  user: "tidb"
  ssh_port: 22
  deploy_dir: "/tiup/deploy"
  data_dir: "/tiup/data"
  enable_tls: false
server_configs:
  pd: {}
  tidb: {}
  tikv: {}
pd_servers:
  - host: pd-1-peer
  - host: pd-2-peer
  - host: pd-3-peer
tidb_servers:
  - host: tidb-1-peer
  - host: tidb-2-peer
  - host: tidb-3-peer
tikv_servers:
  - host: tikv-1-peer
  - host: tikv-2-peer
  - host: tikv-3-peer
monitoring_servers:
  - host: tiup-peer
    ng_port: 12020
grafana_servers:
  - host: tiup-peer
alertmanager_servers:
  - host: tiup-peer
tiup cluster deploy ddl_upgrade v8.3.0 tiup.yaml --format json -y
tiup cluster start ddl_upgrade --format json -y

After restore some data from S3, do large reorg:

alter table bill_detail add index idx1 (create_time, update_time, bill_code, order_code, assign_site_code, three_code, send_name, receive_name, send_mobile)

Duiring sql execution, do tiup upgrade:

tiup cluster upgrade ddl_upgrade v8.4.0-pre --wait-timeout 300 -y

2. What did you expect to see? (Required)

large reorg sql execute success after upgrade

3. What did you see instead (Required)

It seems like ddl job not paused duiring upgrade!

sql execute failed: Message: "receive Regions with no peer"

[2024/10/21 23:43:13.073 +08:00] [ERROR] [task_executor.go:536] [onError] [task-id=1] [task-type=backfill] [error="receive Regions with no peer"] [stack="github.com/pingcap/tidb/pkg/disttask/framework/taskexecutor.(*BaseTaskExecutor).onError\n\t/workspace/source/tidb/pkg/disttask/framework/taskexecutor/task_executor.go:536\ngithub.com/pingcap/tidb/pkg/disttask/framework/taskexecutor.(*BaseTaskExecutor).runSubtask\n\t/workspace/source/tidb/pkg/disttask/framework/taskexecutor/task_executor.go:418\ngithub.com/pingcap/tidb/pkg/disttask/framework/taskexecutor.(*BaseTaskExecutor).runStep\n\t/workspace/source/tidb/pkg/disttask/framework/taskexecutor/task_executor.go:374\ngithub.com/pingcap/tidb/pkg/disttask/framework/taskexecutor.(*BaseTaskExecutor).RunStep\n\t/workspace/source/tidb/pkg/disttask/framework/taskexecutor/task_executor.go:256\ngithub.com/pingcap/tidb/pkg/disttask/framework/taskexecutor.(*BaseTaskExecutor).Run\n\t/workspace/source/tidb/pkg/disttask/framework/taskexecutor/task_executor.go:236\ngithub.com/pingcap/tidb/pkg/disttask/framework/taskexecutor.(*Manager).startTaskExecutor.func1\n\t/workspace/source/tidb/pkg/disttask/framework/taskexecutor/manager.go:337\ngithub.com/pingcap/tidb/pkg/util.(*WaitGroupWrapper).RunWithLog.func1\n\t/workspace/source/tidb/pkg/util/wait_group_wrapper.go:171"]
[2024/10/21 23:43:13.073 +08:00] [ERROR] [task_executor.go:542] ["taskExecutor met first error"] [task-id=1] [task-type=backfill] [error="receive Regions with no peer"]

4. What is your TiDB version? (Required)

tiup upgrade tidb from v8.3.0 to v8.4.0-pre

apollodafoni commented 2 days ago

/severity critical /component ddl /assign @tangenta /label affects-8.4

lance6716 commented 2 days ago

https://github.com/tikv/client-go/blob/8dfa86b5d1dbd77b608b192bbf98132c79670706/internal/locate/region_cache.go#L2400

The error is here, found by search in "org" level in GitHub. I guess the reason is PD ScanRegion API does not have internal retry and caller forgets to handle it like

https://github.com/tikv/client-go/blob/8dfa86b5d1dbd77b608b192bbf98132c79670706/internal/locate/region_cache.go#L2218-L2225

lance6716 commented 2 days ago

Same cause as https://github.com/tikv/pd/issues/8442

cfzjywxk commented 2 days ago

If the region information is loaded from the local disk and the current leader has not yet reported a heartbeat to PD, the region information scanned at this time will not include the leader.

The lighting has encountered similar issues before https://github.com/pingcap/tidb/pull/52822.

Need to add retry logic when no leader region information is returned.