pingcap / tidb

TiDB is an open-source, cloud-native, distributed, MySQL-compatible database for elastic scale and real-time analytics.
https://pingcap.com
Apache License 2.0

Cluster update failed with check dist task failed, tidb_enable_dist_task is enabled #54061

Open 0xdeafbeef opened 3 months ago

0xdeafbeef commented 3 months ago

Bug Report

steps to reproduce:

0xdeafbeef commented 3 months ago

I can build TiDB without this check, but that's pretty scary.

https://github.com/pingcap/tidb/blob/master/pkg/session/bootstrap.go#L1387

dveeden commented 3 months ago

See also:

0xdeafbeef commented 3 months ago

Monkey-patching the executable to set the variable has resolved the issue. Perhaps TiDB should have a command-line flag to execute SQL commands before running all checks? Regardless, there shouldn't be such pitfalls.

lance6716 commented 3 months ago

Hi @0xdeafbeef, this is an expected check. Don't remove it, because if a tidb_enable_dist_task job is running, the job cannot ensure correctness. You should connect to an online TiDB node and set this variable to OFF.
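For readers who want to script that step, here is a minimal sketch in Python. It assumes a TiDB node is still running and reachable, so it has to happen before the cluster is stopped for an offline upgrade. The helper names `dist_task_enabled` and `disable_dist_task` are hypothetical, and the cursor can come from any MySQL-compatible DB-API driver. Only the system variable tidb_enable_dist_task itself comes from this thread.

```python
def dist_task_enabled(cursor) -> bool:
    """Return True if the global variable tidb_enable_dist_task is on.

    `cursor` is any DB-API cursor connected to a running TiDB node
    (hypothetical helper, not part of TiDB or TiUP).
    """
    cursor.execute("SELECT @@global.tidb_enable_dist_task")
    row = cursor.fetchone()
    # Depending on driver and server, the value may come back as 1/0 or ON/OFF.
    return str(row[0]).strip().upper() in ("1", "ON")


def disable_dist_task(cursor) -> bool:
    """Turn the variable off if it is on; return True if a change was made."""
    if not dist_task_enabled(cursor):
        return False
    cursor.execute("SET GLOBAL tidb_enable_dist_task = OFF")
    return True
```

With a driver such as PyMySQL (an assumption; any DB-API driver should work), one would open a connection to a live TiDB node, by default on port 4000, and call `disable_dist_task(conn.cursor())` before stopping the cluster. This only flips the variable; it does not check whether a distributed task is still in progress.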

dveeden commented 3 months ago

Note that this is documented on:

0xdeafbeef commented 3 months ago

> Note that this is documented on:
>
> * [docs.pingcap.com/tidb/stable/release-8.1.0#system-variables](https://docs.pingcap.com/tidb/stable/release-8.1.0#system-variables)
>
> * [docs.pingcap.com/tidb/stable/smooth-upgrade-tidb#limitations-on-user-operations](https://docs.pingcap.com/tidb/stable/smooth-upgrade-tidb#limitations-on-user-operations)

But it's not listed in the upgrade docs. Previously, all nasty consequences were documented here, e.g., it asked you to stop all DDL jobs, and to stop TiFlash if you were upgrading from some old version. This time, it's written in a separate document.

Also, I ran tiup cluster check and it said that everything was perfect. So, yeah, I should have read all the release notes, but previous smooth experiences gave me faith that all pitfalls would be caught before the upgrade.

0xdeafbeef commented 3 months ago

> Hi @0xdeafbeef, this is an expected check. Don't remove it, because if a tidb_enable_dist_task job is running, the job cannot ensure correctness. You should connect to an online TiDB node and set this variable to OFF.

I understand that the check is needed. The problem is that I ran the cluster check, stopped the cluster, and performed an offline upgrade. During startup, this check triggers and asks you to set the variable. You can't connect to a live TiDB node if all nodes are stopped. You can't roll back the cluster to the previous version. Furthermore, you can't set this variable using InitializeSQLFile because it runs after all checks. So, I had to patch TiDB to set this variable to OFF while booting up.

D3Hunter commented 3 months ago

We should link the limitations in upgrade-tidb-using-tiup. Also, it seems that tiup cluster check does NOT check whether the limitations are met.