pingcap / tidb

TiDB is an open-source, cloud-native, distributed, MySQL-compatible database for elastic scale and real-time analytics.
https://pingcap.com
Apache License 2.0

DDL job could not be consumed from the queue #14633

Closed thuyenptr closed 4 years ago

thuyenptr commented 4 years ago

Bug Report

  1. What did you do?

Executed DDL statements such as drop table and create table, for example:

drop table tidb_loader_voucher.checkpoint;

  2. What did you expect to see?

Drop table or create a table successfully.

  3. What did you see instead?

I use go-sql-driver in my service. When I execute `drop table tidb_loader_voucher.checkpoint;`, it prints the error log below:

level=info msg="remove checkpoint error: invalid connection\n" source="thanos.go:221"
[mysql] 2020/02/04 11:07:19 packets.go:36: unexpected EOF

It seems some DDL jobs are sitting in the queue and never get executed.


Some job is stuck, and any DDL statement I add after it will never be executed because of it.

> SHOW TABLES FROM tidb_loader_voucher;
Empty set (0.00 sec)

This indicates that the checkpoint table was deleted, so why didn't the job finish?

I think something went wrong at step 7.

  4. What version of TiDB are you using (tidb-server -V or run select tidb_version(); on TiDB)?

v3.0.1

Related report

shenli commented 4 years ago

Thanks for your report. To root-cause this issue, please find the DDL owner first and then check whether there is any error log.

To identify the DDL owner, use the `admin show ddl;` statement to check the DDL info: https://pingcap.com/docs/stable/faq/tidb/#how-to-view-the-progress-of-the-current-ddl-job

     OWNER: 1a1c4174-0fcd-4ba0-add9-12d08c4077dc
     SELF_ID: 1a1c4174-0fcd-4ba0-add9-12d08c4077dc

If OWNER is equal to SELF_ID, then that instance is the owner.

If there is only one tidb-server instance, the first step is not necessary.

Please grep for errors in the DDL owner instance's log.

bb7133 commented 4 years ago

Thanks for reporting!

> Please grep for errors in the DDL owner instance's log.

Can you provide the log of DDL owner instance, if it is not too large?

crazycs520 commented 4 years ago

@billhcmus, Thanks for reporting.

If you don't mind, you can send the log to my email: chenshuang@pingcap.com

thuyenptr commented 4 years ago

Thank you all for the fast replies. I will provide the error log as soon as possible.

thuyenptr commented 4 years ago

> Thanks for your report. To root-cause this issue, please find the DDL owner first and then check whether there is any error log.
>
> To identify the DDL owner, use the `admin show ddl;` statement to check the DDL info: https://pingcap.com/docs/stable/faq/tidb/#how-to-view-the-progress-of-the-current-ddl-job
>
>      OWNER: 1a1c4174-0fcd-4ba0-add9-12d08c4077dc
>      SELF_ID: 1a1c4174-0fcd-4ba0-add9-12d08c4077dc
>
> If OWNER is equal to SELF_ID, then that instance is the owner.
>
> If there is only one tidb-server instance, the first step is not necessary.
>
> Please grep for errors in the DDL owner instance's log.

Hi, when I execute `admin show ddl` on the 3 TiDB server instances, it shows:

> admin show ddl;
+------------+--------------------------------------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+------------------------------------------------------+
| SCHEMA_VER | OWNER_ID                             | OWNER_ADDRESS | RUNNING_JOBS                                                                                                                                                                                                | SELF_ID                              | QUERY                                                |
+------------+--------------------------------------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+------------------------------------------------------+
|       2119 | b202210f-c26a-49f1-90a1-d99f488d62eb | :4000         | ID:3467, Type:drop table, State:running, SchemaState:delete only, SchemaID:3281, TableID:3459, RowCount:0, ArgLen:0, start time: 2020-02-03 16:50:24.51 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0 | 9a2519ff-1a54-4012-b487-a0bc8d73ef79 | DROP TABLE IF EXISTS tidb_loader_voucher.checkpoint; |
+------------+--------------------------------------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+------------------------------------------------------+
> admin show ddl;                                                                                                                                       
+------------+--------------------------------------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+------------------------------------------------------+                                                                                                                                 
| SCHEMA_VER | OWNER_ID                             | OWNER_ADDRESS | RUNNING_JOBS                                                                                                                                                                                                | SELF_ID                              | QUERY                                                |                                                                                                                                 
+------------+--------------------------------------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+------------------------------------------------------+                                                                                                                                 
|       2119 | b202210f-c26a-49f1-90a1-d99f488d62eb | :4000         | ID:3467, Type:drop table, State:running, SchemaState:delete only, SchemaID:3281, TableID:3459, RowCount:0, ArgLen:0, start time: 2020-02-03 16:50:24.51 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0 | b9de0dc0-3797-4513-90e3-4e5274e3d4e3 | DROP TABLE IF EXISTS tidb_loader_voucher.checkpoint; |                                                                                                                                 
+------------+--------------------------------------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+------------------------------------------------------+  

No OWNER is equal to SELF_ID in the above result. I think this is the cause of the problem: the worker in each instance has no claim to execute the pending DDL. One more thing: a few days ago, before I executed the DDL, the 3 TiDB instances crashed because of OOM and I restarted them. Every DML works normally, but DDLs are stuck.

zimulala commented 4 years ago

@billhcmus Could you show me the result of "curl http://{TiDBIP}:{TiDBPORT}/info/all"?

thuyenptr commented 4 years ago

> @billhcmus Could you show me the result of "curl http://{TiDBIP}:{TiDBPORT}/info/all"?

Here it is. Something is wrong with b202210f-c26a-49f1-90a1-d99f488d62eb:

=================================================================================================
   "status_port": 10080,
   "lease": "45s"
  },
  "a38e4abd-173f-4109-87a1-1c0f4ccb61f6": {
   "version": "5.7.25-TiDB-v3.0.1",
   "git_hash": "9e4e8da3c58c65123db5f26409759fe1847529f8",
   "ddl_id": "a38e4abd-173f-4109-87a1-1c0f4ccb61f6",
   "ip": "",
   "listening_port": 4000,
   "status_port": 10080,
   "lease": "45s"
  },
  "b202210f-c26a-49f1-90a1-d99f488d62eb": {
   "version": "5.7.25-TiDB-v3.0.1",
   "git_hash": "9e4e8da3c58c65123db5f26409759fe1847529f8",
   "ddl_id": "b202210f-c26a-49f1-90a1-d99f488d62eb",
   "ip": "",
   "listening_port": 4000,
   "status_port": 10080,
   "lease": "45s"
  },
  "b26ee3e0-8a64-4730-96a2-64d441a0603c": {
   "version": "5.7.25-TiDB-v3.0.1",
   "git_hash": "9e4e8da3c58c65123db5f26409759fe1847529f8",
   "ddl_id": "b26ee3e0-8a64-4730-96a2-64d441a0603c",
   "ip": "",
   "listening_port": 4000,
   "status_port": 10080,
   "lease": "45s"
  },
  "b9de0dc0-3797-4513-90e3-4e5274e3d4e3": {
   "version": "5.7.25-TiDB-v3.0.1",
   "git_hash": "9e4e8da3c58c65123db5f26409759fe1847529f8",
   "ddl_id": "b9de0dc0-3797-4513-90e3-4e5274e3d4e3",
   "ip": "",
   "listening_port": 4000,
   "status_port": 10080,
   "lease": "45s"
  },
  "c9a24ebf-d620-44fd-bf58-1e8e86dd77c6": {
   "version": "5.7.25-TiDB-v3.0.1",
   "git_hash": "9e4e8da3c58c65123db5f26409759fe1847529f8",
   "ddl_id": "c9a24ebf-d620-44fd-bf58-1e8e86dd77c6",
   "ip": "",
   "listening_port": 4000,
   "status_port": 10080,
   "lease": "45s"
  },
  "f73fb8e2-928e-4bfa-a8b3-1844eabb10d0": {
   "version": "5.7.25-TiDB-v3.0.1",
   "git_hash": "9e4e8da3c58c65123db5f26409759fe1847529f8",
   "ddl_id": "f73fb8e2-928e-4bfa-a8b3-1844eabb10d0",
   "ip": "",
   "listening_port": 4000,
   "status_port": 10080,
   "lease": "45s"
  }
 }
}
zimulala commented 4 years ago

@billhcmus Did you install this cluster with tidb-ansible? You can find tidb-ctl in "tidb-ansible/resources/bin".

You can execute "./tidb-ctl etcd ddlinfo --pdhost {PD_IP} --pdport {PD_PORT}". Then you will see "b202210f-c26a-49f1-90a1-d99f488d62eb" in the value of one of the KVs. In that KV, you will find a key like "/tidb/ddl/fg/owner/xxxx". Then you can execute the following cmds:

Finally, you can execute "./tidb-ctl etcd ddlinfo --pdhost {PD_IP} --pdport {PD_PORT}" again to check whether it works.

zimulala commented 4 years ago

@billhcmus Could you tell me how many TiDB instances are in your cluster? Are there only 3? We need to delete the other DDL ID info.

thuyenptr commented 4 years ago

I run only 3 TiDB instances.

What do you mean by "we need to delete other DDL ID info"?

Delete the ddl_id corresponding to each job_id after job_id=3467?

zimulala commented 4 years ago

@billhcmus No. I mean we need to delete the DDL info with something like "./tidb-ctl etcd delkey ...". There are only 3 instances in the cluster, but we get 6 server infos in the result of "curl http://{TiDBIP}:{TiDBPORT}/info/all", so we need to delete the other 3 server infos. Could you accept a restart of the PD nodes? If so, we wouldn't need to delete the DDL info with "./tidb-ctl etcd delkey ...".

thuyenptr commented 4 years ago

How do I know which server info I need to delete? Choose 3 of the 6 servers?

zimulala commented 4 years ago

@billhcmus The DDL IDs 77c6fbd5-060a-4f91-9782-7d68dd6be6cf, 9a2519ff-1a54-4012-b487-a0bc8d73ef79 and b9de0dc0-3797-4513-90e3-4e5274e3d4e3 are valid; we can see them in the SELF_ID field of admin show ddl; on the 3 servers.

thuyenptr commented 4 years ago

But the result of curl http://{TiDBIP}:{TiDBPORT}/info/all does not contain any of 77c6fbd5-060a-4f91-9782-7d68dd6be6cf, 9a2519ff-1a54-4012-b487-a0bc8d73ef79, or b9de0dc0-3797-4513-90e3-4e5274e3d4e3.

zimulala commented 4 years ago

Did you restart these TiDB instances? Could you execute admin show ddl; and curl http://{TiDBIP}:{TiDBPORT}/info/all again, making sure there are no other abnormal operations between the two?

thuyenptr commented 4 years ago

After checking again, I found that I actually have 6 TiDB instances:

Cluster 1: 3 instances, tidb-binlog enabled.
Cluster 2: 3 instances, tidb-binlog enabled, but the firewall rules do not allow the pump client to reach the pump server.

When Cluster 1 crashed because of OOM, the DDL owner switched to an instance in Cluster 2. Then I executed the DDL, it could not be synced to the Pump server, and it stayed stuck in the queue.

[2020/02/03 16:50:25.900 +07:00] [INFO] [ddl_worker.go:486] ["[ddl] run DDL job"] [worker="worker 1, tp general"] [job="ID:3467, Type:drop table, State:running, SchemaState:write only, SchemaID:3281, TableID:3459, RowCount:0, ArgLen:0, start time: 2020-02-03 16:50:24.51 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0"]

[2020/02/03 16:50:41.047 +07:00] [WARN] [client.go:288] ["[pumps client] write binlog to pump failed"] [NodeID=PMA_ZaloPay_DB-527.8:8250] ["binlog type"=Prewrite] ["start ts"=414377161774858243] ["commit ts"=0] [length=91] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]

Thank you for your support @zimulala .

shenli commented 4 years ago

Do you mean you have 6 tidb-server instances in one TiDB cluster?

> After checking again, I found that I actually have 6 TiDB instances:
>
> Cluster 1: 3 instances, tidb-binlog enabled.
> Cluster 2: 3 instances, tidb-binlog enabled, but the firewall rules do not allow the pump client to reach the pump server.
>
> When Cluster 1 crashed because of OOM, the DDL owner switched to an instance in Cluster 2. Then I executed the DDL, it could not be synced to the Pump server, and it stayed stuck in the queue.
>
> [2020/02/03 16:50:25.900 +07:00] [INFO] [ddl_worker.go:486] ["[ddl] run DDL job"] [worker="worker 1, tp general"] [job="ID:3467, Type:drop table, State:running, SchemaState:write only, SchemaID:3281, TableID:3459, RowCount:0, ArgLen:0, start time: 2020-02-03 16:50:24.51 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
>
> [2020/02/03 16:50:41.047 +07:00] [WARN] [client.go:288] ["[pumps client] write binlog to pump failed"] [NodeID=PMA_ZaloPay_DB-527.8:8250] ["binlog type"=Prewrite] ["start ts"=414377161774858243] ["commit ts"=0] [length=91] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
>
> Thank you for your support @zimulala .

thuyenptr commented 4 years ago

> Do you mean you have 6 tidb-server instances in one TiDB cluster?

Yeah. And finally, after opening the firewall rules, DDL executed normally. I will close this issue. Thank you.

shenli commented 4 years ago

Great! Let me check whether there is an alert rule for such a scenario; that could help root-cause this kind of issue quickly.

lonng commented 4 years ago

Hi @billhcmus, we are developing a new feature, SQL diagnostics, and there is a DDL diagnosis rule that may help locate this kind of problem quickly.

thuyenptr commented 4 years ago

@shenli @lonng Wow, good to hear that. I hope it's released soon.