pingcap / tidb

TiDB - the open-source, cloud-native, distributed SQL database designed for modern applications.
https://pingcap.com
Apache License 2.0
37.41k stars 5.85k forks source link

Can not start tidb-server after upgrading! Help! Help! #8857

Closed 474420502 closed 5 years ago

474420502 commented 5 years ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

Dec 28 10:23:48 vnfs-ag-01 systemd[1]: Started tidb-4000 service. root@vnfs-ag-01:/data/deploy/log# systemctl status tidb-4000.service ● tidb-4000.service - tidb-4000 service Loaded: loaded (/etc/systemd/system/tidb-4000.service; disabled; vendor preset: enabled) Active: active (running) since Fri 2018-12-28 10:23:48 CST; 18s ago Main PID: 12556 (tidb-server) Tasks: 44 (limit: 4915) CGroup: /system.slice/tidb-4000.service └─12556 bin/tidb-server -P 4000 --status=10080 --advertise-address=192.168.6.101 --

Dec 28 10:23:48 vnfs-ag-01 systemd[1]: Started tidb-4000 service. root@vnfs-ag-01:/data/deploy/log# systemctl status tidb-4000.service ● tidb-4000.service - tidb-4000 service Loaded: loaded (/etc/systemd/system/tidb-4000.service; disabled; vendor preset: enabled) Active: active (running) since Fri 2018-12-28 10:23:48 CST; 19s ago Main PID: 12556 (tidb-server) Tasks: 44 (limit: 4915) CGroup: /system.slice/tidb-4000.service └─12556 bin/tidb-server -P 4000 --status=10080 --advertise-address=192.168.6.101 --

Dec 28 10:23:48 vnfs-ag-01 systemd[1]: Started tidb-4000 service. root@vnfs-ag-01:/data/deploy/log# systemctl status tidb-4000.service ● tidb-4000.service - tidb-4000 service Loaded: loaded (/etc/systemd/system/tidb-4000.service; disabled; vendor preset: enabled) Active: activating (auto-restart) (Result: exit-code) since Fri 2018-12-28 10:24:08 CST; 63 Process: 12556 ExecStart=/data/deploy/scripts/run_tidb.sh (code=exited, status=1/FAILURE) Main PID: 12556 (code=exited, status=1/FAILURE)

这次tidb-ansible Merge branch 'release-2.1' of https://github.com/pingcap/tidb-ansible into my 这次轮到 tidb启动不了.

474420502 commented 5 years ago

send tikv request error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.6.101:20160: connect: connection refused", ctx: &{{787169 71 26} id:787169 end_key:"mDB:153\000\000\375\000\000\000\000\000\000\000hTable:10\377040\000\000\000\000\000\372" region_epoch: peers: peers: peers: id:787170 store_id:1 192.168.6.101:20160}, try next peer later at 2018-12-28T10:29:30.26849713+08:00 2018/12/28 10:29:30.268 session.go:1275: [fatal] check bootstrapped err [tikv:9002]TiKV server timeout[try again later] 2018/12/28 10:29:45.365 printer.go:39: [info] Welcome to TiDB. 2018/12/28 10:29:45.365 printer.go:40: [info] Release Version: v2.1.2-1-g8ba8096 2018/12/28 10:29:45.365 printer.go:41: [info] Git Commit Hash: 8ba809622668a1287e9f7152bfd34ad6be220304 2018/12/28 10:29:45.365 printer.go:42: [info] Git Branch: release-2.1 2018/12/28 10:29:45.365 printer.go:43: [info] UTC Build Time: 2018-12-21 03:45:55 2018/12/28 10:29:45.365 printer.go:44: [info] GoVersion: go version go1.11.2 linux/amd64 2018/12/28 10:29:45.365 printer.go:45: [info] Race Enabled: false 2018/12/28 10:29:45.365 printer.go:46: [info] Check Table Before Drop: false 2018/12/28 10:29:45.365 printer.go:47: [info] TiKV Min Version: 2.1.0-alpha.1-ff3dd160846b7d1aed9079c389fc188f7f5ea13e 2018/12/28 10:29:45.365 printer.go:52: [info] Config: {"host":"0.0.0.0","advertise-address":"192.168.6.101","port":4000,"store":"tikv","path":"192.168.6.101:2379,192.168.6.102:2379,192.168.6.103:2379,192.168.6.104:2379,192.168.6.105:2379","socket":"","lease":"45s","run-ddl":true,"split-table":true,"token-limit":1000,"oom-action":"log","mem-quota-query":34359738368,"enable-streaming":false,"txn-local-latches":{"enabled":false,"capacity":10240000},"lower-case-table-names":2,"log":{"level":"info","format":"text","disable-timestamp":false,"file":{"filename":"/data/deploy/log/tidb.log","log-rotate":true,"max-size":300,"max-days":0,"max-backups":0},"slow-query-file":"","slow-threshold":300,"expensive-threshold":10000,"query-log-max-len":2048},"security":{"skip-grant-table":false,"ssl-ca":"","ssl-cert":"","ssl-key":"","cluster-ssl-ca":"","cluster-ssl-cert":"","cluster-ssl-key":""},"status":{"report-status":true,"status-port":10080,"metrics-addr":"","metrics-interval":15},"performance":{"max-procs":0,"tcp-keep-alive":true,"cross-join":true,"stats-lease":"3s","run-auto-analyze":true,"stmt-count-limit":5000,"feedback-probability":0.05,"query-feedback-limit":1024,"pseudo-estimate-ratio":0.8,"force-priority":"NO_PRIORITY"},"xprotocol":{"xserver":false,"xhost":"","xport":0,"xsocket":""},"prepared-plan-cache":{"enabled":false,"capacity":100},"opentracing":{"enable":false,"sampler":{"type":"const","param":1,"sampling-server-url":"","max-operations":0,"sampling-refresh-interval":0},"reporter":{"queue-size":0,"buffer-flush-interval":0,"log-spans":false,"local-agent-host-port":""},"rpc-metrics":false},"proxy-protocol":{"networks":"","header-timeout":5},"tikv-client":{"grpc-connection-count":16,"grpc-keepalive-time":10,"grpc-keepalive-timeout":3,"commit-timeout":"41s"},"binlog":{"enable":false,"write-timeout":"15s","ignore-error":false,"binlog-socket":""},"compatible-kill-query":false} 2018/12/28 10:29:45.366 main.go:208: [info] disable Prometheus push client 2018/12/28 10:29:45.366 tidb.go:291: [info] new store 2018/12/28 10:29:45.366 client.go:124: [info] [pd] create pd client with endpoints [192.168.6.101:2379 192.168.6.102:2379 192.168.6.103:2379 192.168.6.104:2379 192.168.6.105:2379] 2018/12/28 10:29:45.366 systime_mon.go:24: [info] start system time monitor 2018/12/28 10:29:45.370 client.go:226: [info] [pd] leader switches to: http://192.168.6.105:2379, previous: 2018/12/28 10:29:45.370 client.go:143: [info] [pd] init cluster id 6579458174277658331 2018/12/28 10:29:45.375 region_cache.go:482: [info] drop regions that on the store 1(192.168.6.101:20160) due to send request fail, err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.6.101:20160: connect: connection refused" 2018/12/28 10:29:45.540 region_cache.go:482: [info] drop regions that on the store 1(192.168.6.101:20160) due to send request fail, err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.6.101:20160: connect: connection refused" 2018/12/28 10:29:45.905 region_cache.go:482: [info] drop regions that on the store 1(192.168.6.101:20160) due to send request fail, err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.6.101:20160: connect: connection refused" 2018/12/28 10:29:46.637 region_cache.go:482: [info] drop regions that on the store 1(192.168.6.101:20160) due to send request fail, err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.6.101:20160: connect: connection refused" 2018/12/28 10:29:47.689 region_cache.go:482: [info] drop regions that on the store 1(192.168.6.101:20160) due to send request fail, err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.6.101:20160: connect: connection refused"

474420502 commented 5 years ago

上面是tidb.log日志. 这次tikv可以启动了, 轮到tidb启动不了.
滚动升级和非滚动升级我都试了. 都是卡着TIDB这个节点启动不了.

shenli commented 5 years ago

@474420502 From the following log, I can tell TiKV is not running:

2018/12/28 10:29:45.905 region_cache.go:482: [info] drop regions that on the store 1(192.168.6.101:20160) due to send request fail, err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.6.101:20160: connect: connection refused"

shenli commented 5 years ago

Is there any error log in TiKV?

474420502 commented 5 years ago

2018/12/28 09:54:31.047 INFO util.rs:406: connecting to PD endpoint: "http://192.168.6.104:2379" 2018/12/28 09:54:31.051 INFO util.rs:465: connected to PD leader "http://192.168.6.104:2379" 2018/12/28 09:54:31.051 WARN util.rs:186: heartbeat sender and receiver are stale, refreshing.. 2018/12/28 09:54:31.051 INFO util.rs:190: cancel region heartbeat sender 2018/12/28 09:54:31.051 WARN util.rs:205: updating PD client done, spent 12.273793ms 2018/12/28 09:54:31.059 ERRO util.rs:336: fail to request: Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("version should compatible with version 3.0.0-alpha, got 2.1.2") })) 2018/12/28 09:54:31.062 ERRO util.rs:336: fail to request: Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("version should compatible with version 3.0.0-alpha, got 2.1.2") })) 2018/12/28 09:54:31.067 ERRO util.rs:336: fail to request: Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("version should compatible with version 3.0.0-alpha, got 2.1.2") })) 2018/12/28 09:54:31.070 ERRO util.rs:336: fail to request: Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("version should compatible with version 3.0.0-alpha, got 2.1.2") })) 2018/12/28 09:54:31.079 ERRO util.rs:336: fail to request: Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("version should compatible with version 3.0.0-alpha, got 2.1.2") })) 2018/12/28 09:54:31.083 ERRO util.rs:336: fail to request: Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("version should compatible with version 3.0.0-alpha, got 2.1.2") })) 2018/12/28 09:54:31.091 ERRO util.rs:336: fail to request: Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("version should compatible with version 3.0.0-alpha, got 2.1.2") })) 2018/12/28 09:54:31.093 ERRO util.rs:336: fail to request: Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("version should compatible with version 3.0.0-alpha, got 2.1.2") })) 2018/12/28 09:54:31.097 ERRO util.rs:336: fail to request: Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("version should compatible with version 3.0.0-alpha, got 2.1.2") })) 2018/12/28 09:54:31.097 ERRO tikv-server.rs:235: failed to start node: Pd(Other(StringError("[src/pd/util.rs:344]: fail to request"))) 2018/12/28 09:58:16.315 ERRO store.rs:1164: Send compaction finished event to raftstore failed: Closed 2018/12/28 09:58:16.390 ERRO store.rs:1164: Send compaction finished event to raftstore failed: Closed 2018/12/28 09:58:17.593 ERRO store.rs:1164: Send compaction finished event to raftstore failed: Closed @shenli 我觉得最有价值的就是这个了.

jackysp commented 5 years ago

"version should compatible with version 3.0.0-alpha, got 2.1.2" is the root cause. It means the cluster is downgraded not upgraded. There is a simple way to fix it.

474420502 commented 5 years ago

Release Version: v2.1.0-rc.3-381-g7108881 Git Commit Hash: 71088815e7c5121e069d7ec10595d176a44b9bea Git Branch: master UTC Build Time: 2018-12-27 09:17:13 GoVersion: go version go1.11.2 linux/amd64 Race Enabled: false TiKV Min Version: 2.1.0-alpha.1-ff3dd160846b7d1aed9079c389fc188f7f5ea13e Check Table Before Drop: false

Release Version: 3.0.0-alpha Git Commit Hash: 72bf02f8a71fef7d66c69deb1f9e0b25ba49a9ee Git Commit Branch: master UTC Build Time: 2018-12-27 03:07:19 Rust Version: rustc 1.29.0-nightly (4f3c7a472 2018-07-17)

Release Version: v3.0.0-alpha-74-ga573364 Git Commit Hash: a5733642e029d09a50651f2881c35737dcfc7512 Git Branch: master UTC Build Time: 2018-12-27 02:22:05 我是更新了一次2.1.0 rc版本后一直追着更新, 然后看到有2.1release版本后想更新到这个v2.1.2版本

rleungx commented 5 years ago

@474420502 You can use pd-ctl to downgrade the cluster version:

>> config set cluster-version <your_version>

BTW, the usage of pd-ctl can be found in this documentation.

474420502 commented 5 years ago

那我能不升级吗, 就留着这个? 等以后有一个release版本比这个新了, 然后我更新到稳定版本. @rleungx