pingcap / tidb

TiDB is an open-source, cloud-native, distributed, MySQL-compatible database for elastic scale and real-time analytics. Try AI-powered Chat2Query free at: https://www.pingcap.com/tidb-serverless/
https://pingcap.com
Apache License 2.0

tidb-lightning issue #9269

Closed. longhuguoji closed this issue 4 months ago

longhuguoji commented 5 years ago

I'm using TiDB 2.0 and tidb-lightning 2.0. tikv-importer starts successfully, but tidb-lightning fails on startup with the following log:

2019/02/11 15:31:26.318 version.go:41: [info] Welcome to lightning
2019/02/11 15:31:26.318 version.go:42: [info] Release Version:
2019/02/11 15:31:26.318 version.go:43: [info] Git Commit Hash: 77470cc570fbda32ebbd976145aadc2057938d47
2019/02/11 15:31:26.318 version.go:44: [info] Git Branch: release-2.0
2019/02/11 15:31:26.318 version.go:45: [info] UTC Build Time: 2018-11-22 02:43:00
2019/02/11 15:31:26.318 version.go:46: [info] Go Version: go version go1.11 linux/amd64
2019/02/11 15:31:26.319 lightning.go:59: [info] cfg {"lightning":{"level":"info","file":"tidb-lightning.log","max-size":128,"max-days":28,"max-backups":14,"table-concurrency":8,"region-concurrency":75,"pprof-port":10089,"check-requirements":true},"tidb":{"host":"20.4.18.43","port":4000,"user":"root","status-port":10080,"pd-addr":"20.4.18.43:2379","sql-mode":"STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION","log-level":"error","distsql-scan-concurrency":16},"checkpoint":{"enable":true,"schema":"tidb_lightning_checkpoint","keep-after-success":false},"mydumper":{"read-block-size":4096,"region-min-size":268435456,"data-source-dir":"/data/my_database","no-schema":false},"tikv-importer":{"addr":"20.4.18.49:20170","batch-size":214748364800},"post-restore":{"compact":true,"checksum":true,"analyze":true},"config-file":"../conf/tidb-lightning.toml"}
2019/02/11 15:31:27.227 restore.go:182: [error] run cause error : rpc error: code = Unknown desc = Grpc(RpcFailure(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))
/home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/lightning/kv/importer.go:95:
/home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:555:
/home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:547:
2019/02/11 15:31:27.227 restore.go:189: [info] Timing statistic :

2019/02/11 15:31:27.227 restore.go:190: [info] the whole procedure takes 146.796231ms
2019/02/11 15:31:27.227 main.go:53: [error] tidb lightning encountered error:rpc error: code = Unknown desc = Grpc(RpcFailure(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))
/home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/lightning/kv/importer.go:95:
/home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:555:
/home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:547:
/home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:194:
/home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/lightning/lightning.go:121:
/home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/lightning/lightning.go:73:

The tidb-lightning.toml configuration is as follows:

# TiDB-Lightning configuration file template

[lightning]
# HTTP port used for debugging and Prometheus monitoring. Set to 0 to disable.
pprof-port = 10089

# Check whether the cluster version is supported before starting the import.
#check-requirements = true

# Number of tables processed concurrently. This value affects tikv-importer's memory usage.
# Must not exceed the max-open-engines value in tikv-importer.
table-concurrency = 8
# Concurrency for converting data. Defaults to the number of logical CPUs and normally needs no configuration.
# In a mixed deployment, this can be set to 75% of the logical CPU count.
region-concurrency = 75

# Logging
level = "info"
file = "tidb-lightning.log"
max-size = 128 # MB
max-days = 28
max-backups = 14

[checkpoint]
# Enable resumable imports (checkpoints).
# During an import, Lightning records the current progress.
# If Lightning or another component exits abnormally, data that has already been imported is not re-imported after a restart.
enable = true
# Name of the database that stores the checkpoints.
schema = "tidb_lightning_checkpoint"
# Connection parameters (DSN) of the database that stores the checkpoints, in the form "user:password@tcp(host:port)".
# By default the target database configured in [tidb] is reused to store the checkpoints.
# To avoid adding load to the target cluster, it is recommended to use a separate MySQL-compatible database server instead.
# dsn = "root@tcp(127.0.0.1:4000)/"
# Whether to keep the checkpoints after a successful import. Deleted by default.
# Keeping the checkpoints helps debugging, but may leak metadata about the data source.
# keep-after-success = false

[tikv-importer]
# Listening address of tikv-importer. Change this to the actual address of the tikv-importer server.
addr = "20.4.18.49:20170"

[mydumper]
# Block size used when reading files.
read-block-size = 4096 # bytes (default = 4 KB)
# Each data file is split into multiple chunks processed concurrently; this is the size of each chunk.
region-min-size = 268435456 # bytes (default = 256 MB)
# Directory of the mydumper source data.
data-source-dir = "/data/my_database"
# If no-schema is set to true, tidb-lightning fetches the table schemas directly from tidb-server
# instead of creating the databases/tables from the schema files under data-source-dir.
# Use this when the tables were created manually or already exist in TiDB.
no-schema = false

[tidb]
# Information about the target cluster. Listening address of a tidb-server; specifying one is enough.
host = "20.4.18.43"
port = 4000
user = "root"
password = "Root@1233"
# Table schema information is fetched from TiDB's status port.
status-port = 10080
# Listening address of a pd-server; specifying one is enough.
pd-addr = "20.4.18.43:2379"
# tidb-lightning imports the TiDB library, which emits some logs of its own. This setting controls the log level of the TiDB library.
log-level = "error"
# TiDB session variable used to speed up the CHECKSUM and ANALYZE steps.
distsql-scan-concurrency = 16

# After the data is imported, checksum (CHECKSUM), compaction (Compact) and analysis (ANALYZE) can be run automatically.
# In production it is recommended to set all of them to true.
# Execution order: CHECKSUM -> Compact -> ANALYZE.
[post-restore]
# If set to true, runs `ADMIN CHECKSUM TABLE <table>` on each table.
checksum = true
# If set to true, performs a full compaction on all the data.
compact = true
# If set to true, runs `ANALYZE TABLE <table>` on each table.
analyze = true
longhuguoji commented 5 years ago

After tidb-lightning started, tikv-importer reported the following error.

2019/02/11 15:29:18.211 INFO mod.rs:26: Welcome to TiKV.
Release Version:   2.1.0-beta
Git Commit Hash:   0b4d98b7fdb30d2d783709a884afb3cf777ab58c
Git Commit Branch: importer-2.0.x
UTC Build Time:    2018-11-22 09:38:33
Rust Version:      rustc 1.29.0-nightly (4f3c7a472 2018-07-17)
2019/02/11 15:29:18.212 WARN setup.rs:166: environment variable `TZ` is missing, using `/etc/localtime`
2019/02/11 15:29:18.262 INFO tikv-importer.rs:132: import server started
2019/02/11 15:31:27.218 INFO util.rs:392: connect to PD endpoint: "20.4.18.43:2379"
2019/02/11 15:31:27.220 INFO util.rs:392: connect to PD endpoint: "http://20.4.18.43:2379"
2019/02/11 15:31:27.222 INFO util.rs:392: connect to PD endpoint: "http://20.4.18.43:2379"
2019/02/11 15:31:27.223 INFO util.rs:451: connect to PD leader "http://20.4.18.43:2379"
2019/02/11 15:31:27.223 INFO util.rs:380: All PD endpoints are consistent: ["20.4.18.43:2379"]
2019/02/11 15:31:27.227 ERRO kv_service.rs:74: switch cluster mode: Import: Grpc(RpcFailure(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))
eurekaka commented 5 years ago

@longhuguoji thanks for the feedback, @kennytm PTAL

lonng commented 5 years ago

@longhuguoji Check your tikv-importer listen address, and make sure tidb-lightning can connect to it.
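A quick way to verify both ends of that link (a rough sketch; the 20.4.18.49:20170 address comes from the configuration above, and ss/nc are assumed to be installed):

# on the tikv-importer host (20.4.18.49): confirm the process is listening on port 20170
ss -lnt | grep 20170           # or: netstat -lnt | grep 20170

# on the tidb-lightning host: confirm the port is reachable over TCP
nc -vz -w 3 20.4.18.49 20170   # or: telnet 20.4.18.49 20170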

longhuguoji commented 5 years ago

@eurekaka My tikv-importer and tidb-lightning are on the same machine, and the two services communicate normally.

lonng commented 5 years ago

@longhuguoji Can tikv-importer (20.4.18.49) connect to the PD server (20.4.18.43)?

longhuguoji commented 5 years ago

@lonng How do I test that? tikv-importer (20.4.18.49) can ping the PD server (20.4.18.43).

lonng commented 5 years ago

@longhuguoji

On 20.4.18.49, run: telnet 20.4.18.43 2379

longhuguoji commented 5 years ago

@lonng telnet connects successfully.

lonng commented 5 years ago

PTAL @kennytm

lonng commented 5 years ago

@longhuguoji Could you get the stores info with ./pd-ctl -d store -u http://20.4.18.43:2379?

kennytm commented 5 years ago

@longhuguoji Hi,

Could you verify that the node hosting Importer can contact every TiKV instance obtained from PD? You can get the addresses of all TiKV stores using pd-ctl (a scripted check follows the example below).

$ bin/pd-ctl -u http://20.4.18.43:2379
» store
{
  "count": 1,
  "stores": [
    {
      "store": {
        "id": 1,
        "address": "123.45.67.89:20172",   // <--- telnet these!
...
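A rough way to script that check from the Importer machine (a sketch; it assumes nc and standard GNU tools are available, and that every store address appears in an "address" field as in the excerpt above):

bin/pd-ctl -u http://20.4.18.43:2379 -d store \
  | grep '"address"' \
  | sed 's/.*"address": *"\([^"]*\)".*/\1/' \
  | while read addr; do
      nc -vz -w 3 "${addr%:*}" "${addr##*:}"   # test each store's host:port over TCP
    done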
longhuguoji commented 5 years ago

@kennytm I ran the command above. I have three TiKV nodes, all on port 20160, and all of them are reachable via telnet.

longhuguoji commented 5 years ago

@kennytm @lonng I suspect the TiDB cluster version is too low: mine is 2.0.8, while tidb-lightning requires at least 2.0.9. I'm trying to upgrade to see whether that fixes the problem. During the upgrade I hit this message: "swap is on, for best performance, turn swap off". How do I resolve it? Probably because I'm running on virtual machines?

lonng commented 5 years ago

Try to upgrade cluster to 2.1 if possible. :)

longhuguoji commented 5 years ago

@lonng During the upgrade, one step failed:

TASK [wait until the TiKV port is down] ** fatal: [20.4.18.41]: FAILED! => {"changed": false, "elapsed": 300, "msg": "the TiKV port 20160 is not down"}

Do I need to manually stop all the TiKV nodes?

longhuguoji commented 5 years ago

@lonng The error during the TiDB version upgrade was finally narrowed down to the following, which is the same as the tidb-lightning error:

failed to connect to http://20.4.18.43:2379, Grpc(RpcFailure(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))

How can this be resolved?

longhuguoji commented 5 years ago

@kennytm I followed your steps; telnet succeeds for all of them, yet it still fails with Grpc(RpcFailure(RpcStatus { status: Unavailable, details: Some("Connect Failed") })).

kennytm commented 5 years ago

@longhuguoji

  1. Are all TiKV nodes reachable from the machine running Ansible?
  2. Could you use tikv-ctl to verify whether the nodes are really behaving correctly? A firewall/router/reverse proxy may leave the port open without connecting it to the actual server, so telnet can connect even though no real communication is possible. (A sketch looping this check over all nodes follows below.)
tikv-ctl --host 20.4.18.41:20160 metrics | tail

Thanks!
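To run that check against every TiKV node in one go (a sketch; the three addresses here are assumptions, so substitute the store addresses reported by pd-ctl):

for h in 20.4.18.40 20.4.18.41 20.4.18.42; do
  echo "== $h =="
  ./tikv-ctl --host "$h:20160" metrics | tail -n 3   # a healthy node returns metrics here
done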

longhuguoji commented 5 years ago

@kennytm Hi, after running tikv-ctl --host 20.4.18.41:20160 metrics | tail I get the following output:

[root@tidb4 bin]# ./tikv-ctl --host 20.4.18.41:20160 metrics | tail
tikv_worker_pending_task_total{name="consistency check worker"} 0
tikv_worker_pending_task_total{name="end-point-worker"} 0
tikv_worker_pending_task_total{name="pd worker"} 0
tikv_worker_pending_task_total{name="raft gc worker"} 0
tikv_worker_pending_task_total{name="snap-handler"} 0
tikv_worker_pending_task_total{name="snapshot worker"} 0
tikv_worker_pending_task_total{name="split check worker"} 0
tikv_worker_pending_task_total{name="storage-scheduler"} 0
tikv_worker_pending_task_total{name="store address resolve worker"} 0

longhuguoji commented 5 years ago

@kennytm Hi, I ran tikv-ctl --host 20.4.18.4*:20160 metrics | tail against all three TiKV nodes; the outputs are as follows:

[root@tidb4 bin]# ./tikv-ctl --host 20.4.18.41:20160 metrics | tail
tikv_worker_pending_task_total{name="consistency check worker"} 0
tikv_worker_pending_task_total{name="end-point-worker"} 0
tikv_worker_pending_task_total{name="pd worker"} 0
tikv_worker_pending_task_total{name="raft gc worker"} 0
tikv_worker_pending_task_total{name="snap-handler"} 0
tikv_worker_pending_task_total{name="snapshot worker"} 0
tikv_worker_pending_task_total{name="split check worker"} 0
tikv_worker_pending_task_total{name="storage-scheduler"} 0
tikv_worker_pending_task_total{name="store address resolve worker"} 0

[root@tidb4 bin]# ./tikv-ctl --host 20.4.18.40:20160 metrics | tail
tikv_worker_pending_task_total{name="consistency-check"} 0
tikv_worker_pending_task_total{name="gc-worker"} 0
tikv_worker_pending_task_total{name="local-reader"} 0
tikv_worker_pending_task_total{name="pd-worker"} 0
tikv_worker_pending_task_total{name="raft-gc-worker"} 0
tikv_worker_pending_task_total{name="snap-handler"} 0
tikv_worker_pending_task_total{name="snapshot-worker"} 0
tikv_worker_pending_task_total{name="split-check"} 0
tikv_worker_pending_task_total{name="storage-scheduler"} 0

[root@tidb4 bin]# ./tikv-ctl --host 20.4.18.42:20160 metrics | tail
tikv_worker_pending_task_total{name="raft gc worker"} 0
tikv_worker_pending_task_total{name="snap-handler"} 0
tikv_worker_pending_task_total{name="snapshot worker"} 0
tikv_worker_pending_task_total{name="split check worker"} 0
tikv_worker_pending_task_total{name="storage-scheduler"} 0
tikv_worker_pending_task_total{name="store address resolve worker"} 0
# HELP update_region_size_count_by_compaction Total number of update region size caused by compaction.
# TYPE update_region_size_count_by_compaction counter
update_region_size_count_by_compaction 2324
kennytm commented 5 years ago

@longhuguoji OK. Was the rolling upgrade successful (i.e. does https://github.com/pingcap/tidb/issues/9269#issuecomment-462585841 no longer happen)? And does the error in https://github.com/pingcap/tidb/issues/9269#issuecomment-462600917 still appear? Could you tell us how this error is reproduced?

Thanks.

longhuguoji commented 5 years ago

@kennytm The rolling upgrade has not succeeded yet. When rolling-upgrading the TiDB cluster (ansible-playbook rolling_update.yml), it fails with:

TASK [wait until the TiKV port is down] ** fatal: [20.4.18.41]: FAILED! => {"changed": false, "elapsed": 300, "msg": "the TiKV port 20160 is not down"}

I then checked the TiKV node logs, which show the following errors:

2019/02/12 10:58:56.580 util.rs:431: [ERROR] failed to connect to http://20.4.18.43:2379, Grpc(RpcFailure(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))
2019/02/12 10:58:57.549 util.rs:431: [ERROR] failed to connect to http://20.4.18.43:2379, Grpc(RpcFailure(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))
2019/02/12 10:58:58.538 util.rs:431: [ERROR] failed to connect to http://20.4.18.43:2379, Grpc(RpcFailure(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))
2019/02/12 10:58:59.539 util.rs:431: [ERROR] failed to connect to http://20.4.18.43:2379, Grpc(RpcFailure(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))
2019/02/12 10:59:00.700 client.rs:337: [ERROR] failed to send heartbeat: Grpc(RpcFinished(Some(RpcStatus { status: Unavailable, details: Some("Endpoint read failed") })))
2019/02/12 10:59:00.700 util.rs:264: [ERROR] request failed: Grpc(RpcFinished(Some(RpcStatus { status: Unavailable, details: Some("Endpoint read failed") })))
2019/02/12 10:59:10.066 raft_client.rs:176: [ERROR] server: drop conn with tikv endpoint 20.4.18.40:20160 flush conn error: SendError("...")
2019/02/12 10:59:26.423 scheduler.rs:1306: [ERROR] get snapshot failed for cids=[100709], error Request(message: "peer is not leader" not_leader {region_id: 490 leader {id: 492 store_id: 5}})
2019/02/12 10:59:29.184 scheduler.rs:1306: [ERROR] get snapshot failed for cids=[100749], error Request(message: "peer is not leader" not_leader {region_id: 1742 leader {id: 1744 store_id: 5}})

At this point I don't know where the problem is.

kennytm commented 5 years ago

@longhuguoji It seems the TiKV node cannot reach PD. Does telnet or curl to http://20.4.18.43:2379 work from 20.4.18.41?

longhuguoji commented 5 years ago

@kennytm After running curl http://20.4.18.43:2379, all TiKV nodes get 404 page not found back.

kennytm commented 5 years ago

@longhuguoji That is expected when PD is reachable 🤔. Does manually stopping TiKV work?

# `ssh tidb@20.4.18.41` and run
cd /data1/deploy
scripts/stop_tikv.sh

# make sure 20.4.18.41 has disappeared
tikv-ctl --host 20.4.18.41:20160 metrics | tail 
# should return `DebugClient::metrics: RpcFailure(RpcStatus { status: Unavailable, details: Some("Connect Failed") })`

Try to continue rolling update after that.
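If the stop script alone does not bring the port down, a rough manual fallback (a sketch; <pid> is a placeholder for the tikv-server process id found in the first step):

ps -ef | grep [t]ikv-server            # find the tikv-server PID on 20.4.18.41
kill <pid>                             # ask it to stop gracefully first
sleep 10
ss -lnt | grep 20160 || echo "port 20160 is down"
# only if the process still refuses to exit:
# kill -9 <pid>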

longhuguoji commented 5 years ago

@kennytm Running stop_tikv.sh did not stop the service. I looked up the tikv-server process, killed it manually, re-ran the upgrade, and it went through; the tikv process was later brought back up automatically. The upgrade has now succeeded, thank you very much!!!

However, tidb-lightning still reports the error below, so the TiDB version issue is ruled out:

2019/02/12 16:50:43.531 version.go:41: [info] Welcome to lightning
2019/02/12 16:50:43.531 version.go:42: [info] Release Version: 
2019/02/12 16:50:43.531 version.go:43: [info] Git Commit Hash: 77470cc570fbda32ebbd976145aadc2057938d47
2019/02/12 16:50:43.531 version.go:44: [info] Git Branch: release-2.0
2019/02/12 16:50:43.531 version.go:45: [info] UTC Build Time: 2018-11-22 02:43:00
2019/02/12 16:50:43.531 version.go:46: [info] Go Version: go version go1.11 linux/amd64
2019/02/12 16:50:43.532 lightning.go:59: [info] cfg {"lightning":{"level":"info","file":"tidb-lightning.log","max-size":128,"max-days":28,"max-backups":14,"table-concurrency":8,"region-concurrency":75,"pprof-port":10089,"check-requirements":true},"tidb":{"host":"20.4.18.43","port":4000,"user":"root","status-port":10080,"pd-addr":"20.4.18.43:2379","sql-mode":"STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION","log-level":"error","distsql-scan-concurrency":16},"checkpoint":{"enable":true,"schema":"tidb_lightning_checkpoint","keep-after-success":false},"mydumper":{"read-block-size":4096,"region-min-size":268435456,"data-source-dir":"/data/my_database","no-schema":false},"tikv-importer":{"addr":"20.4.18.49:20170","batch-size":214748364800},"post-restore":{"compact":true,"checksum":true,"analyze":true},"config-file":"../conf/tidb-lightning.toml"}
2019/02/12 16:50:43.555 restore.go:182: [error] run cause error : rpc error: code = Unknown desc = Grpc(RpcFailure(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))
/home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/lightning/kv/importer.go:95: 
/home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:555: 
/home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:547: 
2019/02/12 16:50:43.556 restore.go:189: [info] Timing statistic :

2019/02/12 16:50:43.556 restore.go:190: [info] the whole procedure takes 13.510307ms
2019/02/12 16:50:43.556 main.go:53: [error] tidb lightning encountered error:rpc error: code = Unknown desc = Grpc(RpcFailure(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))
/home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/lightning/kv/importer.go:95: 
/home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:555: 
/home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:547: 
/home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:194: 
/home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/lightning/lightning.go:121: 
/home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/lightning/lightning.go:73:

How can this be resolved?

kennytm commented 5 years ago

@longhuguoji To confirm, is the PD node http://20.4.18.43:2379 still reachable from the importer machine 20.4.18.49 (i.e. curl returns 404), and are all TiKV nodes (20.4.18.{40,41,42}:20160) reachable from 20.4.18.49 with tikv-ctl?

longhuguoji commented 5 years ago

@kennytm

1. From 20.4.18.49 the PD node http://20.4.18.43:2379 is reachable; communication is normal (it returns 404).
2. There is no tikv-ctl on 20.4.18.49 (I deployed tidb-lightning manually). From 20.4.18.49, curl http://20.4.18.{40,41,42}:20160 all return a response, so communication is normal. Communication between all the machines should be fine.
3. When I run ./tidb-lightning on 20.4.18.49 and watch the logs of the TiKV nodes (20.4.18.{40,41,42}), they show:

2019/02/12 18:03:31.080 INFO sst_service.rs:90: switch mode Import
2019/02/12 18:03:31.080 WARN sst_service.rs:97: send rpc response: RemoteStopped

Can the problem be located from this?

longhuguoji commented 5 years ago

@kennytm After running ./tidb-lightning-ctl -switch-mode=normal -config=../conf/tidb-lightning.toml on 20.4.18.49, the console prints the following:

[root@app bin]# ./tidb-lightning-ctl -switch-mode=normal -config ../conf/tidb-lightning.toml
rpc error: code = Unknown desc = Grpc(RpcFailure(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))
github.com/pingcap/tidb-lightning/vendor/github.com/pkg/errors.AddStack
    /home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/vendor/github.com/pkg/errors/errors.go:174
github.com/pingcap/tidb-lightning/vendor/github.com/pkg/errors.Trace
    /home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/vendor/github.com/pkg/errors/juju_adaptor.go:11
github.com/pingcap/tidb-lightning/lightning/kv.(*Importer).SwitchMode
    /home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/lightning/kv/importer.go:95
main.switchMode
    /home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/cmd/tidb-lightning-ctl/main.go:104
main.run
    /home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/cmd/tidb-lightning-ctl/main.go:54
main.main
    /home/jenkins/workspace/build_tidb_lightning/go/src/github.com/pingcap/tidb-lightning/cmd/tidb-lightning-ctl/main.go:19
runtime.main
    /usr/local/go/src/runtime/proc.go:201
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1333

The TiKV node logs show:

[tidb@tidb2 log]$ tail -n 2 tikv.log
2019/02/13 09:52:19.499 INFO sst_service.rs:90: switch mode Normal
2019/02/13 09:52:19.500 WARN sst_service.rs:97: send rpc response: RemoteStopped

So the TiKV nodes do receive the command from tidb-lightning and execute it, and then report send rpc response: RemoteStopped. I don't know what causes this.

kennytm commented 5 years ago

@longhuguoji

  1. Is tikv-importer also running on 20.4.18.49? The commands are forwarded like this:

    Lightning ←→ Importer ←→ TiKV

    We need to check whether any of these communication links is blocked. Could you also monitor the log from Importer?

  2. As TiKV acts on the request but the reply is not received, perhaps check whether 20.4.18.49:20170 is reachable from 20.4.18.{41,42,43}? (A reachability sketch follows after this comment.)

BTW, we no longer recommend deploying Importer and Lightning on the same machine, since this puts too much pressure on the machine and can crash the process.
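For point 2 above, a rough reverse-direction check (a sketch; run it on each TiKV host, assuming nc or telnet is installed there):

# run on each of the TiKV hosts; tests whether the Importer port answers TCP connections
nc -vz -w 3 20.4.18.49 20170
# or, without nc:
telnet 20.4.18.49 20170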

longhuguoji commented 5 years ago

1. This is a test environment; I'm just roughly testing the performance of tidb-lightning, and in production they will be deployed separately. The Importer log is as follows:

2019/02/13 10:18:48.431 INFO mod.rs:26: Welcome to TiKV.
Release Version:   2.1.0-rc.5
Git Commit Hash:   ec7c77dcbec9a0b2efbd3dfc642d56d065954dfd
Git Commit Branch: release-2.1
UTC Build Time:    2018-11-22 10:02:35
Rust Version:      rustc 1.29.0-nightly (4f3c7a472 2018-07-17)
2019/02/13 10:18:48.435 WARN setup.rs:172: environment variable TZ is missing, using /etc/localtime
2019/02/13 10:18:48.472 INFO tikv-importer.rs:142: import server started
2019/02/13 10:19:56.753 INFO util.rs:406: connecting to PD endpoint: "20.4.18.43:2379"
2019/02/13 10:19:56.755 INFO util.rs:406: connecting to PD endpoint: "http://20.4.18.43:2379"
2019/02/13 10:19:56.757 INFO util.rs:406: connecting to PD endpoint: "http://20.4.18.43:2379"
2019/02/13 10:19:56.758 INFO util.rs:465: connected to PD leader "http://20.4.18.43:2379"
2019/02/13 10:19:56.758 INFO util.rs:394: All PD endpoints are consistent: ["20.4.18.43:2379"]
2019/02/13 10:19:56.762 ERRO kv_service.rs:74: switch cluster mode: Import: Grpc(RpcFailure(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))

2. After tikv-importer comes up, port 20170 is open, and 20.4.18.49:20170 can be reached from all three TiKV nodes 20.4.18.{41,42,43}.

Could you help me find the cause of the gRPC failure?

kennytm commented 5 years ago

@longhuguoji That's really strange. To confirm: (1) is TiKV upgraded to 2.1, and are all components at version 2.1? (2) is it possible there's a firewall between the machines blocking gRPC traffic?
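A few quick host-level checks for point (2) (a sketch; the commands assume a CentOS/RHEL-style host and may need adjusting):

systemctl status firewalld     # is a firewall service running?
iptables -L -n | head -n 20    # any REJECT/DROP rules covering ports 2379/20160/20170?
getenforce                     # SELinux can also interfere in some setups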

longhuguoji commented 5 years ago

@kennytm Running tikv-server -V under /data/deploy/bin on the TiKV nodes returns v2.1.3 everywhere (deployed from tidb-ansible-release-2.1.zip), and tidb-lightning uses tidb-lightning-release-2.1-linux-amd64.tar.gz, so the two versions match. I still don't understand why it fails with GrpcFailure when all nodes can communicate with each other.

ice1000 commented 5 years ago

2019/02/13 09:52:19.500 WARN sst_service.rs:97: send rpc response: RemoteStopped

Same error message as the unstable test in grpc-rs…

longhuguoji commented 5 years ago

@ice1000 @kennytm How can this problem be solved?

overvenus commented 5 years ago

@longhuguoji Could you rerun tikv-importer with the environment variables GRPC_TRACE=all GRPC_VERBOSITY=DEBUG?

longhuguoji commented 5 years ago

@overvenus @ice1000 @kennytm Do you mean export GRPC_TRACE=all in /etc/profile? I added it that way and re-ran it, but it still fails with the same error.

longhuguoji commented 5 years ago

@overvenus @ice1000 @kennytm Can anyone help me solve this problem?

kennytm commented 5 years ago

@longhuguoji

Do you mean export GRPC_TRACE=all in /etc/profile? I added it that way and re-ran it, but it still fails with the same error.

Yes. Run importer like this:

GRPC_TRACE=all GRPC_VERBOSITY=DEBUG bin/tikv-importer --config conf/tikv-importer.toml 2> log.txt

This will print a huge amount of gRPC log into log.txt (several hundred MB or so). Once we've received the gRPC log, we can inspect what is preventing gRPC from connecting.

longhuguoji commented 5 years ago

@kennytm

1. My environment variables are configured as follows:

[root@app bin]# tail -n 2 /etc/profile
export grpc_trace=all
export grpc_verbosity=debug
[root@app bin]# echo $grpc_trace
all
[root@app bin]# echo $grpc_verbosity
debug

2. After starting tikv-importer and then tidb-lightning, log.txt has no log output at all, while log2.txt still only records the gRPC error. The start commands were:

grpc_trace=all grpc_verbosity=debug ./tikv-importer -C ../conf/tikv-importer.toml 2> log.txt
grpc_trace=all grpc_verbosity=debug ./tidb-lightning -config ../conf/tidb-lightning.toml 2> log2.txt

log2.txt contains:

Error: rpc error: code = Unknown desc = Grpc(RpcFailure(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))

Did I run something incorrectly? My log.txt does not contain a large volume of logs.

lonng commented 5 years ago

@longhuguoji The variable names are case-sensitive: they must be GRPC_TRACE=all GRPC_VERBOSITY=DEBUG, in upper case.

longhuguoji commented 5 years ago

@kennytm @lonng The tikv-importer output log is as follows: log.txt

longhuguoji commented 5 years ago

@kennytm @lonng

I have a much larger file on my side that exceeds the upload size limit. If the log.txt above is not sufficient, how should I send you the log file?

longhuguoji commented 5 years ago

@kennytm @lonng The following lines do not appear in the attachment above; the rest of the log file just keeps repeating this information:

I0214 13:02:00.328441803 47612 completion_queue.c:614] cq_end_op_for_next(exec_ctx=0x7ffdff30a570, cq=0x7f59cb4683c0, tag=0x7f59bdbfc980, error={"created":"@1550120520.297559116","description":"Server Shutdown","file":"/home/jenkins/.cargo/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.2.3/grpc/src/core/lib/surface/server.c","file_line":1297}, done=0x56280840d830, done_arg=0x7f59bf3b7630, storage=0x7f59bf3b7660)
I0214 13:02:00.328447677 47633 completion_queue.c:831] grpc_completion_queue_next(cq=0x7f59cb4683c0, deadline=gpr_timespec { tv_sec: 9223372036854775807, tv_nsec: 0, clock_type: 1 }, reserved=(nil))
E0214 13:02:00.328451056 47612 completion_queue.c:617] Operation failed: tag=0x7f59bdbfc980, error={"created":"@1550120520.297559116","description":"Server Shutdown","file":"/home/jenkins/.cargo/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.2.3/grpc/src/core/lib/surface/server.c","file_line":1297}
I0214 13:02:00.328455411 47633 completion_queue.c:935] RETURN_EVENT[0x7f59cb4683c0]: OP_COMPLETE: tag:0x7f59baff5e00 ERROR

kennytm commented 5 years ago

PTAL @overvenus


@longhuguoji You may compress the log as a ZIP file and try to upload it again. If it still fails, you may email me at chenqingjing[at]pingcap.com.

overvenus commented 5 years ago

@longhuguoji Could you run grep -A10 -B10 "Connect" on the log?

longhuguoji commented 5 years ago

@overvenus result.log

kennytm commented 5 years ago

"target_address":"ipv4:127.0.0.1:20160"

The target IP is 127.0.0.1? 🤔

@longhuguoji Are the TiKV nodes' --addr (and --advertise-addr) correctly set to 20.4.18.{40,41,42}:20160?
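One rough way to check what each TiKV node actually advertises (a sketch; the tikv.toml path is an assumption based on the deployment directory mentioned earlier, so adjust it to your layout):

ps -ef | grep [t]ikv-server                            # look for --addr / --advertise-addr on the command line
grep -E 'addr|advertise' /data/deploy/conf/tikv.toml   # config path is an assumption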

overvenus commented 5 years ago

Looks like some TiKV nodes have the wrong address, as @kennytm says. Could you show the output of ./pd-ctl -u http://pd-address -d store?

longhuguoji commented 5 years ago

@kennytm @overvenus Running ./pd-ctl -u http://20.4.18.43:2379 -d store on the PD node prints the following:

[root@tidb4 bin]# ./pd-ctl -u http://20.4.18.43:2379 -d store
{
  "count": 3,
  "stores": [
    {
      "store": {
        "id": 1,
        "address": "20.4.18.40:20160",
        "version": "2.1.3",
        "state_name": "Up"
      },
      "status": {
        "capacity": "98 GiB",
        "available": "94 GiB",
        "leader_count": 202,
        "leader_weight": 1,
        "leader_score": 2504,
        "leader_size": 2504,
        "region_count": 698,
        "region_weight": 1,
        "region_score": 7508,
        "region_size": 7508,
        "start_ts": "2019-02-13T11:23:37+08:00",
        "last_heartbeat_ts": "2019-02-15T09:10:19.28612686+08:00",
        "uptime": "45h46m42.28612686s"
      }
    },
    {
      "store": {
        "id": 5,
        "address": "20.4.18.42:20160",
        "version": "2.1.3",
        "state_name": "Up"
      },
      "status": {
        "capacity": "98 GiB",
        "available": "89 GiB",
        "leader_count": 265,
        "leader_weight": 1,
        "leader_score": 2547,
        "leader_size": 2547,
        "region_count": 698,
        "region_weight": 1,
        "region_score": 7508,
        "region_size": 7508,
        "start_ts": "2019-02-12T16:30:21+08:00",
        "last_heartbeat_ts": "2019-02-15T09:10:21.595613963+08:00",
        "uptime": "64h40m0.595613963s"
      }
    },
    {
      "store": {
        "id": 1099,
        "address": "20.4.18.41:20160",
        "version": "2.1.3",
        "state_name": "Up"
      },
      "status": {
        "capacity": "98 GiB",
        "available": "95 GiB",
        "leader_count": 231,
        "leader_weight": 1,
        "leader_score": 2457,
        "leader_size": 2457,
        "region_count": 698,
        "region_weight": 1,
        "region_score": 7508,
        "region_size": 7508,
        "start_ts": "2019-02-12T16:25:42+08:00",
        "last_heartbeat_ts": "2019-02-15T09:10:19.789389529+08:00",
        "uptime": "64h44m37.789389529s"
      }
    }
  ]
}