pingcap / tidb-binlog

A tool used to collect and merge tidb's binlog for real-time data backup and synchronization.
Apache License 2.0

TiKV has already been moved to another machine, but drainer still sends messages to it #1000

Open jwongz opened 4 years ago

jwongz commented 4 years ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.

Migrated TiKV to another machine.

  2. What did you expect to see?

Normal operation.

  3. What did you see instead?

Drainer errors, with the following logs:


{"log":"[2020/09/03 11:04:23.736 +08:00] [INFO] [client_batch.go:257] [\"batchRecvLoop re-create streaming fail\"] [target=xxx:10000] [error=\"rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \\\"transport: Error while dialing dial tcp xxx:10000: connect: connection refused\\\"\"]\n","stream":"stdout","time":"2020-09-03T03:04:23.736332662Z"}
{"log":"[2020/09/03 11:04:23.782 +08:00] [INFO] [client_batch.go:257] [\"batchRecvLoop re-create streaming fail\"] [target=xxx:10000] [error=\"rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \\\"transport: Error while dialing dial tcp xxx:10000: connect: connection refused\\\"\"]\n","stream":"stdout","time":"2020-09-03T03:04:23.782749447Z"}
{"log":"[2020/09/03 11:04:23.872 +08:00] [INFO] [syncer.go:251] [\"write save point\"] [ts=419195062555246597]\n","stream":"stdout","time":"2020-09-03T03:04:23.872408938Z"}
{"log":"[mysql] 2020/09/03 11:04:23 packets.go:36: read tcp xxxa:32768-\u003exxxa:20076: read: connection reset by peer\n","stream":"stderr","time":"2020-09-03T03:04:23.897010135Z"}
{"log":"[2020/09/03 11:04:23.985 +08:00] [FATAL] [syncer.go:254] [\"save checkpoint failed\"] [ts=419195062555246597] [error=\"query sql failed: replace into tidb_binlog.checkpoint values(6625488485425191173, '{\\\"commitTS\\\":419195062555246597,\\\"ts-map\\\":{}}'): invalid connection\"] [errorVerbose=\"invalid connection\\nquery sql failed: replace into tidb_binlog.checkpoint values(6625488485425191173, '{\\\"commitTS\\\":419195062555246597,\\\"ts-map\\\":{}}')\\ngithub.com/pingcap/tidb-binlog/drainer/checkpoint.(*MysqlCheckPoint).Save\\n\\t/home/pirlo.wang/go/src/tidb-binlog/drainer/checkpoint/mysql.go:137\\ngithub.com/pingcap/tidb-binlog/drainer.(*Syncer).savePoint\\n\\t/home/pirlo.wang/go/src/tidb-binlog/drainer/syncer.go:252\\ngithub.com/pingcap/tidb-binlog/drainer.(*Syncer).handleSuccess\\n\\t/home/pirlo.wang/go/src/tidb-binlog/drainer/syncer.go:226\\ngithub.com/pingcap/tidb-binlog/drainer.(*Syncer).run.func1\\n\\t/home/pirlo.wang/go/src/tidb-binlog/drainer/syncer.go:270\\nruntime.goexit\\n\\t/usr/local/go/src/runtime/asm_amd64.s:1337\"] [stack=\"github.com/pingcap/log.Fatal\\n\\t/home/pirlo.wang/go/pkg/mod/github.com/pingcap/log@v0.0.0-20190715063458-479153f07ebd/global.go:59\\ngithub.com/pingcap/tidb-binlog/drainer.(*Syncer).savePoint\\n\\t/home/pirlo.wang/go/src/tidb-binlog/drainer/syncer.go:254\\ngithub.com/pingcap/tidb-binlog/drainer.(*Syncer).handleSuccess\\n\\t/home/pirlo.wang/go/src/tidb-binlog/drainer/syncer.go:226\\ngithub.com/pingcap/tidb-binlog/drainer.(*Syncer).run.func1\\n\\t/home/pirlo.wang/go/src/tidb-binlog/drainer/syncer.go:270\"]\n","stream":"stdout","time":"2020-09-03T03:04:23.98599254Z"}}
  4. Please provide the related downstream type and version of drainer. (run drainer -V in terminal to get drainer's version)

v3.0.5

kennytm commented 4 years ago

PTAL @july2993.

Error caused by an invalid connection to the checkpoint database (judging from the roles of the IP addresses). There should probably be a retry loop in (*MysqlCheckPoint).Save.
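
Such a retry loop could look roughly like the sketch below. This is a minimal illustration only, assuming a plain database/sql handle and the checkpoint statement visible in the FATAL log above; saveCheckpointWithRetry and the DSN are made-up names, and the real implementation lives in drainer/checkpoint/mysql.go.

```go
package main

import (
	"database/sql"
	"fmt"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

// saveCheckpointWithRetry retries the checkpoint REPLACE instead of giving up
// on the first transient error such as "invalid connection". A broken
// connection is returned to the pool, so the next Exec dials a fresh one.
func saveCheckpointWithRetry(db *sql.DB, clusterID uint64, payload string, maxRetries int) error {
	var err error
	for i := 0; i < maxRetries; i++ {
		_, err = db.Exec(
			"REPLACE INTO tidb_binlog.checkpoint VALUES (?, ?)",
			clusterID, payload,
		)
		if err == nil {
			return nil
		}
		// Simple linear backoff before the next attempt.
		time.Sleep(time.Second * time.Duration(i+1))
	}
	return fmt.Errorf("save checkpoint failed after %d retries: %w", maxRetries, err)
}

func main() {
	// Hypothetical DSN for the downstream checkpoint database.
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/tidb_binlog")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Values taken from the FATAL log line above.
	payload := `{"commitTS":419195062555246597,"ts-map":{}}`
	if err := saveCheckpointWithRetry(db, 6625488485425191173, payload, 3); err != nil {
		fmt.Println(err)
	}
}
```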

july2993 commented 4 years ago

> PTAL @july2993.
>
> Error caused by an invalid connection to the checkpoint database (judging from the roles of the IP addresses). There should probably be a retry loop in (*MysqlCheckPoint).Save.

There's no retry now. Drainer simply fatals if it fails to save the checkpoint (saved in the downstream DB by default). I don't think the scale-in of TiKV at the upstream is related to this problem.
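
For context, here is a minimal self-contained sketch of the no-retry behavior described here, matching the FATAL stack trace above. The Checkpoint interface, savePoint, and brokenCheckpoint below are illustrative names, not drainer's real types.

```go
package main

import (
	"errors"
	"log"
)

// Checkpoint stands in for drainer's checkpoint abstraction; the real type is
// (*MysqlCheckPoint) with a Save method, per the stack trace above.
type Checkpoint interface {
	Save(commitTS int64) error
}

// savePoint mirrors the described behavior: no retry, fatal on first failure.
func savePoint(cp Checkpoint, ts int64) {
	if err := cp.Save(ts); err != nil {
		// Analogous to the [FATAL] [syncer.go:254] "save checkpoint failed"
		// line: a single "invalid connection" error terminates the process.
		log.Fatalf("save checkpoint failed ts=%d err=%v", ts, err)
	}
	log.Printf("write save point ts=%d", ts)
}

// brokenCheckpoint simulates the downstream DB rejecting the connection.
type brokenCheckpoint struct{}

func (brokenCheckpoint) Save(int64) error { return errors.New("invalid connection") }

func main() {
	savePoint(brokenCheckpoint{}, 419195062555246597)
}
```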

jwongz commented 4 years ago

Since TiKV was moved two days ago, this log line has never been printed:

{"log":"[2020/09/03 11:11:00.859 +08:00] [INFO] [syncer.go:251] [\"write save point\"] [ts=419195167884705793]\n","stream":"stdout","time":"2020-09-03T03:11:00.85920493Z"}

And this log line has been printed for two days:

{"log":"[2020/09/03 11:04:23.736 +08:00] [INFO] [client_batch.go:257] [\"batchRecvLoop re-create streaming fail\"] [target=xxx:10000] [error=\"rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \\\"transport: Error while dialing dial tcp xxx:10000: connect: connection refused\\\"\"]\n","stream":"stdout","time":"2020-09-03T03:04:23.736332662Z"}

jwongz commented 4 years ago

After restarting the drainer server, it's OK now.

july2993 commented 4 years ago

{"log":"[2020/09/03 11:04:23.736 +08:00] [INFO] [client_batch.go:257] [\"batchRecvLoop re-create streaming fail\"] [target=xxx:10000] [error=\"rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \\"transport: Error while dialing dial tcp xxx:10000: connect: connection refused\\"\"]\n","stream":"stdout","time":"2020-09-03T03:04:23.736332662Z"}

This is probably a bug in the tikv client (https://github.com/pingcap/tidb/pull/10301), and it is already fixed in the latest version.

But I'm not sure why it would cause drainer to stop moving forward and saving checkpoints.
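
To make the failure mode concrete, here is a generic, self-contained sketch of the re-create-streaming pattern behind that log line. It shows only the assumed shape of the loop, not the actual code in client_batch.go nor the fix in the PR above; recvFunc, dialFunc, and batchRecvLoop are invented names.

```go
package main

import (
	"errors"
	"log"
	"time"
)

// recvFunc and dialFunc stand in for a gRPC stream's Recv method and the
// stream-creation call.
type recvFunc func() (string, error)
type dialFunc func(target string) (recvFunc, error)

// batchRecvLoop re-creates the stream with a fixed backoff whenever it breaks.
// Because the target address is cached, a store that has moved to another
// machine produces the same "re-create streaming fail" log line indefinitely.
func batchRecvLoop(target string, dial dialFunc) {
	recv, err := dial(target)
	for {
		if err == nil {
			var msg string
			if msg, err = recv(); err == nil {
				log.Println("handled:", msg)
				continue
			}
		}
		log.Printf("batchRecvLoop re-create streaming fail target=%s error=%v", target, err)
		time.Sleep(time.Second) // backoff before redialing the same target
		recv, err = dial(target)
	}
}

func main() {
	// A dial function that always fails, mimicking the moved TiKV store;
	// this loops forever printing the failure line (Ctrl-C to stop).
	dead := func(target string) (recvFunc, error) {
		return nil, errors.New("connect: connection refused")
	}
	batchRecvLoop("xxx:10000", dead)
}
```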