pingcap / tiflow

This repo maintains DM (a data migration platform) and TiCDC (change data capture for TiDB)
Apache License 2.0
428 stars 286 forks source link

ticdc restart after network partition recover from one of ticdc #9940

Open Lily2025 opened 1 year ago

Lily2025 commented 1 year ago

What did you do?

1、run workload storage:"s3://benchmark/sysbench_64_7000w" subType:"oltp_read_write" db:"sysbench_64_7000w" tableNum:64 tableSize:70000000 threads:32 ignoreErrors:"2013,1213,1105,1205,8022,8028,9004,9007,1062" 2、inject network partition between one of ticdc and other pods

What did you expect to see?

What did you see instead?

1、one of ticdc restart

[2024/02/28 19:08:37.410 +08:00] [ERROR] [tso_dispatcher.go:562] ["[tso] update connection contexts failed"] [dc=global] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp [10.200.49.116:2379](http://10.200.49.116:2379/): i/o timeout\""]
[2024/02/28 19:08:37.410 +08:00] [ERROR] [pd.go:228] ["updateTS error"] [txnScope=global] [error="context canceled"] [errorVerbose="context canceled\[ngithub.com/tikv/pd/client.(*tsoRequest).Wait\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_dispatcher.go:118\ngithub.com/tikv/pd/client.(*client).GetTS\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:803\ngithub.com/tikv/client-go/v2/util.InterceptedPDClient.GetTS\n\tgithub.com/tikv/client-go/v2@v2.0.8-0.20240205071126-11cb7985f0ec/util/pd_interceptor.go:81\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).getTimestamp\n\tgithub.com/tikv/client-go/v2@v2.0.8-0.20240205071126-11cb7985f0ec/oracle/oracles/pd.go:147\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS.func1\n\tgithub.com/tikv/client-go/v2@v2.0.8-0.20240205071126-11cb7985f0ec/oracle/oracles/pd.go:226\nsync.(*Map).Range\n\tsync/map.go:476\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS\n\tgithub.com/tikv/client-go/v2@v2.0.8-0.20240205071126-11cb7985f0ec/oracle/oracles/pd.go:224\nruntime.goexit\n\truntime/asm_amd64.s:1650\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).getTimestamp\n\tgithub.com/tikv/client-go/v2@v2.0.8-0.20240205071126-11cb7985f0ec/oracle/oracles/pd.go:152\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS.func1\n\tgithub.com/tikv/client-go/v2@v2.0.8-0.20240205071126-11cb7985f0ec/oracle/oracles/pd.go:226\nsync.(*Map).Range\n\tsync/map.go:476\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS\n\tgithub.com/tikv/client-go/v2@v2.0.8-0.20240205071126-11cb7985f0ec/oracle/oracles/pd.go:224\nruntime.goexit\n\truntime/asm_amd64.s:1650](http://ngithub.com/tikv/pd/client.(*tsoRequest).Wait/n/tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_dispatcher.go:118/ngithub.com/tikv/pd/client.(*client).GetTS/n/tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:803/ngithub.com/tikv/client-go/v2/util.InterceptedPDClient.GetTS/n/tgithub.com/tikv/client-go/v2@v2.0.8-0.20240205071126-11cb7985f0ec/util/pd_interceptor.go:81/ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).getTimestamp/n/tgithub.com/tikv/client-go/v2@v2.0.8-0.20240205071126-11cb7985f0ec/oracle/oracles/pd.go:147/ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS.func1/n/tgithub.com/tikv/client-go/v2@v2.0.8-0.20240205071126-11cb7985f0ec/oracle/oracles/pd.go:226/nsync.(*Map).Range/n/tsync/map.go:476/ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS/n/tgithub.com/tikv/client-go/v2@v2.0.8-0.20240205071126-11cb7985f0ec/oracle/oracles/pd.go:224/nruntime.goexit/n/truntime/asm_amd64.s:1650/ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).getTimestamp/n/tgithub.com/tikv/client-go/v2@v2.0.8-0.20240205071126-11cb7985f0ec/oracle/oracles/pd.go:152/ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS.func1/n/tgithub.com/tikv/client-go/v2@v2.0.8-0.20240205071126-11cb7985f0ec/oracle/oracles/pd.go:226/nsync.(*Map).Range/n/tsync/map.go:476/ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS/n/tgithub.com/tikv/client-go/v2@v2.0.8-0.20240205071126-11cb7985f0ec/oracle/oracles/pd.go:224/nruntime.goexit/n/truntime/asm_amd64.s:1650)"]
[2024/02/28 19:08:37.410 +08:00] [INFO] [tso_dispatcher.go:344] ["[tso] exit tso dispatcher"] [dc-location=global]
[2024/02/28 19:08:37.410 +08:00] [INFO] [tso_client.go:139] ["close tso client"]
[2024/02/28 19:08:37.410 +08:00] [INFO] [tso_client.go:150] ["tso client is closed"]
[2024/02/28 19:08:37.410 +08:00] [INFO] [pd_service_discovery.go:664] ["[pd] close pd service discovery client"]
[2024/02/28 19:08:37.410 +08:00] [INFO] [client.go:319] ["[pd] http client closed"] [source=tikv-driver]
[2024/02/28 19:08:37.413 +08:00] [WARN] [upstream.go:299] ["etcd session close failed"] [error="etcdserver: requested lease not found"]
[2024/02/28 19:08:37.413 +08:00] [INFO] [upstream.go:305] ["upstream closed"] [upstreamID=7340490029962833542]
[2024/02/28 19:08:38.370 +08:00] [ERROR] [pd_service_discovery.go:613] ["[pd] failed to update service mode"] [urls="[http://tc-pd-0.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-1.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379]"] [error="[PD:client:ErrClientGetClusterInfo]error:rpc error: code = DeadlineExceeded desc = context deadline exceeded target:[tc-pd-2.tc](http://tc-pd-2.tc/)-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY: error:rpc error: code = DeadlineExceeded desc = context deadline exceeded target:[tc-pd-2.tc](http://tc-pd-2.tc/)-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY"]
[2024/02/28 19:08:38.379 +08:00] [WARN] [server.go:315] ["etcd health check: cannot collect all members"] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"] [errorVerbose="rpc error: code = DeadlineExceeded desc = context deadline exceeded\[ngithub.com/tikv/pd/client.(*client).respForErr\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:1550\ngithub.com/tikv/pd/client.(*client).GetAllMembers\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:735\ngithub.com/pingcap/tiflow/pkg/pdutil.(*pdAPIClient).CollectMemberEndpoints\n\tgithub.com/pingcap/tiflow/pkg/pdutil/api_client.go:346\ngithub.com/pingcap/tiflow/cdc/server.(*server).upstreamPDHealthChecker\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:313\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func1\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:347\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.6.0/errgroup/errgroup.go:78\nruntime.goexit\n\truntime/asm_amd64.s:1650](http://ngithub.com/tikv/pd/client.(*client).respForErr/n/tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:1550/ngithub.com/tikv/pd/client.(*client).GetAllMembers/n/tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:735/ngithub.com/pingcap/tiflow/pkg/pdutil.(*pdAPIClient).CollectMemberEndpoints/n/tgithub.com/pingcap/tiflow/pkg/pdutil/api_client.go:346/ngithub.com/pingcap/tiflow/cdc/server.(*server).upstreamPDHealthChecker/n/tgithub.com/pingcap/tiflow/cdc/server/server.go:313/ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func1/n/tgithub.com/pingcap/tiflow/cdc/server/server.go:347/ngolang.org/x/sync/errgroup.(*Group).Go.func1/n/tgolang.org/x/sync@v0.6.0/errgroup/errgroup.go:78/nruntime.goexit/n/truntime/asm_amd64.s:1650)"]
[2024/02/28 19:08:47.414 +08:00] [WARN] [check.go:88] ["check TiKV version failed"] [error="[CDC:ErrGetAllStoresFailed]get stores from pd failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded"] [errorVerbose="[CDC:ErrGetAllStoresFailed]get stores from pd failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded\[ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:34\ngithub.com/pingcap/tiflow/pkg/version.CheckStoreVersion\n\tgithub.com/pingcap/tiflow/pkg/version/check.go:209\ngithub.com/pingcap/tiflow/pkg/version.CheckClusterVersion\n\tgithub.com/pingcap/tiflow/pkg/version/check.go:83\ngithub.com/pingcap/tiflow/pkg/upstream.initUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/upstream.go:179\ngithub.com/pingcap/tiflow/pkg/upstream.(*Manager).AddDefaultUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/manager.go:116\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).reset\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:250\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:333\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).Run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:308\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func6\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:372\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.6.0/errgroup/errgroup.go:78\nruntime.goexit\n\truntime/asm_amd64.s:1650](http://ngithub.com/pingcap/errors.AddStack/n/tgithub.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/errors.go:174/ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs/n/tgithub.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/normalize.go:164/ngithub.com/pingcap/tiflow/pkg/errors.WrapError/n/tgithub.com/pingcap/tiflow/pkg/errors/helper.go:34/ngithub.com/pingcap/tiflow/pkg/version.CheckStoreVersion/n/tgithub.com/pingcap/tiflow/pkg/version/check.go:209/ngithub.com/pingcap/tiflow/pkg/version.CheckClusterVersion/n/tgithub.com/pingcap/tiflow/pkg/version/check.go:83/ngithub.com/pingcap/tiflow/pkg/upstream.initUpstream/n/tgithub.com/pingcap/tiflow/pkg/upstream/upstream.go:179/ngithub.com/pingcap/tiflow/pkg/upstream.(*Manager).AddDefaultUpstream/n/tgithub.com/pingcap/tiflow/pkg/upstream/manager.go:116/ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).reset/n/tgithub.com/pingcap/tiflow/cdc/capture/capture.go:250/ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run/n/tgithub.com/pingcap/tiflow/cdc/capture/capture.go:333/ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).Run/n/tgithub.com/pingcap/tiflow/cdc/capture/capture.go:308/ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func6/n/tgithub.com/pingcap/tiflow/cdc/server/server.go:372/ngolang.org/x/sync/errgroup.(*Group).Go.func1/n/tgolang.org/x/sync@v0.6.0/errgroup/errgroup.go:78/nruntime.goexit/n/truntime/asm_amd64.s:1650)"]
[2024/02/28 19:08:47.427 +08:00] [INFO] [pd_service_discovery.go:1016] ["[pd] update member urls"] [old-urls="[http://tc-pd-0.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-1.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd:2379]"] [new-urls="[http://tc-pd-0.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-1.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379]"]
[2024/02/28 19:08:47.427 +08:00] [INFO] [pd_service_discovery.go:1043] ["[pd] switch leader"] [new-leader=http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379/] [old-leader=]
[2024/02/28 19:08:47.427 +08:00] [INFO] [pd_service_discovery.go:525] ["[pd] init cluster id"] [cluster-id=7340490029962833542]
[2024/02/28 19:08:47.427 +08:00] [INFO] [client.go:606] ["[pd] changing service mode"] [old-mode=UNKNOWN_SVC_MODE] [new-mode=PD_SVC_MODE]
[2024/02/28 19:08:47.427 +08:00] [INFO] [tso_client.go:231] ["[tso] switch dc tso global allocator serving address"] [dc-location=global] [new-address=http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379/]
[2024/02/28 19:08:47.428 +08:00] [INFO] [tso_dispatcher.go:323] ["[tso] tso dispatcher created"] [dc-location=global]
[2024/02/28 19:08:47.428 +08:00] [INFO] [client.go:654] ["[pd] service mode changed"] [old-mode=UNKNOWN_SVC_MODE] [new-mode=PD_SVC_MODE]
[2024/02/28 19:08:47.429 +08:00] [INFO] [tikv_driver.go:200] ["using API V1."]
[2024/02/28 19:08:47.429 +08:00] [INFO] [tso_dispatcher.go:441] ["[tso] tso stream is not ready"] [dc=global]
[2024/02/28 19:08:48.371 +08:00] [ERROR] [pd_service_discovery.go:613] ["[pd] failed to update service mode"] [urls="[http://tc-pd-0.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-1.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379]"] [error="[PD:client:ErrClientGetClusterInfo]error:rpc error: code = DeadlineExceeded desc = context deadline exceeded target:[tc-pd-2.tc](http://tc-pd-2.tc/)-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY: error:rpc error: code = DeadlineExceeded desc = context deadline exceeded target:[tc-pd-2.tc](http://tc-pd-2.tc/)-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY"]
[2024/02/28 19:08:48.380 +08:00] [WARN] [server.go:315] ["etcd health check: cannot collect all members"] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"] [errorVerbose="rpc error: code = DeadlineExceeded desc = context deadline exceeded\[ngithub.com/tikv/pd/client.(*client).respForErr\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:1550\ngithub.com/tikv/pd/client.(*client).GetAllMembers\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:735\ngithub.com/pingcap/tiflow/pkg/pdutil.(*pdAPIClient).CollectMemberEndpoints\n\tgithub.com/pingcap/tiflow/pkg/pdutil/api_client.go:346\ngithub.com/pingcap/tiflow/cdc/server.(*server).upstreamPDHealthChecker\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:313\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func1\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:347\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.6.0/errgroup/errgroup.go:78\nruntime.goexit\n\truntime/asm_amd64.s:1650](http://ngithub.com/tikv/pd/client.(*client).respForErr/n/tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:1550/ngithub.com/tikv/pd/client.(*client).GetAllMembers/n/tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:735/ngithub.com/pingcap/tiflow/pkg/pdutil.(*pdAPIClient).CollectMemberEndpoints/n/tgithub.com/pingcap/tiflow/pkg/pdutil/api_client.go:346/ngithub.com/pingcap/tiflow/cdc/server.(*server).upstreamPDHealthChecker/n/tgithub.com/pingcap/tiflow/cdc/server/server.go:313/ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func1/n/tgithub.com/pingcap/tiflow/cdc/server/server.go:347/ngolang.org/x/sync/errgroup.(*Group).Go.func1/n/tgolang.org/x/sync@v0.6.0/errgroup/errgroup.go:78/nruntime.goexit/n/truntime/asm_amd64.s:1650)"]
[2024/02/28 19:08:57.429 +08:00] [ERROR] [tso_dispatcher.go:202] ["[tso] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2024/02/28 19:08:57.429 +08:00] [ERROR] [tso_dispatcher.go:498] ["[tso] getTS error after processing requests"] [dc-location=global] [stream-addr=http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379/] [error="[PD:client:ErrClientGetTSO]get TSO failed, %v: rpc error: code = Canceled desc = context canceled"]
[2024/02/28 19:08:57.429 +08:00] [ERROR] [capture.go:335] ["reset capture failed"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\[ngithub.com/tikv/pd/client.(*pdTSOStream).processRequests\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_stream.go:149\ngithub.com/tikv/pd/client.(*tsoClient).processRequests\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_dispatcher.go:763\ngithub.com/tikv/pd/client.(*tsoClient).handleDispatcher\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_dispatcher.go:488\nruntime.goexit\n\truntime/asm_amd64.s:1650\ngithub.com/tikv/pd/client.(*tsoRequest).Wait\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_dispatcher.go:104\ngithub.com/tikv/pd/client.(*client).GetTS\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:803\ngithub.com/pingcap/tiflow/pkg/pdutil.NewClock\n\tgithub.com/pingcap/tiflow/pkg/pdutil/clock.go:62\ngithub.com/pingcap/tiflow/pkg/upstream.initUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/upstream.go:197\ngithub.com/pingcap/tiflow/pkg/upstream.(*Manager).AddDefaultUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/manager.go:116\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).reset\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:250\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:333\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).Run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:308\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func6\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:372\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.6.0/errgroup/errgroup.go:78\nruntime.goexit\n\truntime/asm_amd64.s:1650](http://ngithub.com/tikv/pd/client.(*pdTSOStream).processRequests/n/tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_stream.go:149/ngithub.com/tikv/pd/client.(*tsoClient).processRequests/n/tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_dispatcher.go:763/ngithub.com/tikv/pd/client.(*tsoClient).handleDispatcher/n/tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_dispatcher.go:488/nruntime.goexit/n/truntime/asm_amd64.s:1650/ngithub.com/tikv/pd/client.(*tsoRequest).Wait/n/tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_dispatcher.go:104/ngithub.com/tikv/pd/client.(*client).GetTS/n/tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:803/ngithub.com/pingcap/tiflow/pkg/pdutil.NewClock/n/tgithub.com/pingcap/tiflow/pkg/pdutil/clock.go:62/ngithub.com/pingcap/tiflow/pkg/upstream.initUpstream/n/tgithub.com/pingcap/tiflow/pkg/upstream/upstream.go:197/ngithub.com/pingcap/tiflow/pkg/upstream.(*Manager).AddDefaultUpstream/n/tgithub.com/pingcap/tiflow/pkg/upstream/manager.go:116/ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).reset/n/tgithub.com/pingcap/tiflow/cdc/capture/capture.go:250/ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run/n/tgithub.com/pingcap/tiflow/cdc/capture/capture.go:333/ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).Run/n/tgithub.com/pingcap/tiflow/cdc/capture/capture.go:308/ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func6/n/tgithub.com/pingcap/tiflow/cdc/server/server.go:372/ngolang.org/x/sync/errgroup.(*Group).Go.func1/n/tgolang.org/x/sync@v0.6.0/errgroup/errgroup.go:78/nruntime.goexit/n/truntime/asm_amd64.s:1650)"]
[2024/02/28 19:08:57.430 +08:00] [INFO] [capture.go:328] ["the capture routine has exited"]
[2024/02/28 19:08:57.430 +08:00] [WARN] [server.go:315] ["etcd health check: cannot collect all members"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\[ngithub.com/tikv/pd/client.(*client).respForErr\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:1550\ngithub.com/tikv/pd/client.(*client).GetAllMembers\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:735\ngithub.com/pingcap/tiflow/pkg/pdutil.(*pdAPIClient).CollectMemberEndpoints\n\tgithub.com/pingcap/tiflow/pkg/pdutil/api_client.go:346\ngithub.com/pingcap/tiflow/cdc/server.(*server).upstreamPDHealthChecker\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:313\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func1\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:347\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.6.0/errgroup/errgroup.go:78\nruntime.goexit\n\truntime/asm_amd64.s:1650](http://ngithub.com/tikv/pd/client.(*client).respForErr/n/tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:1550/ngithub.com/tikv/pd/client.(*client).GetAllMembers/n/tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:735/ngithub.com/pingcap/tiflow/pkg/pdutil.(*pdAPIClient).CollectMemberEndpoints/n/tgithub.com/pingcap/tiflow/pkg/pdutil/api_client.go:346/ngithub.com/pingcap/tiflow/cdc/server.(*server).upstreamPDHealthChecker/n/tgithub.com/pingcap/tiflow/cdc/server/server.go:313/ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func1/n/tgithub.com/pingcap/tiflow/cdc/server/server.go:347/ngolang.org/x/sync/errgroup.(*Group).Go.func1/n/tgolang.org/x/sync@v0.6.0/errgroup/errgroup.go:78/nruntime.goexit/n/truntime/asm_amd64.s:1650)"]
[2024/02/28 19:08:57.430 +08:00] [ERROR] [server.go:298] ["http server error"] [error="[CDC:ErrServeHTTP]serve http error: mux: server closed"] [errorVerbose="[CDC:ErrServeHTTP]serve http error: mux: server closed\[ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:34\ngithub.com/pingcap/tiflow/cdc/server.(*server).startStatusHTTP.func1\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:298\nruntime.goexit\n\truntime/asm_amd64.s:1650](http://ngithub.com/pingcap/errors.AddStack/n/tgithub.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/errors.go:174/ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs/n/tgithub.com/pingcap/errors@v0.11.5-0.20231212100244-799fae176cfb/normalize.go:164/ngithub.com/pingcap/tiflow/pkg/errors.WrapError/n/tgithub.com/pingcap/tiflow/pkg/errors/helper.go:34/ngithub.com/pingcap/tiflow/cdc/server.(*server).startStatusHTTP.func1/n/tgithub.com/pingcap/tiflow/cdc/server/server.go:298/nruntime.goexit/n/truntime/asm_amd64.s:1650)"]
[2024/02/28 19:08:57.430 +08:00] [WARN] [server.go:139] ["cdc server exits with error"] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled\[ngithub.com/tikv/pd/client.(*pdTSOStream).processRequests\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_stream.go:149\ngithub.com/tikv/pd/client.(*tsoClient).processRequests\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_dispatcher.go:763\ngithub.com/tikv/pd/client.(*tsoClient).handleDispatcher\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_dispatcher.go:488\nruntime.goexit\n\truntime/asm_amd64.s:1650\ngithub.com/tikv/pd/client.(*tsoRequest).Wait\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_dispatcher.go:104\ngithub.com/tikv/pd/client.(*client).GetTS\n\tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:803\ngithub.com/pingcap/tiflow/pkg/pdutil.NewClock\n\tgithub.com/pingcap/tiflow/pkg/pdutil/clock.go:62\ngithub.com/pingcap/tiflow/pkg/upstream.initUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/upstream.go:197\ngithub.com/pingcap/tiflow/pkg/upstream.(*Manager).AddDefaultUpstream\n\tgithub.com/pingcap/tiflow/pkg/upstream/manager.go:116\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).reset\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:250\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:333\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).Run\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:308\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func6\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:372\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.6.0/errgroup/errgroup.go:78\nruntime.goexit\n\truntime/asm_amd64.s:1650](http://ngithub.com/tikv/pd/client.(*pdTSOStream).processRequests/n/tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_stream.go:149/ngithub.com/tikv/pd/client.(*tsoClient).processRequests/n/tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_dispatcher.go:763/ngithub.com/tikv/pd/client.(*tsoClient).handleDispatcher/n/tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_dispatcher.go:488/nruntime.goexit/n/truntime/asm_amd64.s:1650/ngithub.com/tikv/pd/client.(*tsoRequest).Wait/n/tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/tso_dispatcher.go:104/ngithub.com/tikv/pd/client.(*client).GetTS/n/tgithub.com/tikv/pd/client@v0.0.0-20240126020320-567c7d43a008/client.go:803/ngithub.com/pingcap/tiflow/pkg/pdutil.NewClock/n/tgithub.com/pingcap/tiflow/pkg/pdutil/clock.go:62/ngithub.com/pingcap/tiflow/pkg/upstream.initUpstream/n/tgithub.com/pingcap/tiflow/pkg/upstream/upstream.go:197/ngithub.com/pingcap/tiflow/pkg/upstream.(*Manager).AddDefaultUpstream/n/tgithub.com/pingcap/tiflow/pkg/upstream/manager.go:116/ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).reset/n/tgithub.com/pingcap/tiflow/cdc/capture/capture.go:250/ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run/n/tgithub.com/pingcap/tiflow/cdc/capture/capture.go:333/ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).Run/n/tgithub.com/pingcap/tiflow/cdc/capture/capture.go:308/ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func6/n/tgithub.com/pingcap/tiflow/cdc/server/server.go:372/ngolang.org/x/sync/errgroup.(*Group).Go.func1/n/tgolang.org/x/sync@v0.6.0/errgroup/errgroup.go:78/nruntime.goexit/n/truntime/asm_amd64.s:1650)"]
[2024/02/28 19:08:57.430 +08:00] [INFO] [capture.go:707] ["message router closed"] [captureID=a277c9b2-c0b6-4ef0-aa9d-3d51b50cd83f]
[2024/02/28 19:08:57.432 +08:00] [INFO] [server.go:424] ["sort engine manager closed"] [duration=2.032547ms]
[2024/02/28 19:08:57.432 +08:00] [INFO] [pd_service_discovery.go:577] ["[pd] exit member loop due to context canceled"]
[2024/02/28 19:08:57.432 +08:00] [INFO] [resource_manager_client.go:295] ["[resource manager] exit resource token dispatcher"]
[2024/02/28 19:08:57.432 +08:00] [INFO] [tso_dispatcher.go:240] ["exit tso dispatcher loop"]
[2024/02/28 19:08:57.432 +08:00] [INFO] [tso_dispatcher.go:410] ["[tso] stop fetching the pending tso requests due to context canceled"] [dc-location=global]
[2024/02/28 19:08:57.432 +08:00] [INFO] [tso_dispatcher.go:344] ["[tso] exit tso dispatcher"] [dc-location=global]
[2024/02/28 19:08:57.432 +08:00] [INFO] [tso_dispatcher.go:186] ["exit tso requests cancel loop"]
[2024/02/28 19:08:57.432 +08:00] [ERROR] [pd_service_discovery.go:613] ["[pd] failed to update service mode"] [urls="[http://tc-pd-0.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-1.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379]"] [error="[PD:client:ErrClientGetClusterInfo]error:rpc error: code = Canceled desc = context canceled target:[tc-pd-2.tc](http://tc-pd-2.tc/)-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY: error:rpc error: code = Canceled desc = context canceled target:[tc-pd-2.tc](http://tc-pd-2.tc/)-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY"]
[2024/02/28 19:08:57.432 +08:00] [ERROR] [pd_service_discovery.go:613] ["[pd] failed to update service mode"] [urls="[http://tc-pd-0.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-1.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379,http://tc-pd-2.tc-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379]"] [error="[PD:client:ErrClientGetClusterInfo]error:rpc error: code = Canceled desc = context canceled target:[tc-pd-2.tc](http://tc-pd-2.tc/)-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY: error:rpc error: code = Canceled desc = context canceled target:[tc-pd-2.tc](http://tc-pd-2.tc/)-pd-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:2379 status:READY"]
[2024/02/28 19:08:57.432 +08:00] [INFO] [tso_client.go:134] ["closing tso client"]
[2024/02/28 19:08:57.432 +08:00] [INFO] [tso_client.go:139] ["close tso client"]
[2024/02/28 19:08:57.432 +08:00] [INFO] [tso_client.go:150] ["tso client is closed"]
[2024/02/28 19:08:57.432 +08:00] [INFO] [pd_service_discovery.go:664] ["[pd] close pd service discovery client"]
[2024/02/28 19:08:59.752 +08:00] [INFO] [helper.go:54] ["init log"] [file=/var/lib/ticdc/log/ticdc.log] [level=info]
[2024/02/28 19:08:59.752 +08:00] [INFO] [tz.go:34] ["Use the timezone of the TiCDC server machine"] [timezoneName=System] [timezone=Asia/Shanghai]
[2024/02/28 19:08:59.752 +08:00] [INFO] [version.go:47] ["Welcome to Change Data Capture (CDC)"] [release-version=v8.0.0-alpha] [git-hash=25ce29c2a1802bbb4cd26008f322728959a91f7a] [git-branch=heads/refs/tags/v8.0.0-alpha] [utc-build-time="2024-02-27 11:37:29"] [go-version="go version go1.21.6 linux/amd64"] [failpoint-build=false]
[2024/02/28 19:08:59.752 +08:00] [INFO] [server.go:125] ["CDC server created"] [pd="[http://tc-pd:2379/]"] [config="{\"addr\":\"[0.0.0.0:8301](http://0.0.0.0:8301/)\",\"advertise-addr\":\"[tc-ticdc-1.tc](http://tc-ticdc-1.tc/)-ticdc-peer.endless-ha-test-ticdc-tps-7080582-1-976.svc:8301\",\"log-file\":\"/var/lib/ticdc/log/ticdc.log\",\"log-level\":\"info\",\"log\":{\"file\":{\"max-size\":301,\"max-days\":0,\"max-backups\":0},\"error-output\":\"stderr\"},\"data-dir\":\"\",\"gc-ttl\":86400,\"tz\":\"System\",\"capture-session-ttl\":10,\"owner-flush-interval\":50000000,\"processor-flush-interval\":50000000,\"sorter\":{\"sort-dir\":\"/tmp/sorter\",\"cache-size-in-mb\":128},\"security\":{\"ca-path\":\"\",\"cert-path\":\"\",\"key-path\":\"\",\"cert-allowed-cn\":null,\"mtls\":false,\"client-user-required\":false,\"client-allowed-user\":null},\"kv-client\":{\"enable-multiplexing\":true,\

Versions of the cluster

git hash:25ce29c2a1802bbb4cd26008f322728959a91f7a

Lily2025 commented 1 year ago

/remove-area dm /area ticdc

Lily2025 commented 1 year ago

/type question

ti-chi-bot[bot] commented 1 year ago

@Lily2025: These labels are not set on the issue: type/bug.

In response to [this](https://github.com/pingcap/tiflow/issues/9940#issuecomment-1774382049): >/remove-type bug Instructions for interacting with me using PR comments are available [here](https://prow.tidb.net/command-help). If you have questions or suggestions related to my behavior, please file an issue against the [ti-community-infra/tichi](https://github.com/ti-community-infra/tichi/issues/new?title=Prow%20issue:) repository.
Lily2025 commented 5 months ago

from @asddongmen, it was by designed First of all, after the error injection, the connection between cdc and pd is disconnected and the session times out. In this case, capture fails. The capture thread will automatically try to rebuild. During the reconstruction process, PD will be connected several times, and some timeouts will be set within cdc. If this timeout is set within the cdc repository, it will only cause the capture thread to retry again. The reason for the restart is that capture attempts to reconnect PD and passes some operations, and reaches the PDClient.GetTs interface. This interface sets a timeout period by itself, which will return an error not handled by cdc. This error is thrown outside the capture thread, causing a restart at the server process level.

Why I'm not going to add this error to the error list of internal retries of the capture thread: Because it is actually difficult to get to this step, plus it also represents the connection problem between CDC and PD, at this time, directly restart the process and refresh the memory state may recover from the error faster. (The log also proves that the server recovered immediately after restart without causing any worse phenomena, so I think it is a good choice to keep the status quo)

fubinzh commented 5 months ago

@Lily2025 What is the CDC lag in this scenario?

asddongmen commented 5 months ago

from @asddongmen, it was by designed First of all, after the error injection, the connection between cdc and pd is disconnected and the session times out. In this case, capture fails. The capture thread will automatically try to rebuild. During the reconstruction process, PD will be connected several times, and some timeouts will be set within cdc. If this timeout is set within the cdc repository, it will only cause the capture thread to retry again. The reason for the restart is that capture attempts to reconnect PD and passes some operations, and reaches the PDClient.GetTs interface. This interface sets a timeout period by itself, which will return an error not handled by cdc. This error is thrown outside the capture thread, causing a restart at the server process level.

Why I'm not going to add this error to the error list of internal retries of the capture thread: Because it is actually difficult to get to this step, plus it also represents the connection problem between CDC and PD, at this time, directly restart the process and refresh the memory state may recover from the error faster. (The log also proves that the server recovered immediately after restart without causing any worse phenomena, so I think it is a good choice to keep the status quo)

@Lily2025 Thank you for your detailed feedback. I will review the code to determine if it's feasible.