pingcap / tiflow

This repo maintains DM (a data migration platform) and TiCDC (change data capture for TiDB)
Apache License 2.0
425 stars 283 forks source link

Exec command timeout on cdc node sometimes #9258

Open fubinzh opened 1 year ago

fubinzh commented 1 year ago

What did you do?

Run cdc redo log system test cases, for each case basically below steps included:

  1. create changefeed with redo log enabled
  2. run workload
  3. run redo apply
  4. remove changefeed

What did you expect to see?

execute command on cdc node should not timeout

What did you see instead?

Execute command on TiCDC seen timeout error seen sometimes:

example 1: execute echo commend timeout

2023-06-18T21:52:20.768Z        INFO    host/ticdc.go:283       Execute command on ticdc        {"command": "echo -e >/tmp/changefeed.toml \"[cons
istent] \n\"\"level = \\\"eventual\\\"\nstorage = \\\"s3://tmp/test-infra-redolog/redo-apply-cdc-all-node-restart-sync-bee28932-a8f8-4289-bd09-9d1
37926dcad?access-key=minioadmin&secret-access-key=minioadmin&endpoint=http://172.16.6.5:9000&force-path-style=true\\\"\nmax-log-size = 64 \n\"", "
timeout": "10s", "ticdc peer": 0}
2023-06-18T21:52:20.768Z        INFO    k8s/client.go:132       it should be noted that a long-running command will not be interrupted even the us
e case has ended. For more information, please refer to https://github.com/pingcap/test-infra/discussions/129
[2023/06/18 21:52:30.770 +00:00] [PANIC] [main.go:114] ["run case failed"] [error="Exec command timeout"] [errorVerbose="Exec command timeout\ngit
hub.com/pingcap/test-infra/caselib/pkg/host.init\n\tgithub.com/pingcap/test-infra/caselib/pkg/host/host.go:16\nruntime.doInit\n\truntime/proc.go:6
506\nruntime.doInit\n\truntime/proc.go:6483\nruntime.doInit\n\truntime/proc.go:6483\nruntime.main\n\truntime/proc.go:233\nruntime.goexit\n\truntim
e/asm_amd64.s:1598"] [stack="main.main\n\t./main.go:114\nruntime.main\n\truntime/proc.go:250"]
panic: run case failed

example 2: execute changefeed remove command timeout

2023-06-18T22:42:45.719Z        INFO    host/ticdc.go:299       Execute command on ticdc        {"command": "/cdc version", "ticdc peer": 0}
2023-06-18T22:42:45.719Z        INFO    k8s/client.go:132       it should be noted that a long-running command will not be interrupted even the us
e case has ended. For more information, please refer to https://github.com/pingcap/test-infra/discussions/129
[2023/06/18 22:43:04.528 +00:00] [INFO] [ticdc.go:199] ["get TiCDC version"] [cdcVersion=v7.2.0-alpha]
2023-06-18T22:43:04.528Z        INFO    host/ticdc.go:283       Execute command on ticdc        {"command": "/cdc  cli  changefeed  remove \"--ser
ver=127.0.0.1:8301\" \"--changefeed-id=redo-enable-cdc-all-node-restart-sync\"", "timeout": "10s", "ticdc peer": 0}
2023-06-18T22:43:04.529Z        INFO    k8s/client.go:132       it should be noted that a long-running command will not be interrupted even the us
e case has ended. For more information, please refer to https://github.com/pingcap/test-infra/discussions/129
[2023/06/18 22:43:14.532 +00:00] [ERROR] [ticdc.go:72] ["run changefeed operation timeout"] [stack="github.com/pingcap/test-infra/caselib/pkg/host
.(*TiCDCHost).ChangeFeed\n\tgithub.com/pingcap/test-infra/caselib/pkg/host/ticdc.go:72\ngithub.com/pingcap/test-infra/caselib/pkg/steps.(*changeFe
edTask).Execute\n\tgithub.com/pingcap/test-infra/caselib/pkg/steps/cdc.go:80\ngithub.com/pingcap/test-infra/caselib/pkg/steps.(*Serial).Execute\n\
tgithub.com/pingcap/test-infra/caselib/pkg/steps/step.go:31\nmain.main\n\t./main.go:113\nruntime.main\n\truntime/proc.go:250"]
[2023/06/18 22:43:14.532 +00:00] [INFO] [cdc.go:81] ["run changeFeed task finished"] [result=null] [error="Exec command timeout"] [errorVerbose="E
xec command timeout\ngithub.com/pingcap/test-infra/caselib/pkg/host.init\n\tgithub.com/pingcap/test-infra/caselib/pkg/host/host.go:16\nruntime.doI
nit\n\truntime/proc.go:6506\nruntime.doInit\n\truntime/proc.go:6483\nruntime.doInit\n\truntime/proc.go:6483\nruntime.main\n\truntime/proc.go:233\n
runtime.goexit\n\truntime/asm_amd64.s:1598"]

Versions of the cluster

TiCDC version (execute cdc version):

["Welcome to Change Data Capture (CDC)"] [release-version=v7.2.0-alpha] [git-hash=5c368b52c0de831c0b9dd68ff66874d5a37f74ca] [git-branch=heads/refs/tags/v7.2.0-alpha] [utc-build-time="2023-06-16 11:35:10"] [go-version="go version go1.20.5 linux/amd64"] [failpoint-build=false]
fubinzh commented 1 year ago

/found automation

fubinzh commented 1 year ago

/severity moderate

fubinzh commented 1 year ago

also sometime exec timeout seen when creating changefeed:

[2023/07/04 00:41:18.905 +00:00] [INFO] [ticdc.go:201] ["get TiCDC version"] [cdcVersion=v7.3.0-alpha]
2023-07-04T00:41:18.905Z    INFO    1815723 host/ticdc.go:301   Execute command on ticdc    {"command": "/cdc  cli  changefeed  create \"--server=127.0.0.1:8301\" \"--sink-uri=mysql://root:@downstream.cdc-testbed-tps-1815723-1-93:3306\" \"--changefeed-id=cdc-scale-out-in\" \"--no-confirm\"", "timeout": "10s", "ticdc peer": 0}
2023-07-04T00:41:18.905Z    INFO    k8s/client.go:132   it should be noted that a long-running command will not be interrupted even the use case has ended. For more information, please refer to https://github.com/pingcap/test-infra/discussions/129
[2023/07/04 00:41:28.906 +00:00] [ERROR] [ticdc.go:72] ["run changefeed operation timeout"] [stack="github.com/pingcap/test-infra/caselib/pkg/host.(*TiCDCHost).ChangeFeed\n\tgithub.com/pingcap/test-infra/caselib/pkg/host/ticdc.go:72\ngithub.com/pingcap/test-infra/caselib/pkg/steps.(*changeFeedTask).Execute\n\tgithub.com/pingcap/test-infra/caselib/pkg/steps/cdc.go:80\ngithub.com/pingcap/test-infra/caselib/pkg/steps.withRecover\n\tgithub.com/pingcap/test-infra/caselib/pkg/steps/step.go:24\ngithub.com/pingcap/test-infra/caselib/pkg/steps.(*Serial).Execute\n\tgithub.com/pingcap/test-infra/caselib/pkg/steps/step.go:45\nmain.main\n\ttest-infra/caselib/ticdc/main.go:112\nruntime.main\n\truntime/proc.go:250"]
[2023/07/04 00:41:28.906 +00:00] [INFO] [cdc.go:81] ["run changeFeed task finished"] [result=null] [error="Exec command timeout"] [errorVerbose="Exec command timeout\ngithub.com/pingcap/test-infra/caselib/pkg/host.init\n\tgithub.com/pingcap/test-infra/caselib/pkg/host/host.go:16\nruntime.doInit\n\truntime/proc.go:6506\nruntime.doInit\n\truntime/proc.go:6483\nruntime.doInit\n\truntime/proc.go:6483\nruntime.main\n\truntime/proc.go:233\nruntime.goexit\n\truntime/asm_amd64.s:1598"]
[2023/07/04 00:41:28.906 +00:00] [PANIC] [main.go:113] ["run case failed"] [error="Exec command timeout"] [errorVerbose="Exec command timeout\ngithub.com/pingcap/test-infra/caselib/pkg/host.init\n\tgithub.com/pingcap/test-infra/caselib/pkg/host/host.go:16\nruntime.doInit\n\truntime/proc.go:6506\nruntime.doInit\n\truntime/proc.go:6483\nruntime.doInit\n\truntime/proc.go:6483\nruntime.main\n\truntime/proc.go:233\nruntime.goexit\n\truntime/asm_amd64.s:1598"] [case=cdc_scale_sync] [stack="main.main\n\ttest-infra/caselib/ticdc/main.go:113\nruntime.main\n\truntime/proc.go:250"]