pingcap / tiflash

The analytical engine for TiDB and TiDB Cloud. Try free: https://tidbcloud.com/free-trial
https://docs.pingcap.com/tidb/stable/tiflash-overview

All TiFlash WN (write node) instances OOM after a 10-minute network partition on one TiFlash compute node recovers #8007

Closed Lily2025 closed 1 year ago

Lily2025 commented 1 year ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

(1) Run the CH workload with go-tpc:

go-tpc ch run -D tpcc --host tc-tidb.ha-test-serverless-htap-tps-2100173-1-190 -P4000 --warehouses 2000 -T 32 --acThreads 1 --queries q22 --ignore-error '2013,1213,1105,1205,8022,8028,9004,9007,1062' --time 36000m --user keyspace_a.root --password '' --interval '10s'

(2) Inject a network partition between one TiFlash compute node (secondary-tc-tiflash-1) and all other nodes, lasting 10 minutes:

[2023/08/23 08:02:37.742 +08:00] [INFO] [chaos.go:172] ["Run chaos"] [name=network_partition] [selectors="[ha-test-serverless-htap-tps-2100173-1-190/secondary-tc-tiflash-1]"] [SelectorsRetainPolicy(selectors)="[ha-test-serverless-htap-tps-2100173-1-190/secondary-tc-tiflash-1]"] [targetSelectors="[ha-test-serverless-htap-tps-2100173-1-190/tc-tidb-0,ha-test-serverless-htap-tps-2100173-1-190/tc-tidb-1,ha-test-serverless-htap-tps-2100173-1-190/tc-pd-0,ha-test-serverless-htap-tps-2100173-1-190/tc-pd-1,ha-test-serverless-htap-tps-2100173-1-190/tc-pd-2,ha-test-serverless-htap-tps-2100173-1-190/tc-tikv-0,ha-test-serverless-htap-tps-2100173-1-190/tc-tikv-1,ha-test-serverless-htap-tps-2100173-1-190/tc-tikv-2,ha-test-serverless-htap-tps-2100173-1-190/tc-tikv-3,ha-test-serverless-htap-tps-2100173-1-190/tc-tiflash-0,ha-test-serverless-htap-tps-2100173-1-190/tc-tiflash-1]"] [TargetSelectorsRetainPolicy(targetSelectors)="[ha-test-serverless-htap-tps-2100173-1-190/tc-tidb-0,ha-test-serverless-htap-tps-2100173-1-190/tc-tidb-1,ha-test-serverless-htap-tps-2100173-1-190/tc-pd-0,ha-test-serverless-htap-tps-2100173-1-190/tc-pd-1,ha-test-serverless-htap-tps-2100173-1-190/tc-pd-2,ha-test-serverless-htap-tps-2100173-1-190/tc-tikv-0,ha-test-serverless-htap-tps-2100173-1-190/tc-tikv-1,ha-test-serverless-htap-tps-2100173-1-190/tc-tikv-2,ha-test-serverless-htap-tps-2100173-1-190/tc-tikv-3,ha-test-serverless-htap-tps-2100173-1-190/tc-tiflash-0,ha-test-serverless-htap-tps-2100173-1-190/tc-tiflash-1]"] [experimentSpec="NetworkPartitionSpec{Duration: \"\", Direction: , Scheduler: }"]
[2023/08/23 08:12:37.816 +08:00] [INFO] [chaos.go:239] ["Clean chaos"] [name=network_partition]
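
To double-check how long the fault was actually active, the two chaos log timestamps can be subtracted. The snippet below is only a sketch: the helper names (parse_log_time, partition_seconds) are made up for illustration and are not part of any TiFlash or chaos tooling, and it assumes the timestamp layout seen in the "Run chaos" and "Clean chaos" lines.

```python
from datetime import datetime

# Timestamp layout assumed from the chaos log lines in this report,
# e.g. "2023/08/23 08:02:37.742 +08:00".
LOG_TIME_FORMAT = "%Y/%m/%d %H:%M:%S.%f %z"


def parse_log_time(raw: str) -> datetime:
    """Parse one bracketed log timestamp (brackets already stripped)."""
    return datetime.strptime(raw, LOG_TIME_FORMAT)


def partition_seconds(start_raw: str, end_raw: str) -> float:
    """Return the elapsed seconds between the start and end timestamps."""
    return (parse_log_time(end_raw) - parse_log_time(start_raw)).total_seconds()


# Values copied from the "Run chaos" and "Clean chaos" log lines.
run_chaos = "2023/08/23 08:02:37.742 +08:00"
clean_chaos = "2023/08/23 08:12:37.816 +08:00"

duration = partition_seconds(run_chaos, clean_chaos)
print(f"network partition lasted {duration:.3f} s")
```

With the timestamps from this report the difference comes out to roughly 600 seconds, which matches the stated 10 minutes.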

2. What did you expect to see? (Required)

No TiFlash node should OOM.

3. What did you see instead (Required)

All TiFlash WN nodes hit OOM after the fault recovered.
ae8ebec0-b24a-43ea-b711-2a2b8708cd35

4. What is your TiFlash version? (Required)

[type=tiflash_compute] [instance=secondary-tc-tiflash-1] [version=7.1.0-alpha-307-g69d93fe] [git_hash=69d93feac85a207797ccca56dd8e37c38ed46aa2]
[type=tiflash_compute] [instance=secondary-tc-tiflash-0] [version=7.1.0-alpha-307-g69d93fe] [git_hash=69d93feac85a207797ccca56dd8e37c38ed46aa2]
[type=tiflash] [instance=tc-tiflash-1] [version=7.1.0-alpha-307-g69d93fe] [git_hash=69d93feac85a207797ccca56dd8e37c38ed46aa2]
[type=tiflash] [instance=tc-tiflash-0] [version=7.1.0-alpha-307-g69d93fe] [git_hash=69d93feac85a207797ccca56dd8e37c38ed46aa2]

Lily2025 commented 1 year ago

/type bug
/severity critical
/assign JinheLin