pingcap / tiflash


tiflash deploy error #7784

Open calvin2021y opened 1 year ago

calvin2021y commented 1 year ago

1. Minimal reproduce step

Deploy 7.1 LTS with tiup; TiFlash is not able to start.

2. What did you expect to see?

I expect the service to start and work.

3. What did you see instead (Required)

Checking the logs, this is the last error:

[2023/07/10 13:54:46.512 +00:00] [INFO] [engine.rs:80] ["disabled pagestorage"]
[2023/07/10 13:54:49.775 +00:00] [FATAL] [lib.rs:497] ["called `Result::unwrap()` on an `Err` value: Os { code: 11, kind: WouldBlock, message: \"Resource temporarily unavailable\" }"] [backtrace="   0: tikv_util::set_panic_hook::{{closure}}\n   1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call\n             at /rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/alloc/src/boxed.rs:2032:9\n      std::panicking::rust_panic_with_hook\n             at /rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/std/src/panicking.rs:692:13\n   2: std::panicking::begin_panic_handler::{{closure}}\n             at /rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/std/src/panicking.rs:579:13\n   3: std::sys_common::backtrace::__rust_end_short_backtrace\n             at /rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/std/src/sys_common/backtrace.rs:137:18\n   4: rust_begin_unwind\n             at /rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/std/src/panicking.rs:575:5\n   5: core::panicking::panic_fmt\n             at /rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/core/src/panicking.rs:65:14\n   6: core::result::unwrap_failed\n             at /rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/core/src/result.rs:1791:5\n   7: tikv::read_pool::build_yatp_read_pool\n   8: proxy_server::run::TiKvServer<ER>::init_servers\n   9: proxy_server::run::run_tikv_proxy\n  10: proxy_server::proxy::run_proxy\n  11: _ZN2DB20RaftStoreProxyRunner20runRaftStoreProxyFFIEPv\n             at /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tics/dbms/src/Server/Server.cpp:519:9\n  12: start_thread\n  13: __clone3\n"] [location=/root/.cargo/git/checkouts/yatp-e704b73c3ee279b6/5523a9a/src/pool/builder.rs:116] [thread_name=<unnamed>]

4. What is your TiFlash version? (Required)

tiflash-v7.1.0-linux-amd64.tar.gz

calvin2021y commented 1 year ago

./scripts/run_tiflash.sh
sync ... 
real    0m0.003s
user    0m0.002s
sys     0m0.000s
ok

arg matches is ArgMatches { args: {"pd-endpoints": MatchedArg { occurs: 1, indices: [8, 9, 10], vals: ["pd1:2379", "pd2:2379", "pd3:2379"] }, "engine-addr": MatchedArg { occurs: 1, indices: [14], vals: ["tiflash2:3930"] }, "config": MatchedArg { occurs: 1, indices: [2], vals: ["/app/tidb/tidb-deploy/tiflash-9000/conf/tiflash-learner.toml"] }, "engine-label": MatchedArg { occurs: 1, indices: [12], vals: ["tiflash"] }, "engine-version": MatchedArg { occurs: 1, indices: [4], vals: ["v7.1.0"] }, "engine-git-hash": MatchedArg { occurs: 1, indices: [6], vals: ["cffc61e6ce008286d6ec5db2c6eb30c29bf065ec"] }}, subcommand: None, usage: Some("USAGE:\n    TiFlash Proxy [FLAGS] [OPTIONS] --engine-git-hash <engine-git-hash> --engine-label <engine-label> --engine-version <engine-version>") }

calvin2021y commented 1 year ago

I guess this could be because TiFlash ignores /sys/fs/cgroup/cpuset.cpus and nproc, and tries to use all the CPUs listed in /proc/cpuinfo.

Is there an option to set sched_setaffinity from the config file, or to disable it?

JaySon-Huang commented 1 year ago

It seems to be caused by an error thrown in build_yatp_read_pool. @CalvinNeo can you check whether there is a workaround?

calvin2021y commented 1 year ago

@CalvinNeo please feel free to let me know what I can do to help locate the problem.

CalvinNeo commented 1 year ago

Maybe it is because there are fewer CPUs than expected? We could try lowering the config values. This part of TiFlash is actually an embedded TiKV, so please feel free to check whether any TiKV config options can help. Then you can modify tiflash-learner.toml accordingly. /cc @calvin2021y @JaySon-Huang
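
As a rough illustration of "modify tiflash-learner.toml accordingly", the sketch below caps a read-pool thread count at the number of CPUs the process may run on and prints a TOML fragment. It rests on assumptions this thread does not confirm: that the embedded proxy honors TiKV's `readpool.unified.max-thread-count` setting, and that lowering it avoids the panic in `build_yatp_read_pool`. The helper names `suggested_read_pool_threads` and `render_toml` are hypothetical.

```python
import os


def suggested_read_pool_threads(usable_cpus, minimum=1):
    """Hypothetical helper: never suggest more threads than usable CPUs."""
    return max(minimum, usable_cpus)


def render_toml(threads):
    """Render a fragment for tiflash-learner.toml.

    Assumption: the key below is TiKV's unified read pool setting and
    is read by the embedded proxy as well.
    """
    return "[readpool.unified]\nmax-thread-count = {}\n".format(threads)


if __name__ == "__main__":
    # sched_getaffinity is Linux-only; it reflects cgroup cpuset limits.
    if hasattr(os, "sched_getaffinity"):
        usable = len(os.sched_getaffinity(0))
    else:
        usable = os.cpu_count() or 1
    print(render_toml(suggested_read_pool_threads(usable)))
```

The printed fragment would need to be merged into the existing tiflash-learner.toml by hand, and TiFlash restarted, to see whether it makes any difference.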

calvin2021y commented 1 year ago

Sorry for the late reply. We limit the CPUs with a cgroup cpuset. nproc shows the correct CPU count, but lscpu shows many more CPUs than the process is able to use.
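
The nproc/lscpu mismatch can be checked from inside the restricted environment with a few lines of Python. This is only a diagnostic sketch, not something TiFlash itself runs: `os.cpu_count()` reports the logical CPUs the system exposes (roughly what lscpu shows), while `os.sched_getaffinity(0)` returns the set of CPUs the current process is allowed to run on (what nproc reports). The function name `cpu_counts` is made up for this example.

```python
import os


def cpu_counts():
    """Return (total_logical_cpus, cpus_usable_by_this_process)."""
    total = os.cpu_count() or 1
    # sched_getaffinity exists only on Linux; elsewhere assume no restriction.
    if hasattr(os, "sched_getaffinity"):
        usable = len(os.sched_getaffinity(0))
    else:
        usable = total
    return total, usable


if __name__ == "__main__":
    total, usable = cpu_counts()
    print(f"logical CPUs on host: {total}")
    print(f"CPUs this process may run on: {usable}")
    if usable < total:
        print("CPU affinity is restricted (for example by a cgroup cpuset)")
```

If the second number is much smaller than the first, any component that sizes its thread pools from the host-wide count will start more threads than the cpuset can actually serve.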