tikv / raft-engine

A persistent storage engine for Multi-Raft log
Apache License 2.0

Problem while upgrading TiKV to v5.1.0 and enabling raft-engine #75

Closed Itachi666 closed 2 years ago

Itachi666 commented 3 years ago

hi,

I'm trying to upgrade my TiKV cluster to v5.1.0 to test raft-engine's performance, but I ran into some problems.

My old cluster uses two disks to store raftdb and rocksdb separately (another disk, A, is mounted at /var/lib/tikv/raft). But when I tried the upgrade, I found that raft-engine uses the /var/lib/tikv/raft-engine directory. If I remount disk A at /var/lib/tikv/raft-engine, TiKV cannot dump the old raftdb logs.

So my question is: how can I upgrade my cluster so that I can use raft-engine with two disks and not lose my old Raft logs?

Thx~

Itachi666 commented 3 years ago

Besides, I also tried upgrading TiKV from raftdb to raft-engine using just one disk, but got a FATAL "Device or resource busy" error after raft-engine dumped the old Raft logs. It seems TiKV cannot remove the old log files?

[2021/07/02 03:42:22.027 +00:00] [INFO] [engine.rs:437] ["Recovered raft log Append.265."]
[2021/07/02 03:42:22.029 +00:00] [INFO] [engine.rs:479] ["Recover raft log takes 17.234144924s"]
[2021/07/02 03:42:22.031 +00:00] [INFO] [raft_engine_switch.rs:244] ["Start to scan raft log from RocksEngine and dump into RaftLogEngine"]
[2021/07/02 03:42:22.031 +00:00] [INFO] [raft_engine_switch.rs:253] ["Scanned all region id and waiting for dump"]
[2021/07/02 03:42:23.185 +00:00] [INFO] [raft_engine_switch.rs:258] ["Finished dump, total regions: 218; Total bytes: 463632899; Consumed time: 1.15466225s"]
[2021/07/02 03:42:23.686 +00:00] [FATAL] [lib.rs:462] ["called `Result::unwrap()` on an `Err` value: Os { code: 16, kind: Other, message: \"Device or resource busy\" }"] [backtrace="stack backtrace:\n   0: tikv_util::set_panic_hook::{{closure}}\n             at components/tikv_util/src/lib.rs:461\n   1: std::panicking::rust_panic_with_hook\n             at library/std/src/panicking.rs:595\n   2: std::panicking::begin_panic_handler::{{closure}}\n             at library/std/src/panicking.rs:497\n   3: std::sys_common::backtrace::__rust_end_short_backtrace\n             at library/std/src/sys_common/backtrace.rs:141\n   4: rust_begin_unwind\n             at library/std/src/panicking.rs:493\n   5: core::panicking::panic_fmt\n             at library/core/src/panicking.rs:92\n   6: core::result::unwrap_failed\n             at library/core/src/result.rs:1355\n   7: core::result::Result<T,E>::unwrap\n             at rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/result.rs:1037\n      server::raft_engine_switch::rename_to_tmp_dir\n             at components/server/src/raft_engine_switch.rs:37\n   8: server::raft_engine_switch::check_and_dump_raft_engine\n             at components/server/src/raft_engine_switch.rs:265\n      server::server::TiKVServer<engine_rocks::engine::RocksEngine>::init_raw_engines\n             at components/server/src/server.rs:1130\n      server::server::run_tikv\n             at components/server/src/server.rs:152\n   9: tikv_server::main\n             at cmd/tikv-server/src/main.rs:181\n  10: core::ops::function::FnOnce::call_once\n             at rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/ops/function.rs:227\n      std::sys_common::backtrace::__rust_begin_short_backtrace\n             at rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/std/src/sys_common/backtrace.rs:125\n  11: main\n  12: __libc_start_main\n  13: <unknown>\n"] [location=components/server/src/raft_engine_switch.rs:37] [thread_name=main]
tabokie commented 3 years ago

@Itachi666 Hi, so sorry for the late response. The first issue is expected: by default raft-engine uses data-dir/raft-engine, regardless of where raftdb is placed. Later on we might improve this by creating a sibling directory next to raftdb. For now, you can set the raft-engine directory with the raft-engine.dir config option (it's not listed in the config template).
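As a rough sketch of what that setting could look like in the TiKV config file: the raft-engine.dir key is the one named above, while the path is only a placeholder for a directory on the second disk, and the enable key is an assumption about how raft-engine is switched on in this version.

```toml
# Sketch only; check key names against the config reference for your TiKV version.
[raft-engine]
# Assumption: raft-engine is turned on with this flag.
enable = true
# Placeholder path: a directory on the disk that should hold the Raft logs.
dir = "/mnt/disk-a/raft-engine"
```

The directory presumably needs to be writable by the TiKV process and placed somewhere other than the old raftdb directory, since TiKV still has to read the old logs from there during the dump.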

For the second issue, it looks like you opened two TiKV instances simultaneously and their operations on the same files conflicted. I suspect this because the log message "Start to scan raft log from RocksEngine and dump into RaftLogEngine" means logs are being transferred from RocksEngine to RaftEngine, while check_and_dump_raft_engine in the panic stack trace suggests a transfer in the opposite direction.