pingcap / tiflash

The analytical engine for TiDB and TiDB Cloud. Try free: https://tidbcloud.com/free-trial
https://docs.pingcap.com/tidb/stable/tiflash-overview
Apache License 2.0
945 stars 409 forks source link

restart tiflash during flashback, tiflash fail with "Code: 49, e.displayText() = DB::Exception: Found existing key in hex: " #6379

Closed seiya-annie closed 1 year ago

seiya-annie commented 1 year ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. tpch prepare
  2. set tiflash replica for all tables;
  3. tpch run workload
  4. execute flashback and during this restart tikv/pd/tiflash

2. What did you expect to see? (Required)

flashback successfully tidb cluster work normal

3. What did you see instead (Required)

tiflash restart fail

[2022/11/29 12:38:55.886 +08:00] [ERROR] [Exception.cpp:89] ["Code: 49, e.displayText() = DB::Exception: Found existing key in hex: 7480000000000001FFC95F720380000000FF0000001903800000FF0000008DEE000000FCF9ECF9619EE3FFFB, e.what() = DB::Exception, Stack trace:\n\n\n       0x160c89e\tDB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+23120030]\n                \tdbms/src/Common/Exception.h:46\n       0x5f289dc\tDB::RegionCFDataBase<DB::RegionDefaultCFDataTrait>::insert(DB::StringObject<true>&&, DB::StringObject<false>&&) [tiflash+99781084]\n                \tdbms/src/Storages/Transaction/RegionCFDataBase.cpp:50\n       0x5f20a01\tauto DB::Region::handleWriteRaftCmd(DB::WriteCmdsView const&, unsigned long, unsigned long, DB::TMTContext&)::$_12::operator()<unsigned long>(unsigned long) const [tiflash+99748353]\n                \tdbms/src/Storages/Transaction/Region.cpp:650\n       0x5f1fa32\tDB::Region::handleWriteRaftCmd(DB::WriteCmdsView const&, unsigned long, unsigned long, DB::TMTContext&) [tiflash+99744306]\n                \tdbms/src/Storages/Transaction/Region.cpp:716\n       0x5ee536e\tDB::KVStore::handleWriteRaftCmd(DB::WriteCmdsView const&, unsigned long, unsigned long, unsigned long, DB::TMTContext&) [tiflash+99505006]\n                \tdbms/src/Storages/Transaction/KVStore.cpp:293\n       0x5f048b5\tHandleWriteRaftCmd [tiflash+99633333]\n                \tdbms/src/Storages/Transaction/ProxyFFI.cpp:95\n  0x7f887ef02f2d\t_$LT$engine_store_ffi..observer..TiFlashObserver$u20$as$u20$raftstore..coprocessor..QueryObserver$GT$::post_exec_query::h1a8a961972b8ccf8 [libtiflash_proxy.so+17366829]\n  0x7f887fdd000d\traftstore::store::fsm::apply::ApplyDelegate$LT$EK$GT$::apply_raft_cmd::h0b08541bf9853547 [libtiflash_proxy.so+32886797]\n  0x7f887fde7545\traftstore::store::fsm::apply::ApplyDelegate$LT$EK$GT$::process_raft_cmd::ha4bde57f9f25a6ab [libtiflash_proxy.so+32982341]\n  0x7f887fdeca14\traftstore::store::fsm::apply::ApplyDelegate$LT$EK$GT$::handle_raft_committed_entries::h88c3ac142c377f65 [libtiflash_proxy.so+33004052]\n  0x7f887fdc200c\traftstore::store::fsm::apply::ApplyFsm$LT$EK$GT$::handle_apply::hf865ae2f9856a0b6 [libtiflash_proxy.so+32829452]\n  0x7f887fdc5628\traftstore::store::fsm::apply::ApplyFsm$LT$EK$GT$::handle_tasks::h91ceadd162e4f65b [libtiflash_proxy.so+32843304]\n  0x7f887f49cf5c\t_$LT$raftstore..store..fsm..apply..ApplyPoller$LT$EK$GT$$u20$as$u20$batch_system..batch..PollHandler$LT$raftstore..store..fsm..apply..ApplyFsm$LT$EK$GT$$C$raftstore..store..fsm..apply..ControlFsm$GT$$GT$::handle_normal::hed686c0d358455a2 [libtiflash_proxy.so+23240540]\n  0x7f887f43c723\tbatch_system::batch::Poller$LT$N$C$C$C$Handler$GT$::poll::h2c84599628ed3b21 [libtiflash_proxy.so+22845219]\n  0x7f887f4eaad2\tstd::sys_common::backtrace::__rust_begin_short_backtrace::h385c7c98d4548244 [libtiflash_proxy.so+23558866]\n  0x7f887f52db4e\tcore::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h626760f3e1255bb8 [libtiflash_proxy.so+23833422]\n  0x7f8880475c75\tstd::sys::unix::thread::Thread::new::thread_start::hd2791a9cabec1fda [libtiflash_proxy.so+39857269]\n                \t/rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/std/src/sys/unix/thread.rs:108\n  0x7f887dc5eea5\tstart_thread [libpthread.so.0+32421]\n  0x7f887d063b0d\tclone [libc.so.6+1043213]"] [source="DB::EngineStoreApplyRes DB::HandleWriteRaftCmd(const DB::EngineStoreServerWrap *, DB::WriteCmdsView, DB::RaftCmdHeader)"] [thread_id=74]

4. What is your TiFlash version? (Required)

2022/11/29 12:32:08.907 +08:00] [INFO] [<unknown>] ["Starting daemon with revision 54381"] [thread_id=1]
[2022/11/29 12:32:08.907 +08:00] [INFO] [<unknown>] ["TiFlash build info: TiFlash\nRelease Version: v6.5.0-alpha\nEdition:         Community\nGit Commit Hash: f1212364f1f77bd8b82e93010873baf7cd57489a\nGit Branch:      heads/refs/tags/v6.5.0-alpha\nUTC Build Time:  2022-11-22 14:26:53\nEnable Features: jemalloc sm4(GmSSL) avx2 avx512 unwind thinlto\nProfile:         RELWITHDEBINFO\n"] [thread_id=1]
[2022/11/29 12:32:08.907 +08:00] [INFO] [<unknown>] ["starting up"] [source=Application] [thread_id=1]
seiya-annie commented 1 year ago

tiflash.log.gz

CalvinNeo commented 1 year ago

TiFlash rely on "exact once" semantic of put keys. e.g. If we already proposed a Put with key of 7480000000000001FFC95F720380000000FF0000001903800000FF0000008DEE000000FCF9ECF9619EE3FFFB, then a later Put with the same key will raise the above error.

The issue is occurred since a recent prepare flashback modification has introduced some retrying mechanism which breaks the "exact once" assumption. According to @JmPotato.

@lilinghai Is it possible we add check for the "exact once" semantic to tiflash check on tikv CI?

JmPotato commented 1 year ago

This is introduced and fixed by https://github.com/tikv/tikv/pull/13823, there is a corner case that will cause a key in DEFAULT_CF to be written twice during the multiple flashback retries.

lilinghai commented 1 year ago

TiFlash rely on "exact once" semantic of put keys. e.g. If we already proposed a Put with key of 7480000000000001FFC95F720380000000FF0000001903800000FF0000008DEE000000FCF9ECF9619EE3FFFB, then a later Put with the same key will raise the above error.

The issue is occurred since a recent prepare flashback modification has introduced some retrying mechanism which breaks the "exact once" assumption. According to @JmPotato.

@lilinghai Is it possible we add check for the "exact once" semantic to tiflash check on tikv CI?

if it is a simple e2e test , we could add the test in tikv merge ci

JmPotato commented 1 year ago

if it is a simple e2e test , we could add the test in tikv merge ci

This case could be covered by the unit test in https://github.com/tikv/tikv/pull/13823.

seiya-annie commented 1 year ago

load latest tikv patch, and run the testcase again, tiflash still can't start after restart during flashback. currently error message is :

[2022/11/29 23:40:23.354 +08:00] [ERROR] [Exception.cpp:89] ["Code: 49, e.displayText() = DB::Exception: cur_offset does not equal to offset, e.what() = DB::Exception, Stack trace:\n\n\n       0x674ab1b\tDB::DM::DeltaMergeStore::write(DB::Context const&, DB::Settings const&, DB::Block&) [tiflash+108309275]\n                \tdbms/src/Storages/DeltaMerge/DeltaMergeStore.cpp:564\n       0x693a23f\tDB::writeRegionDataToStorage(DB::Context&, DB::RegionPtrWithBlock const&, std::__1::vector<std::__1::tuple<DB::RawTiDBPK, unsigned char, unsigned long, std::__1::shared_ptr<DB::StringObject<false> const> >, std::__1::allocator<std::__1::tuple<DB::RawTiDBPK, unsigned char, unsigned long, std::__1::shared_ptr<DB::StringObject<false> const> > > >&, std::__1::shared_ptr<DB::Logger> const&)::$_0::operator()(bool) const [tiflash+110338623]\n                \tdbms/src/Storages/Transaction/PartitionStreams.cpp:148\n       0x6934fe6\tDB::writeRegionDataToStorage(DB::Context&, DB::RegionPtrWithBlock const&, std::__1::vector<std::__1::tuple<DB::RawTiDBPK, unsigned char, unsigned long, std::__1::shared_ptr<DB::StringObject<false> const> >, std::__1::allocator<std::__1::tuple<DB::RawTiDBPK, unsigned char, unsigned long, std::__1::shared_ptr<DB::StringObject<false> const> > > >&, std::__1::shared_ptr<DB::Logger> const&) [tiflash+110317542]\n                \tdbms/src/Storages/Transaction/PartitionStreams.cpp:181\n       0x6934959\tDB::RegionTable::writeBlockByRegion(DB::Context&, DB::RegionPtrWithBlock const&, std::__1::vector<std::__1::tuple<DB::RawTiDBPK, unsigned char, unsigned long, std::__1::shared_ptr<DB::StringObject<false> const> >, std::__1::allocator<std::__1::tuple<DB::RawTiDBPK, unsigned char, unsigned long, std::__1::shared_ptr<DB::StringObject<false> const> > > >&, std::__1::shared_ptr<DB::Logger> const&, bool) [tiflash+110315865]\n                \tdbms/src/Storages/Transaction/PartitionStreams.cpp:359\n       0x69593bd\tDB::Region::handleWriteRaftCmd(DB::WriteCmdsView const&, unsigned long, unsigned long, DB::TMTContext&) [tiflash+110465981]\n                \tdbms/src/Storages/Transaction/Region.cpp:723\n       0x691ebfe\tDB::KVStore::handleWriteRaftCmd(DB::WriteCmdsView const&, unsigned long, unsigned long, unsigned long, DB::TMTContext&) [tiflash+110226430]\n                \tdbms/src/Storages/Transaction/KVStore.cpp:293\n       0x693e145\tHandleWriteRaftCmd [tiflash+110354757]\n                \tdbms/src/Storages/Transaction/ProxyFFI.cpp:95\n  0x7f523bcc9f2d\t_$LT$engine_store_ffi..observer..TiFlashObserver$u20$as$u20$raftstore..coprocessor..QueryObserver$GT$::post_exec_query::h1a8a961972b8ccf8 [[libtiflash_proxy.so](http://libtiflash_proxy.so/)+17366829]\n  0x7f523cb9700d\traftstore::store::fsm::apply::ApplyDelegate$LT$EK$GT$::apply_raft_cmd::h0b08541bf9853547 [[libtiflash_proxy.so](http://libtiflash_proxy.so/)+32886797]\n  0x7f523cbae545\traftstore::store::fsm::apply::ApplyDelegate$LT$EK$GT$::process_raft_cmd::ha4bde57f9f25a6ab [[libtiflash_proxy.so](http://libtiflash_proxy.so/)+32982341]\n  0x7f523cbb3a14\traftstore::store::fsm::apply::ApplyDelegate$LT$EK$GT$::handle_raft_committed_entries::h88c3ac142c377f65 [[libtiflash_proxy.so](http://libtiflash_proxy.so/)+33004052]\n  0x7f523cb8900c\traftstore::store::fsm::apply::ApplyFsm$LT$EK$GT$::handle_apply::hf865ae2f9856a0b6 [[libtiflash_proxy.so](http://libtiflash_proxy.so/)+32829452]\n  0x7f523cb8c628\traftstore::store::fsm::apply::ApplyFsm$LT$EK$GT$::handle_tasks::h91ceadd162e4f65b [[libtiflash_proxy.so](http://libtiflash_proxy.so/)+32843304]\n  0x7f523c263f5c\t_$LT$raftstore..store..fsm..apply..ApplyPoller$LT$EK$GT$$u20$as$u20$batch_system..batch..PollHandler$LT$raftstore..store..fsm..apply..ApplyFsm$LT$EK$GT$$C$raftstore..store..fsm..apply..ControlFsm$GT$$GT$::handle_normal::hed686c0d358455a2 [[libtiflash_proxy.so](http://libtiflash_proxy.so/)+23240540]\n  0x7f523c203b03\tbatch_system::batch::Poller$LT$N$C$C$C$Handler$GT$::poll::h2c84599628ed3b21 [[libtiflash_proxy.so](http://libtiflash_proxy.so/)+22846211]\n  0x7f523c2b1ad2\tstd::sys_common::backtrace::__rust_begin_short_backtrace::h385c7c98d4548244 [[libtiflash_proxy.so](http://libtiflash_proxy.so/)+23558866]\n  0x7f523c2f4b4e\tcore::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h626760f3e1255bb8 [[libtiflash_proxy.so](http://libtiflash_proxy.so/)+23833422]\n  0x7f523d23cc75\tstd::sys::unix::thread::Thread::new::thread_start::hd2791a9cabec1fda [[libtiflash_proxy.so](http://libtiflash_proxy.so/)+39857269]\n                \t/rustc/96ddd32c4bfb1d78f0cd03eb068b1710a8cebeef/library/std/src/sys/unix/[thread.rs:108](http://thread.rs:108/)\n  0x7f523aa25ea5\tstart_thread [libpthread.so.0+32421]\n  0x7f5239e2ab0d\tclone [libc.so.6+1043213]"] [source="DB::EngineStoreApplyRes DB::HandleWriteRaftCmd(const DB::EngineStoreServerWrap *, DB::WriteCmdsView, DB::RaftCmdHeader)"] [thread_id=74]
CalvinNeo commented 1 year ago

The following key will decoded into an empty pk, which will make TiFlash panic.

key 7480000000000001FFBD5F720000000000FAF9ECEFDC3207FFFC
val 4486809092ACFEC38906
CalvinNeo commented 1 year ago

See https://github.com/tikv/tikv/issues/13861 This is not a TiFlash issue