Open lilinghai opened 3 years ago
There is a chance that acquiring read lock will be blocked by another read lock for current implementation on RWLock
. I guess some threads holding a read lock on the table lock for a long time. And it blocks others threads from acquiring read lock under some situation.
Background:
'.' for waiting
'-' for holding the lock
drop_lock
q1: |--r1---------|
q2: |..w1..|
^-(timeout and released)
q3: |..r2....---------|
q4: |...-r3--|
^-(r2,r3 will be blocked until r1 released)
Or check out the test case here: https://github.com/JaySon-Huang/tics/commit/4b76143a0b2cb0f3c566bf7bae5fa47e106a631a#diff-821b263f17d8d71ab0412e967ddb61dc9480c20db316cc7714b2affe1d5c4c74R376-R453
TL;DR, the thread of removeObsoleteDataInStorage
and shouldOptimizeTable
just like "q3" and "q4" on the previous comment. Both of them are waiting on another read group("q1") to release the read lock. But I can not print out which thread is holding that read lock.
# Try to acquire read lock on "drop_lock"(0x7f832e69a080), without timeout
# not sure which thread hold a read lock on "drop_lock"
Thread 231 (Thread 0x7f833d7f5700 (LWP 218899)):
...
#3 0x00000000079c2e33 in wait<DB::RWLock::getLock(DB::RWLock::Type, const String&, const milliseconds&)::<lambda()> > (__p=..., __lock=..., this=0x7f8352e749c0) at /usr/local/include/c++/7.4.0/condition_variable:99
#4 DB::RWLock::getLock (this=0x7f832e69a080, type=type@entry=DB::RWLock::Read, query_id=..., lock_timeout_ms=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Common/RWLock.cpp:170
#5 0x00000000070a59f4 in DB::IStorage::tryLockTimed (this=this@entry=0x7f832e631010, rwlock=..., type=type@entry=DB::RWLock::Read, query_id=..., acquire_timeout=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/IStorage.cpp:17
#6 0x00000000070a5fcb in DB::IStorage::lockStructureForShare (this=0x7f832e631010, query_id=..., acquire_timeout=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/IStorage.cpp:58
#7 0x000000000732c0b7 in DB::shouldOptimizeTable (table_id=table_id@entry=137, tmt=..., log=0x7f83815734e0, threshold=threshold@entry=0.59999999999999998) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/RegionTableCheckOptimize.cpp:106
#8 0x000000000732d169 in DB::RegionTable::checkTableOptimize (this=this@entry=0x7f83816e8088, table_id=137, threshold=threshold@entry=0.59999999999999998) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/RegionTableCheckOptimize.cpp:173
#9 0x000000000732dd30 in DB::RegionTable::checkTableOptimize (this=0x7f83816e8088) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/RegionTableCheckOptimize.cpp:158
#10 0x00000000072c987f in operator() (__closure=0x7f8352f40818) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/BackgroundService.cpp:21
#11 std::_Function_handler<bool(), DB::BackgroundService::BackgroundService(DB::TMTContext&)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/local/include/c++/7.4.0/bits/std_function.h:302
#12 0x0000000007200fa7 in operator() (this=<optimized out>) at /usr/local/include/c++/7.4.0/bits/std_function.h:706
#13 DB::BackgroundProcessingPool::threadFunction (this=0x7f8343c8fa50) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/MergeTree/BackgroundProcessingPool.cpp:187
...
# Applying a raft admin cmd to remove region from TiFlash node
# Try to acquire read lock on "drop_lock"(0x7f832e69a080), without timeout
# not sure which thread hold a read lock on "drop_lock"
Thread 61 (Thread 0x7f829fb7f700 (LWP 219161)):
...
#3 0x00000000079c2e33 in wait<DB::RWLock::getLock(DB::RWLock::Type, const String&, const milliseconds&)::<lambda()> > (__p=..., __lock=..., this=0x7f8352e749c0) at /usr/local/include/c++/7.4.0/condition_variable:99
#4 DB::RWLock::getLock (this=0x7f832e69a080, type=type@entry=DB::RWLock::Read, query_id=..., lock_timeout_ms=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Common/RWLock.cpp:170
#5 0x00000000070a59f4 in DB::IStorage::tryLockTimed (this=this@entry=0x7f832e631010, rwlock=..., type=type@entry=DB::RWLock::Read, query_id=..., acquire_timeout=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/IStorage.cpp:17
#6 0x00000000070a5d59 in DB::IStorage::lockForShare (this=this@entry=0x7f832e631010, query_id=..., acquire_timeout=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/IStorage.cpp:32
#7 0x000000000732a3a9 in removeObsoleteDataInStorage (handle_range=..., table_id=137, context=0x7f8381483000) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/RegionTable.cpp:242
#8 DB::RegionTable::removeRegion (this=this@entry=0x7f83816e8088, region_id=region_id@entry=2464367, remove_data=remove_data@entry=true) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/RegionTable.cpp:305
#9 0x00000000072e2e99 in DB::KVStore::removeRegion (this=this@entry=0x7f83819c5a10, region_id=region_id@entry=2464367, remove_data=remove_data@entry=true, region_table=..., task_lock=..., region_lock=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/KVStore.cpp:177
#10 0x00000000072e55c7 in operator() (__closure=<synthetic pointer>) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/KVStore.cpp:451
#11 DB::KVStore::handleAdminRaftCmd(raft_cmdpb::AdminRequest&&, raft_cmdpb::AdminResponse&&, unsigned long, unsigned long, unsigned long, DB::TMTContext&) (this=0x7f83819c5a10, request=request@entry=<unknown type in /data1/tidb-deploy-upgrade0/tiflash-9900/bin/tiflash/tiflash, CU 0xe1a131d, DIE 0xe3482e4>, response=response@entry=<unknown type in /data1/tidb-deploy-upgrade0/tiflash-9900/bin/tiflash/tiflash, CU 0xe1a131d, DIE 0xe3482f5>, curr_region_id=curr_region_id@entry=2464367, index=index@entry=9, term=term@entry=6, tmt=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/KVStore.cpp:497
#12 0x00000000072f0988 in DB::HandleAdminRaftCmd (server=0x7ffe939ea620, req_buff=..., resp_buff=..., header=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/ProxyFFI.cpp:85
...
# Background physically drop table
# Try to acquire write lock on "alter_lock", with timeout, not sure which thread hold a read lock on "alter_lock"
Thread 236 (Thread 0x7f83409fa700 (LWP 218894)):
...
#5 DB::RWLock::getLock (this=0x7f832e64e000, type=type@entry=DB::RWLock::Write, query_id=..., lock_timeout_ms=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Common/RWLock.cpp:176
#6 0x00000000070a59f4 in DB::IStorage::tryLockTimed (this=this@entry=0x7f832e631010, rwlock=..., type=type@entry=DB::RWLock::Write, query_id=..., acquire_timeout=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/IStorage.cpp:17
#7 0x00000000070a6323 in DB::IStorage::lockExclusively (this=this@entry=0x7f832e631010, query_id=..., acquire_timeout=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/IStorage.cpp:66
#8 0x00000000075b4233 in DB::InterpreterDropQuery::execute (this=this@entry=0x7f83409d03a0) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Interpreters/InterpreterDropQuery.cpp:133
#9 0x00000000073423ec in DB::SchemaSyncService::gc (this=0x7f8354ef68f0, gc_safe_point=426312505650315264) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/SchemaSyncService.cpp:132
#10 0x000000000734465d in _ZZN2DB17SchemaSyncServiceC4ERNS_7ContextEENKUlvE_clEv () at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/SchemaSyncService.cpp:40
#11 0x0000000007200fa7 in operator() (this=<optimized out>) at /usr/local/include/c++/7.4.0/bits/std_function.h:706
#12 DB::BackgroundProcessingPool::threadFunction (this=0x7f8343c8fa50) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/MergeTree/BackgroundProcessingPool.cpp:187
...
# Thread 10 is stuck by wait index. The `DAGQueryBlock` being executing has children (there are a frame with InterpreterDAG.cpp:48).
# No sure whether it holds table lock or not.
Thread 10 (Thread 0x7f81cbdff700 (LWP 294957)):
...
#3 0x000000000731abf2 in wait<DB::RegionMeta::waitIndex(DB::UInt64, const atomic_bool&) const::<lambda()> > (__p=..., __lock=..., this=0x7f82a74a94c8) at /usr/local/include/c++/7.4.0/condition_variable:99
#4 DB::RegionMeta::waitIndex (this=this@entry=0x7f82a74a93f8, index=index@entry=13, terminated=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/RegionMeta.cpp:151
#5 0x00000000072f21fc in DB::Region::waitIndex (this=this@entry=0x7f82a74a9310, index=13, terminated=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/Region.cpp:493
#6 0x00000000076e76f3 in DB::<lambda(size_t)>::operator()(size_t) const (__closure=__closure@entry=0x7f81cbdcf8f0, region_begin_idx=region_begin_idx@entry=0) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/LearnerRead.cpp:199
#7 0x00000000076e9585 in DB::doLearnerRead (table_id=<optimized out>, mvcc_query_info=..., num_streams=<optimized out>, tmt=..., log=<optimized out>) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/LearnerRead.cpp:243
#8 0x00000000078cac50 in DB::DAGQueryBlockInterpreter::executeTS (this=this@entry=0x7f81cbdd22b0, ts=..., pipeline=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Coprocessor/DAGQueryBlockInterpreter.cpp:221
#9 0x00000000078cdbbd in DB::DAGQueryBlockInterpreter::executeImpl (this=this@entry=0x7f81cbdd22b0, pipeline=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Coprocessor/DAGQueryBlockInterpreter.cpp:1348
#10 0x00000000078cf07d in DB::DAGQueryBlockInterpreter::execute (this=this@entry=0x7f81cbdd22b0) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Coprocessor/DAGQueryBlockInterpreter.cpp:1436
#11 0x000000000788811d in DB::InterpreterDAG::executeQueryBlock (this=this@entry=0x7f8209a24380, query_block=..., subqueriesForSets=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Coprocessor/InterpreterDAG.cpp:53
#12 0x0000000007887f36 in DB::InterpreterDAG::executeQueryBlock (this=this@entry=0x7f8209a24380, query_block=..., subqueriesForSets=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Coprocessor/InterpreterDAG.cpp:48
#13 0x0000000007888ba5 in DB::InterpreterDAG::execute (this=0x7f8209a24380) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Coprocessor/InterpreterDAG.cpp:80
#14 0x0000000007017451 in DB::executeQueryImpl (query_src=..., context=..., internal=internal@entry=false, stage=stage@entry=DB::QueryProcessingStage::Complete) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Interpreters/executeQuery.cpp:205
#15 0x00000000070186e4 in DB::executeQuery (dag=..., context=..., internal=internal@entry=false, stage=stage@entry=DB::QueryProcessingStage::Complete) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Interpreters/executeQuery.cpp:398
#16 0x00000000078a0ffe in DB::MPPTask::prepare (this=0x7f822bd51c10, task_request=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Mpp/MPPHandler.cpp:184
#17 0x00000000078a4241 in DB::MPPHandler::execute (this=this@entry=0x7f81cbdd4270, context=..., response=0x7f81cbdd5450) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Mpp/MPPHandler.cpp:368
#18 0x00000000078570db in DB::FlashService::DispatchMPPTask (this=0x7f8381e207a0, grpc_context=0x7f8209a88020, request=0x7f8209a0d3b0, response=<optimized out>) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/FlashService.cpp:140
...
# Not sure whether it holds table lock or not
Thread 44 (Thread 0x7f7e5a2fe700 (LWP 222685)):
...
#3 0x0000000007857733 in wait<DB::MPPTunnel::waitForFinish()::<lambda()> > (__p=..., __lock=..., this=0x7f81ecec87c8) at /usr/local/include/c++/7.4.0/condition_variable:99
#4 waitForFinish (this=0x7f81ecec8770) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Mpp/MPPHandler.h:168
#5 DB::FlashService::EstablishMPPConnection (this=0x7f8381e207a0, grpc_context=<optimized out>, request=<optimized out>, writer=0x7f7e5a2d45c0) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/FlashService.cpp:200
#6 0x00000000080b5ec9 in __invoke_impl<grpc::Status, grpc::Status (tikvpb::Tikv::Service::* const&)(grpc_impl::ServerContext*, mpp::EstablishMPPConnectionRequest const*, grpc_impl::ServerWriter<mpp::MPPDataPacket>*), tikvpb::Tikv::Service*, grpc_impl::ServerContext*, mpp::EstablishMPPConnectionRequest const*, grpc_impl::ServerWriter<mpp::MPPDataPacket>*> (__t=<optimized out>, __f=<optimized out>) at /usr/local/include/c++/7.4.0/bits/invoke.h:73
#7 __invoke<grpc::Status (tikvpb::Tikv::Service::* const&)(grpc_impl::ServerContext*, mpp::EstablishMPPConnectionRequest const*, grpc_impl::ServerWriter<mpp::MPPDataPacket>*), tikvpb::Tikv::Service*, grpc_impl::ServerContext*, mpp::EstablishMPPConnectionRequest const*, grpc_impl::ServerWriter<mpp::MPPDataPacket>*> (__fn=<optimized out>) at /usr/local/include/c++/7.4.0/bits/invoke.h:96
#8 operator()<tikvpb::Tikv::Service*, grpc_impl::ServerContext*, const mpp::EstablishMPPConnectionRequest*, grpc_impl::ServerWriter<mpp::MPPDataPacket>*> (this=<optimized out>) at /usr/local/include/c++/7.4.0/functional:175
#9 std::_Function_handler<grpc::Status (tikvpb::Tikv::Service*, grpc_impl::ServerContext*, mpp::EstablishMPPConnectionRequest const*, grpc_impl::ServerWriter<mpp::MPPDataPacket>*), std::_Mem_fn<grpc::Status (tikvpb::Tikv::Service::*)(grpc_impl::ServerContext*, mpp::EstablishMPPConnectionRequest const*, grpc_impl::ServerWriter<mpp::MPPDataPacket>*)> >::_M_invoke(std::_Any_data const&, tikvpb::Tikv::Service*&&, grpc_impl::ServerContext*&&, mpp::EstablishMPPConnectionRequest const*&&, grpc_impl::ServerWriter<mpp::MPPDataPacket>*&&) (__functor=..., __args#0=<optimized out>, __args#1=<optimized out>, __args#2=<optimized out>, __args#3=<optimized out>) at /usr/local/include/c++/7.4.0/bits/std_function.h:302
...
# Thread 3,4,5,7,8,9 share a similar backtrace, they are stuck by wait index
# because all raft-applying thread are stuck
# Seems not holding any table lock
Thread 9 (Thread 0x7f81c97fc700 (LWP 294958)):
...
#3 0x000000000731abf2 in wait<DB::RegionMeta::waitIndex(DB::UInt64, const atomic_bool&) const::<lambda()> > (__p=..., __lock=..., this=0x7f829f2413c8) at /usr/local/include/c++/7.4.0/condition_variable:99
#4 DB::RegionMeta::waitIndex (this=this@entry=0x7f829f2412f8, index=index@entry=8626, terminated=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/RegionMeta.cpp:151
#5 0x00000000072f21fc in DB::Region::waitIndex (this=this@entry=0x7f829f241210, index=8626, terminated=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/Region.cpp:493
#6 0x00000000076e76f3 in DB::<lambda(size_t)>::operator()(size_t) const (__closure=__closure@entry=0x7f81c97cca80, region_begin_idx=region_begin_idx@entry=0) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/LearnerRead.cpp:199
#7 0x00000000076e9585 in DB::doLearnerRead (table_id=<optimized out>, mvcc_query_info=..., num_streams=<optimized out>, tmt=..., log=<optimized out>) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/LearnerRead.cpp:243
#8 0x00000000078cac50 in DB::DAGQueryBlockInterpreter::executeTS (this=this@entry=0x7f81c97cf440, ts=..., pipeline=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Coprocessor/DAGQueryBlockInterpreter.cpp:221
#9 0x00000000078cdbbd in DB::DAGQueryBlockInterpreter::executeImpl (this=this@entry=0x7f81c97cf440, pipeline=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Coprocessor/DAGQueryBlockInterpreter.cpp:1348
#10 0x00000000078cf07d in DB::DAGQueryBlockInterpreter::execute (this=this@entry=0x7f81c97cf440) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Coprocessor/DAGQueryBlockInterpreter.cpp:1436
#11 0x000000000788811d in DB::InterpreterDAG::executeQueryBlock (this=this@entry=0x7f81f0257150, query_block=..., subqueriesForSets=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Coprocessor/InterpreterDAG.cpp:53
#12 0x0000000007888ba5 in DB::InterpreterDAG::execute (this=0x7f81f0257150) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Coprocessor/InterpreterDAG.cpp:80
#13 0x0000000007017451 in DB::executeQueryImpl (query_src=..., context=..., internal=internal@entry=false, stage=stage@entry=DB::QueryProcessingStage::Complete) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Interpreters/executeQuery.cpp:205
#14 0x00000000070186e4 in DB::executeQuery (dag=..., context=..., internal=internal@entry=false, stage=stage@entry=DB::QueryProcessingStage::Complete) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Interpreters/executeQuery.cpp:398
#15 0x00000000078a0ffe in DB::MPPTask::prepare (this=0x7f81f02bd010, task_request=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Mpp/MPPHandler.cpp:184
#16 0x00000000078a4241 in DB::MPPHandler::execute (this=this@entry=0x7f81c97d1270, context=..., response=0x7f81c97d2450) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Mpp/MPPHandler.cpp:368
#17 0x00000000078570db in DB::FlashService::DispatchMPPTask (this=0x7f8381e207a0, grpc_context=0x7f81f0247020, request=0x7f81f02053b0, response=<optimized out>) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/FlashService.cpp:140
...
# Apply snapshot get stuck by `KVStoreTaskLock`, which is being held by thread 61
Thread 70 (Thread 0x7f82a7ffe700 (LWP 219152)):
...
#6 KVStoreTaskLock (mutex_=..., this=0x7f82a7fd2ed8) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/KVStore.h:156
#7 DB::KVStore::genTaskLock (this=this@entry=0x7f83819c5a10) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/KVStore.cpp:182
#8 0x00000000072decbd in DB::KVStore::handleRegionsByRangeOverlap(std::pair<DB::TiKVRangeKey, DB::TiKVRangeKey> const&, std::function<void (std::unordered_map<unsigned long, std::shared_ptr<DB::Region>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<DB::Region> > > >, DB::KVStoreTaskLock const&)>&&) const (this=this@entry=0x7f83819c5a10, range=..., callback=callback@entry=<unknown type in /data1/tidb-deploy-upgrade0/tiflash-9900/bin/tiflash/tiflash, CU 0xe1a131d, DIE 0xe368b19>) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/KVStore.cpp:61
#9 0x00000000076dc382 in DB::KVStore::checkAndApplySnapshot (this=this@entry=0x7f83819c5a10, new_region=..., tmt=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/ApplySnapshot.cpp:75
#10 0x00000000076dd523 in DB::KVStore::handlePreApplySnapshot (this=0x7f83819c5a10, new_region=..., tmt=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/ApplySnapshot.cpp:299
#11 0x00000000072eea30 in DB::ApplyPreHandledSnapshot (server=0x7ffe939ea620, snap=<optimized out>) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/ProxyFFI.cpp:254
#12 0x00000000072eeaee in DB::ApplyPreHandledSnapshot (server=<optimized out>, res=<optimized out>, type=<optimized out>) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/ProxyFFI.cpp:270
...
# Apply snapshot get stuck by `KVStoreTaskLock`, which is being held by thread 61
Thread 60 (Thread 0x7f829f97e700 (LWP 219162)):
...
#6 KVStoreTaskLock (mutex_=..., this=0x7f829f951878) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/KVStore.h:156
#7 DB::KVStore::genTaskLock (this=this@entry=0x7f83819c5a10) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/KVStore.cpp:182
#8 0x00000000072e4d83 in DB::KVStore::handleAdminRaftCmd(raft_cmdpb::AdminRequest&&, raft_cmdpb::AdminResponse&&, unsigned long, unsigned long, unsigned long, DB::TMTContext&) (this=0x7f83819c5a10, request=request@entry=<unknown type in /data1/tidb-deploy-upgrade0/tiflash-9900/bin/tiflash/tiflash, CU 0xe1a131d, DIE 0xe3482e4>, response=response@entry=<unknown type in /data1/tidb-deploy-upgrade0/tiflash-9900/bin/tiflash/tiflash, CU 0xe1a131d, DIE 0xe3482f5>, curr_region_id=curr_region_id@entry=2456452, index=index@entry=10, term=term@entry=7, tmt=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/KVStore.cpp:353
#9 0x00000000072f0988 in DB::HandleAdminRaftCmd (server=0x7ffe939ea620, req_buff=..., resp_buff=..., header=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/ProxyFFI.cpp:85
...
# Thread 59, 58, Apply admin raft cmd get stuck by `KVStoreTaskLock`, which is being held by thread 61
Thread 59 (Thread 0x7f829f1ff700 (LWP 219163)):
...
#6 KVStoreTaskLock (mutex_=..., this=0x7f829f1d2878) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/KVStore.h:156
#7 DB::KVStore::genTaskLock (this=this@entry=0x7f83819c5a10) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/KVStore.cpp:182
#8 0x00000000072e4d83 in DB::KVStore::handleAdminRaftCmd(raft_cmdpb::AdminRequest&&, raft_cmdpb::AdminResponse&&, unsigned long, unsigned long, unsigned long, DB::TMTContext&) (this=0x7f83819c5a10, request=request@entry=<unknown type in /data1/tidb-deploy-upgrade0/tiflash-9900/bin/tiflash/tiflash, CU 0xe1a131d, DIE 0xe3482e4>, response=response@entry=<unknown type in /data1/tidb-deploy-upgrade0/tiflash-9900/bin/tiflash/tiflash, CU 0xe1a131d, DIE 0xe3482f5>, curr_region_id=curr_region_id@entry=218960, index=index@entry=24, term=term@entry=8, tmt=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/KVStore.cpp:353
#9 0x00000000072f0988 in DB::HandleAdminRaftCmd (server=0x7ffe939ea620, req_buff=..., resp_buff=..., header=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/ProxyFFI.cpp:85
...
Details on why the thread of removeObsoleteDataInStorage
and shouldOptimizeTable
are in the same read group:
# checkout the thread of `removeObsoleteDataInStorage`, it is waiting on a read lock
# on "drop_lock"
# The reader group (it_group => 0x7f8352e749a0)
# It is waiting on current reader group (rdlock_owner => 0x7f7e5520fe20) to release the read lock
(gdb) thread xx
[Switching to thread xx (Thread 0x7f829fb7f700 (LWP 219161))]
...
#3 0x00000000079c2e33 in wait<DB::RWLock::getLock(DB::RWLock::Type, const String&, const milliseconds&)::<lambda()> > (__p=..., __lock=..., this=0x7f8352e749c0) at /usr/local/include/c++/7.4.0/condition_variable:99
#4 DB::RWLock::getLock (this=0x7f832e69a080, type=type@entry=DB::RWLock::Read, query_id=..., lock_timeout_ms=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Common/RWLock.cpp:170
#5 0x00000000070a59f4 in DB::IStorage::tryLockTimed (this=this@entry=0x7f832e631010, rwlock=..., type=type@entry=DB::RWLock::Read, query_id=..., acquire_timeout=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/IStorage.cpp:17
#6 0x00000000070a5d59 in DB::IStorage::lockForShare (this=this@entry=0x7f832e631010, query_id=..., acquire_timeout=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/IStorage.cpp:32
#7 0x000000000732a3a9 in removeObsoleteDataInStorage (handle_range=..., table_id=137, context=0x7f8381483000) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/RegionTable.cpp:242
#8 DB::RegionTable::removeRegion (this=this@entry=0x7f83816e8088, region_id=region_id@entry=2464367, remove_data=remove_data@entry=true) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/RegionTable.cpp:305
#9 0x00000000072e2e99 in DB::KVStore::removeRegion (this=this@entry=0x7f83819c5a10, region_id=region_id@entry=2464367, remove_data=remove_data@entry=true, region_table=..., task_lock=..., region_lock=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/KVStore.cpp:177
#10 0x00000000072e55c7 in operator() (__closure=<synthetic pointer>) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/KVStore.cpp:451
#11 DB::KVStore::handleAdminRaftCmd(raft_cmdpb::AdminRequest&&, raft_cmdpb::AdminResponse&&, unsigned long, unsigned long, unsigned long, DB::TMTContext&) (this=0x7f83819c5a10, request=request@entry=<unknown type in /data1/tidb-deploy-upgrade0/tiflash-9900/bin/tiflash/tiflash, CU 0xe1a131d, DIE 0xe3482e4>,
response=response@entry=<unknown type in /data1/tidb-deploy-upgrade0/tiflash-9900/bin/tiflash/tiflash, CU 0xe1a131d, DIE 0xe3482f5>, curr_region_id=curr_region_id@entry=2464367, index=index@entry=9, term=term@entry=6, tmt=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/KVStore.cpp:497
#12 0x00000000072f0988 in DB::HandleAdminRaftCmd (server=0x7ffe939ea620, req_buff=..., resp_buff=..., header=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/ProxyFFI.cpp:85
...
(gdb) up
#4 DB::RWLock::getLock (this=0x7f832e69a080, type=type@entry=DB::RWLock::Read, query_id=..., lock_timeout_ms=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Common/RWLock.cpp:170
170 /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Common/RWLock.cpp: No such file or directory.
# A readers list with 2 groups
(gdb) p it_group
$1 = {_M_node = 0x7f8352e749a0}
(gdb) p it_group._M_node
$2 = (std::__detail::_List_node_base *) 0x7f8352e749a0
(gdb) p it_group._M_node._M_next
$3 = (std::__detail::_List_node_base *) 0x7f832e69a0b8
(gdb) p it_group._M_node._M_next._M_next
$4 = (std::__detail::_List_node_base *) 0x7f7e5520fe20
(gdb) p it_group._M_node._M_next._M_next._M_next
$5 = (std::__detail::_List_node_base *) 0x7f8352e749a0
(gdb) p rdlock_owner
$6 = {_M_node = 0x7f7e5520fe20}
# Empty writers list
(gdb) p wrlock_owner
$7 = {_M_node = 0x7f832e69a0d0}
(gdb) p wrlock_owner._M_node
$8 = (std::__detail::_List_node_base *) 0x7f832e69a0d0
(gdb) p *wrlock_owner._M_node
$9 = {_M_next = 0x7f832e69a0d0, _M_prev = 0x7f832e69a0d0}
# checkout the thread of `shouldOptimizeTable`, it is waiting on a read lock on "drop_lock",
# waiting on the same reader group with the thread of `removeObsoleteDataInStorage` (it_group => 0x7f8352e749a0)
(gdb) bt
[Switching to thread (Thread 0x7f833d7f5700 (LWP 218899))]
...
#3 0x00000000079c2e33 in wait<DB::RWLock::getLock(DB::RWLock::Type, const String&, const milliseconds&)::<lambda()> > (__p=..., __lock=..., this=0x7f8352e749c0) at /usr/local/include/c++/7.4.0/condition_variable:99
#4 DB::RWLock::getLock (this=0x7f832e69a080, type=type@entry=DB::RWLock::Read, query_id=..., lock_timeout_ms=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Common/RWLock.cpp:170
#5 0x00000000070a59f4 in DB::IStorage::tryLockTimed (this=this@entry=0x7f832e631010, rwlock=..., type=type@entry=DB::RWLock::Read, query_id=..., acquire_timeout=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/IStorage.cpp:17
#6 0x00000000070a5fcb in DB::IStorage::lockStructureForShare (this=0x7f832e631010, query_id=..., acquire_timeout=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/IStorage.cpp:58
#7 0x000000000732c0b7 in DB::shouldOptimizeTable (table_id=table_id@entry=137, tmt=..., log=0x7f83815734e0, threshold=threshold@entry=0.59999999999999998) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/RegionTableCheckOptimize.cpp:106
#8 0x000000000732d169 in DB::RegionTable::checkTableOptimize (this=this@entry=0x7f83816e8088, table_id=137, threshold=threshold@entry=0.59999999999999998) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/RegionTableCheckOptimize.cpp:173
#9 0x000000000732dd30 in DB::RegionTable::checkTableOptimize (this=0x7f83816e8088) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/RegionTableCheckOptimize.cpp:158
#10 0x00000000072c987f in operator() (__closure=0x7f8352f40818) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/BackgroundService.cpp:21
#11 std::_Function_handler<bool(), DB::BackgroundService::BackgroundService(DB::TMTContext&)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/local/include/c++/7.4.0/bits/std_function.h:302
#12 0x0000000007200fa7 in operator() (this=<optimized out>) at /usr/local/include/c++/7.4.0/bits/std_function.h:706
#13 DB::BackgroundProcessingPool::threadFunction (this=0x7f8343c8fa50) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/MergeTree/BackgroundProcessingPool.cpp:187
...
(gdb) up
#4 DB::RWLock::getLock (this=0x7f832e69a080, type=type@entry=DB::RWLock::Read, query_id=..., lock_timeout_ms=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Common/RWLock.cpp:170
170 /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Common/RWLock.cpp: No such file or directory.
(gdb) p it_group
$10 = {_M_node = 0x7f8352e749a0}
# checkout the thread of `SchemaSyncService::gc`
# it is waiting a write lock on the "alter_lock"
(gdb) bt
...
#5 DB::RWLock::getLock (this=0x7f832e64e000, type=type@entry=DB::RWLock::Write, query_id=..., lock_timeout_ms=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Common/RWLock.cpp:176
#6 0x00000000070a59f4 in DB::IStorage::tryLockTimed (this=this@entry=0x7f832e631010, rwlock=..., type=type@entry=DB::RWLock::Write, query_id=..., acquire_timeout=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/IStorage.cpp:17
#7 0x00000000070a6323 in DB::IStorage::lockExclusively (this=this@entry=0x7f832e631010, query_id=..., acquire_timeout=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/IStorage.cpp:66
#8 0x00000000075b4233 in DB::InterpreterDropQuery::execute (this=this@entry=0x7f83413d13a0) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Interpreters/InterpreterDropQuery.cpp:133
#9 0x00000000073423ec in DB::SchemaSyncService::gc (this=0x7f8354ef68f0, gc_safe_point=426317224242315264) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/SchemaSyncService.cpp:132
#10 0x000000000734465d in _ZZN2DB17SchemaSyncServiceC4ERNS_7ContextEENKUlvE_clEv () at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/SchemaSyncService.cpp:40
...
(gdb) up
#5 DB::RWLock::getLock (this=0x7f832e64e000, type=type@entry=DB::RWLock::Write, query_id=..., lock_timeout_ms=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Common/RWLock.cpp:176
176 /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Common/RWLock.cpp: No such file or directory.
(gdb) p it_group
$11 = {_M_node = 0x7f8294275260}
(gdb) p this
$12 = (DB::RWLock * const) 0x7f832e64e000
(gdb) p it_group
$13 = {_M_node = 0x7f8294275260}
(gdb) p rdlock_owner
$14 = {_M_node = 0x7f8296263740}
# A reader list with one reader
(gdb) p *rdlock_owner._M_node
$17 = {_M_next = 0x7f832e64e038, _M_prev = 0x7f832e64e038}
(gdb) p *rdlock_owner._M_node._M_next
$18 = {_M_next = 0x7f8296263740, _M_prev = 0x7f8296263740}
(gdb) p *rdlock_owner._M_node._M_next._M_next
$19 = {_M_next = 0x7f832e64e038, _M_prev = 0x7f832e64e038}
# Empty writers list
(gdb) p wrlock_owner
$20 = {_M_node = 0x7f832e64e050}
(gdb) p wrlock_owner._M_node
$21 = (std::__detail::_List_node_base *) 0x7f832e64e050
(gdb) p wrlock_owner._M_node._M_next
$22 = (std::__detail::_List_node_base *) 0x7f8294275260
(gdb) p wrlock_owner._M_node._M_next._M_next
$23 = (std::__detail::_List_node_base *) 0x7f832e64e050
(gdb) p wrlock_owner._M_node._M_next._M_next._M_next
@jiangyuzhao please help me reproduce this situation with branch [1], it will print out the query ID if happen to a deadlock.
Find another issue: https://github.com/pingcap/tics/issues/2529. Maybe the changes in the previous commit make it hard to reproduce this deadlock. Will try it with fewer changes.
From the stack in previous comment, we can see that thread 10[1] is a DAGQueryBlock
with children (maybe an MPPTask with "join"). Assume that it is a join query that table "t1" join table "t2", it will first do learner read and acquire a read lock on "drop" for "t1", save the input streams from "t1" into input_streams_vec
. Then it will do learner read and acquire a read lock on "drop" for "t2".
windtalker says that one DAGQueryBlock
only read one table. "t1" join "t2" will be split into different DAGRequest
[1]
# Thread 10 is stuck by wait index. The `DAGQueryBlock` being executing has children (there are a frame with InterpreterDAG.cpp:48).
# No sure whether it holds table lock or not.
Thread 10 (Thread 0x7f81cbdff700 (LWP 294957)):
...
#3 0x000000000731abf2 in wait<DB::RegionMeta::waitIndex(DB::UInt64, const atomic_bool&) const::<lambda()> > (__p=..., __lock=..., this=0x7f82a74a94c8) at /usr/local/include/c++/7.4.0/condition_variable:99
#4 DB::RegionMeta::waitIndex (this=this@entry=0x7f82a74a93f8, index=index@entry=13, terminated=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/RegionMeta.cpp:151
#5 0x00000000072f21fc in DB::Region::waitIndex (this=this@entry=0x7f82a74a9310, index=13, terminated=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/Region.cpp:493
#6 0x00000000076e76f3 in DB::<lambda(size_t)>::operator()(size_t) const (__closure=__closure@entry=0x7f81cbdcf8f0, region_begin_idx=region_begin_idx@entry=0) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/LearnerRead.cpp:199
#7 0x00000000076e9585 in DB::doLearnerRead (table_id=<optimized out>, mvcc_query_info=..., num_streams=<optimized out>, tmt=..., log=<optimized out>) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Storages/Transaction/LearnerRead.cpp:243
#8 0x00000000078cac50 in DB::DAGQueryBlockInterpreter::executeTS (this=this@entry=0x7f81cbdd22b0, ts=..., pipeline=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Coprocessor/DAGQueryBlockInterpreter.cpp:221
#9 0x00000000078cdbbd in DB::DAGQueryBlockInterpreter::executeImpl (this=this@entry=0x7f81cbdd22b0, pipeline=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Coprocessor/DAGQueryBlockInterpreter.cpp:1348
#10 0x00000000078cf07d in DB::DAGQueryBlockInterpreter::execute (this=this@entry=0x7f81cbdd22b0) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Coprocessor/DAGQueryBlockInterpreter.cpp:1436
#11 0x000000000788811d in DB::InterpreterDAG::executeQueryBlock (this=this@entry=0x7f8209a24380, query_block=..., subqueriesForSets=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Coprocessor/InterpreterDAG.cpp:53
#12 0x0000000007887f36 in DB::InterpreterDAG::executeQueryBlock (this=this@entry=0x7f8209a24380, query_block=..., subqueriesForSets=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Coprocessor/InterpreterDAG.cpp:48
#13 0x0000000007888ba5 in DB::InterpreterDAG::execute (this=0x7f8209a24380) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Coprocessor/InterpreterDAG.cpp:80
#14 0x0000000007017451 in DB::executeQueryImpl (query_src=..., context=..., internal=internal@entry=false, stage=stage@entry=DB::QueryProcessingStage::Complete) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Interpreters/executeQuery.cpp:205
#15 0x00000000070186e4 in DB::executeQuery (dag=..., context=..., internal=internal@entry=false, stage=stage@entry=DB::QueryProcessingStage::Complete) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Interpreters/executeQuery.cpp:398
#16 0x00000000078a0ffe in DB::MPPTask::prepare (this=0x7f822bd51c10, task_request=...) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Mpp/MPPHandler.cpp:184
#17 0x00000000078a4241 in DB::MPPHandler::execute (this=this@entry=0x7f81cbdd4270, context=..., response=0x7f81cbdd5450) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/Mpp/MPPHandler.cpp:368
#18 0x00000000078570db in DB::FlashService::DispatchMPPTask (this=0x7f8381e207a0, grpc_context=0x7f8209a88020, request=0x7f8209a0d3b0, response=<optimized out>) at /home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/tics/dbms/src/Flash/FlashService.cpp:140
...
lower the severity to moderate cause it is hard to reproduce
v5.0.3 thread apply all bt 218815.log