Closed JaySon-Huang closed 3 months ago
Some details about this crash The query pipeline is
CreatingSets
Union
HashJoinBuildBlockInputStream × 32
Expression
Expression
Squashing
TiRemoteBlockInputStream(ExchangeReceiver)
Union
SharedQuery × 32
Union
ExchangeSender × 32
Expression
Expression
Expression
Expression
Expression
Squashing
TiRemoteBlockInputStream(ExchangeReceiver)
Union
ExchangeSender × 32
Expression
Expression
NonJoined
The task already finish build hash table, and meet error when trying to send some results to TiDB(meet grpc writes fails, I suspect this fail is because TiDB close the connection first). After the task meet error, it goes to error handling process, which is as expected
[2024/07/04 09:41:35.906 +08:00] [ERROR] [MPPTask.cpp:334] ["MPPTask:<query:450902692276731912,task:10> task running meets error: DB::Exception: has finished, tunnel10+27 grpc writes failed. Stack Trace : 0. bin/tiflash/tiflash(StackTrace::StackTrace()+0x16) [0x36d5546]\n1. bin/tiflash/tiflash(DB::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)+0x26) [0x36c9376]\n2. bin/tiflash/tiflash(DB::MPPTunnelBase<grpc_impl::ServerWriter<mpp::MPPDataPacket> >::writeDone()+0x281f) [0x7c9c16f]\n3. bin/tiflash/tiflash(DB::MPPTask::finishWrite()+0x21) [0x7c89111]\n4. bin/tiflash/tiflash(DB::MPPTask::runImpl()+0x399) [0x7c8e119]\n5. bin/tiflash/tiflash(_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN2DB13ThreadFactory9newThreadISt8functionIFvvEEJEEES_bNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEOT_DpOT0_EUlDpOT_E_EEEEE6_M_runEv+0x63) [0x7dca0a3]\n6. bin/tiflash/tiflash() [0x8b11bef]\n7. /lib64/libpthread.so.0(+0x7ea5) [0x7f22f1d73ea5]\n8. /lib64/libc.so.6(clone+0x6d) [0x7f22f179a96d]\n"] [thread_id=1345780]
[2024/07/04 09:41:35.999 +08:00] [DEBUG] [<unknown>] ["MPPTask:<query:450902692276731912,task:24> ExchangeReceiver finish read : sender_meta {\n start_ts: 450902692276731912\n task_id: 14\n address: \"172.25.215.86:3930\"\n}\nreceiver_meta {\n start_ts: 450902692276731912\n task_id: 24\n address: \"172.25.214.93:3930\"\n}\n"] [thread_id=1345787]
[2024/07/04 09:41:35.999 +08:00] [DEBUG] [<unknown>] ["MPPTask:<query:450902692276731912,task:24> ExchangeReceiver 14 -> 24 end! current alive connections: 1"] [thread_id=1345787]
[2024/07/04 09:41:36.012 +08:00] [INFO] [MPPTask.cpp:369] ["MPPTask:<query:450902692276731912,task:10> task ends, time cost is 28558 ms."] [thread_id=1345780]
[2024/07/04 09:41:36.012 +08:00] [DEBUG] [<unknown>] ["MPPTask:<query:450902692276731912,task:10> task unregistered"] [thread_id=1345780]
[2024/07/04 09:41:36.012 +08:00] [INFO] [<unknown>] ["MPPTask:<query:450902692276731912,task:10> finish task"] [thread_id=1345780]
[2024/07/04 09:41:36.012 +08:00] [INFO] [MPPTaskStatistics.cpp:113] ["mpp_task_tracing:<query:450902692276731912,task:10> {\"query_tso\":450902692276731912,\"task_id\":10,\"is_root\":false,\"sender_executor_id\":\"ExchangeSender_76\",\"executors\":[{\"id\":\"ExchangeSender_76\",\"type\":\"ExchangeSender\",\"children\":[\"TableFullScan_75\"],\"outbound_rows\":0,\"outbound_blocks\":0,\"outbound_bytes\":0,\"execution_time_ns\":0,\"partition_num\":7,\"sender_target_task_ids\":[22,23,24,25,26,27,28],\"exchange_type\":\"Hash\",\"connection_details\":[{\"tunnel_id\":\"tunnel10+22\",\"sender_target_task_id\":22,\"sender_target_host\":\"172.25.215.88:3930\",\"is_local\":false,\"packets\":0,\"bytes\":0},{\"tunnel_id\":\"tunnel10+23\",\"sender_target_task_id\":23,\"sender_target_host\":\"172.25.215.89:3930\",\"is_local\":false,\"packets\":0,\"bytes\":0},{\"tunnel_id\":\"tunnel10+24\",\"sender_target_task_id\":24,\"sender_target_host\":\"172.25.214.93:3930\",\"is_local\":true,\"packets\":0,\"bytes\":0},{\"tunnel_id\":\"tunnel10+25\",\"sender_target_task_id\":25,\"sender_target_host\":\"172.25.214.92:3930\",\"is_local\":false,\"packets\":0,\"bytes\":0},{\"tunnel_id\":\"tunnel10+26\",\"sender_target_task_id\":26,\"sender_target_host\":\"172.25.215.87:3930\",\"is_local\":false,\"packets\":0,\"bytes\":0},{\"tunnel_id\":\"tunnel10+27\",\"sender_target_task_id\":27,\"sender_target_host\":\"172.25.214.94:3930\",\"is_local\":false,\"packets\":0,\"bytes\":0},{\"tunnel_id\":\"tunnel10+28\",\"sender_target_task_id\":28,\"sender_target_host\":\"172.25.215.86:3930\",\"is_local\":false,\"packets\":0,\"bytes\":0}]},{\"id\":\"TableFullScan_75\",\"type\":\"TableScan\",\"children\":[],\"outbound_rows\":0,\"outbound_blocks\":0,\"outbound_bytes\":0,\"execution_time_ns\":0,\"connection_details\":[{\"is_local\":true,\"packets\":0,\"bytes\":0},{\"is_local\":false,\"packets\":0,\"bytes\":0}]}],\"host\":\"172.25.214.93:3930\",\"task_init_timestamp\":1720057267453588376,\"task_start_timestamp\":1720057267548268517,\"task_end_timestamp\":1720057296012583062,\"compile_start_timestamp\":1720057267458401888,\"compile_end_timestamp\":1720057267548266435,\"read_wait_index_start_timestamp\":1720057267484182749,\"read_wait_index_end_timestamp\":1720057267500841767,\"local_input_bytes\":0,\"remote_input_bytes\":0,\"output_bytes\":0,\"status\":\"FINISHED\",\"error_message\":\"DB::Exception: has finished, tunnel10+27 grpc writes failed.\",\"working_time\":0,\"memory_peak\":0}"] [thread_id=1345780]
[2024/07/04 09:41:36.012 +08:00] [DEBUG] [MPPTask.cpp:54] ["MPPTask:<query:450902692276731912,task:10> finish MPPTask: <query:450902692276731912,task:10>"] [thread_id=1345780]
[2024/07/04 09:41:36.092 +08:00] [ERROR] [MPPTask.cpp:334] ["MPPTask:<query:450902692276731912,task:24> task running meets error: DB::Exception: write to tunnel which is already closed,tunnel24+-1 grpc writes failed. Stack Trace : 0. bin/tiflash/tiflash(StackTrace::StackTrace()+0x16) [0x36d5546]\n1. bin/tiflash/tiflash(DB::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)+0x26) [0x36c9376]\n2. bin/tiflash/tiflash(DB::SharedQueryBlockInputStream::readImpl()+0x410) [0x3717030]\n3. bin/tiflash/tiflash(DB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator<false>, 15ul, 16ul>*&, bool)+0x3bd) [0x6d97d9d]\n4. bin/tiflash/tiflash(DB::IProfilingBlockInputStream::read()+0x17) [0x6d97f87]\n5. bin/tiflash/tiflash(DB::ParallelInputsProcessor<DB::UnionBlockInputStream<(DB::StreamUnionMode)0, true>::Handler, (DB::StreamUnionMode)0>::loop(unsigned long)+0x143) [0x7c6a873]\n6. bin/tiflash/tiflash(DB::ParallelInputsProcessor<DB::UnionBlockInputStream<(DB::StreamUnionMode)0, true>::Handler, (DB::StreamUnionMode)0>::thread(unsigned long)+0x1cc) [0x7c6b0dc]\n7. bin/tiflash/tiflash(_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN2DB13ThreadFactory9newThreadISt8functionIFvvEEJEEES_bNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEOT_DpOT0_EUlDpOT_E_EEEEE6_M_runEv+0x63) [0x7dca0a3]\n8. bin/tiflash/tiflash() [0x8b11bef]\n9. /lib64/libpthread.so.0(+0x7ea5) [0x7f22f1d73ea5]\n10. /lib64/libc.so.6(clone+0x6d) [0x7f22f179a96d]\n"] [thread_id=1345778]
[2024/07/04 09:41:36.105 +08:00] [INFO] [MPPTask.cpp:369] ["MPPTask:<query:450902692276731912,task:24> task ends, time cost is 28651 ms."] [thread_id=1345778]
[2024/07/04 09:41:36.105 +08:00] [DEBUG] [<unknown>] ["MPPTask:<query:450902692276731912,task:24> task unregistered"] [thread_id=1345778]
[2024/07/04 09:41:36.105 +08:00] [INFO] [<unknown>] ["MPPTask:<query:450902692276731912,task:24> finish task"] [thread_id=1345778]
[2024/07/04 09:41:36.105 +08:00] [INFO] [MPPTaskStatistics.cpp:113] ["mpp_task_tracing:<query:450902692276731912,task:24> {\"query_tso\":450902692276731912,\"task_id\":24,\"is_root\":true,\"sender_executor_id\":\"ExchangeSender_81\",\"executors\":[{\"id\":\"ExchangeReceiver_77\",\"type\":\"ExchangeReceiver\",\"children\":[],\"outbound_rows\":0,\"outbound_blocks\":0,\"outbound_bytes\":0,\"execution_time_ns\":0,\"partition_num\":7,\"receiver_source_task_ids\":[8,9,10,11,12,13,14],\"connection_details\":[{\"receiver_source_task_id\":8,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":9,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":10,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":11,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":12,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":13,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":14,\"packets\":0,\"bytes\":0}]},{\"id\":\"ExchangeReceiver_80\",\"type\":\"ExchangeReceiver\",\"children\":[],\"outbound_rows\":0,\"outbound_blocks\":0,\"outbound_bytes\":0,\"execution_time_ns\":0,\"partition_num\":7,\"receiver_source_task_ids\":[15,16,17,18,19,20,21],\"connection_details\":[{\"receiver_source_task_id\":15,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":16,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":17,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":18,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":19,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":20,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":21,\"packets\":0,\"bytes\":0}]},{\"id\":\"ExchangeSender_81\",\"type\":\"ExchangeSender\",\"children\":[\"HashJoin_72\"],\"outbound_rows\":0,\"outbound_blocks\":0,\"outbound_bytes\":0,\"execution_time_ns\":0,\"partition_num\":1,\"sender_target_task_ids\":[-1],\"exchange_type\":\"PassThrough\",\"connection_details\":[{\"tunnel_id\":\"tunnel24+-1\",\"sender_target_task_id\":-1,\"sender_target_host\":\"172.25.215.15:33324\",\"is_local\":false,\"packets\":0,\"bytes\":0}]},{\"id\":\"HashJoin_72\",\"type\":\"Join\",\"children\":[\"ExchangeReceiver_77\",\"ExchangeReceiver_80\"],\"outbound_rows\":0,\"outbound_blocks\":0,\"outbound_bytes\":0,\"execution_time_ns\":0,\"hash_table_bytes\":0,\"build_side_child\":\"\",\"non_joined_outbound_rows\":0,\"non_joined_outbound_blocks\":0,\"non_joined_outbound_bytes\":0,\"non_joined_execution_time_ns\":0}],\"host\":\"172.25.214.93:3930\",\"task_init_timestamp\":1720057267453395626,\"task_start_timestamp\":1720057267636056235,\"task_end_timestamp\":1720057296105609150,\"compile_start_timestamp\":1720057267453968200,\"compile_end_timestamp\":1720057267636048852,\"read_wait_index_start_timestamp\":0,\"read_wait_index_end_timestamp\":0,\"local_input_bytes\":0,\"remote_input_bytes\":0,\"output_bytes\":0,\"status\":\"FINISHED\",\"error_message\":\"DB::Exception: write to tunnel which is already closed,tunnel24+-1 grpc writes failed.\",\"working_time\":0,\"memory_peak\":0}"] [thread_id=1345778]
[2024/07/04 09:41:36.105 +08:00] [DEBUG] [MPPTask.cpp:54] ["MPPTask:<query:450902692276731912,task:24> finish MPPTask: <query:450902692276731912,task:24>"] [thread_id=1345778]
[2024/07/04 09:41:36.188 +08:00] [DEBUG] [<unknown>] ["MPPTask:<query:450902692276731912,task:24> ExchangeReceiver 13 -> 24 end! current alive connections: 0"] [thread_id=1345786]
[2024/07/04 09:41:36.193 +08:00] [DEBUG] [<unknown>] ["MPPTask:<query:450902692276731912,task:24> ExchangeReceiver All threads end in ExchangeReceiver"] [thread_id=1345786]
[2024/07/04 09:41:36.105 +08:00] [DEBUG] [MPPTask.cpp:54] ["MPPTask:<query:450902692276731912,task:24> finish MPPTask: <query:450902692276731912,task:24>"] [thread_id=1345778]
Shows the MPPTask is begin to destruct itself, and the last two log shows ExchangeReciver
is ready to be destructed.
The crash happens after this
[2024/07/04 09:41:36.722 +08:00] [ERROR] [<unknown>] [BaseDaemon:########################################] [thread_id=1346777]
My guess is NonJoinedBlockInputStream
only holds a reference of Join
, and will use it during fillColumns
. Join
is actually hold in HashJoinBuildBlockInputStream
and ExpressionBlockInputStream
at build side, since MPPTask is in destruct process, it is possible that the InputStreams that actually holds Join
is already destructed while NonJoinedBlockInputStream
still calling fillColumns
, so NonJoinedBlockInputStream
meet NULL pointer exception.
/report customer
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
We can not locate the SQL that raise this issue. This issue is created for reference.
TiFlash have reimplement
NonJoinedBlockInputStream
in v7.1, there should be no such issue happen again.2. What did you expect to see? (Required)
3. What did you see instead (Required)
4. What is your TiFlash version? (Required)
v5.4.3