pingcap / tiflash

The analytical engine for TiDB and TiDB Cloud. Try free: https://tidbcloud.com/free-trial
https://docs.pingcap.com/tidb/stable/tiflash-overview
Apache License 2.0
944 stars 410 forks source link

tiflash crash in NonJoinedBlockInputStream #9190

Closed JaySon-Huang closed 3 months ago

JaySon-Huang commented 3 months ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

We can not locate the SQL that raise this issue. This issue is created for reference.

TiFlash have reimplement NonJoinedBlockInputStream in v7.1, there should be no such issue happen again.

2. What did you expect to see? (Required)

3. What did you see instead (Required)

[2024/07/04 10:32:08.429 +08:00] [ERROR] [<unknown>] [BaseDaemon:########################################] [thread_id=117399]
[2024/07/04 10:32:08.429 +08:00] [ERROR] [<unknown>] ["BaseDaemon:(from thread 117398) Received signal Segmentation fault (11)."] [thread_id=117399]
[2024/07/04 10:32:08.429 +08:00] [ERROR] [<unknown>] ["BaseDaemon:Address: NULL pointer."] [thread_id=117399]
[2024/07/04 10:32:08.429 +08:00] [ERROR] [<unknown>] ["BaseDaemon:Access: read."] [thread_id=117399]
[2024/07/04 10:32:08.429 +08:00] [ERROR] [<unknown>] ["BaseDaemon:Address not mapped to object."] [thread_id=117399]
[2024/07/04 10:32:08.471 +08:00] [ERROR] [<unknown>] ["BaseDaemon:0. bin/tiflash/tiflash(unsigned long DB::NonJoinedBlockInputStream::fillColumns<(DB::ASTTableJoin::Strictness)2, ConcurrentHashMapTable<StringRef, HashMapCellWithSavedHash<StringRef, DB::Join::WithUsedFlag<true, DB::Join::RowRefList>, DefaultHash<StringRef, void>, HashTableNoState>, DefaultHash<StringRef, void>, HashTableGrower<8ul>, Allocator<true> > >(ConcurrentHashMapTable<StringRef, HashMapCellWithSavedHash<StringRef, DB::Join::WithUsedFlag<true, DB::Join::RowRefList>, DefaultHash<StringRef, void>, HashTableNoState>, DefaultHash<StringRef, void>, HashTableGrower<8ul>, Allocator<true> > const&, unsigned long, std::vector<COWPtr<DB::IColumn>::mutable_ptr<DB::IColumn>, std::allocator<COWPtr<DB::IColumn>::mutable_ptr<DB::IColumn> > >&, unsigned long, std::vector<COWPtr<DB::IColumn>::mutable_ptr<DB::IColumn>, std::allocator<COWPtr<DB::IColumn>::mutable_ptr<DB::IColumn> > >&)+0x343) [0x79a63f3]"] [thread_id=117399]
[2024/07/04 10:32:08.471 +08:00] [ERROR] [<unknown>] ["BaseDaemon:1. bin/tiflash/tiflash(DB::Block DB::NonJoinedBlockInputStream::createBlock<(DB::ASTTableJoin::Strictness)2, DB::Join::MapsTemplate<DB::Join::WithUsedFlag<true, DB::Join::RowRefList> > >(DB::Join::MapsTemplate<DB::Join::WithUsedFlag<true, DB::Join::RowRefList> > const&)+0x473) [0x79a8a23]"] [thread_id=117399]
[2024/07/04 10:32:08.471 +08:00] [ERROR] [<unknown>] ["BaseDaemon:2. bin/tiflash/tiflash(DB::NonJoinedBlockInputStream::readImpl()+0xe2) [0x79a8ce2]"] [thread_id=117399]
[2024/07/04 10:32:08.471 +08:00] [ERROR] [<unknown>] ["BaseDaemon:3. bin/tiflash/tiflash(DB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator<false>, 15ul, 16ul>*&, bool)+0x3bd) [0x6d97d9d]"] [thread_id=117399]
[2024/07/04 10:32:08.471 +08:00] [ERROR] [<unknown>] ["BaseDaemon:4. bin/tiflash/tiflash(DB::IProfilingBlockInputStream::read()+0x17) [0x6d97f87]"] [thread_id=117399]
[2024/07/04 10:32:08.471 +08:00] [ERROR] [<unknown>] ["BaseDaemon:5. bin/tiflash/tiflash(DB::ExpressionBlockInputStream::readImpl()+0x1b) [0x7ad074b]"] [thread_id=117399]
[2024/07/04 10:32:08.471 +08:00] [ERROR] [<unknown>] ["BaseDaemon:6. bin/tiflash/tiflash(DB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator<false>, 15ul, 16ul>*&, bool)+0x3bd) [0x6d97d9d]"] [thread_id=117399]
[2024/07/04 10:32:08.472 +08:00] [ERROR] [<unknown>] ["BaseDaemon:7. bin/tiflash/tiflash(DB::IProfilingBlockInputStream::read()+0x17) [0x6d97f87]"] [thread_id=117399]
[2024/07/04 10:32:08.472 +08:00] [ERROR] [<unknown>] ["BaseDaemon:8. bin/tiflash/tiflash(DB::ExpressionBlockInputStream::readImpl()+0x1b) [0x7ad074b]"] [thread_id=117399]
[2024/07/04 10:32:08.472 +08:00] [ERROR] [<unknown>] ["BaseDaemon:9. bin/tiflash/tiflash(DB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator<false>, 15ul, 16ul>*&, bool)+0x3bd) [0x6d97d9d]"] [thread_id=117399]
[2024/07/04 10:32:08.472 +08:00] [ERROR] [<unknown>] ["BaseDaemon:10. bin/tiflash/tiflash(DB::IProfilingBlockInputStream::read()+0x17) [0x6d97f87]"] [thread_id=117399]
[2024/07/04 10:32:08.472 +08:00] [ERROR] [<unknown>] ["BaseDaemon:11. bin/tiflash/tiflash(DB::ExchangeSender::readImpl()+0x1b) [0x7d059db]"] [thread_id=117399]
[2024/07/04 10:32:08.472 +08:00] [ERROR] [<unknown>] ["BaseDaemon:12. bin/tiflash/tiflash(DB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator<false>, 15ul, 16ul>*&, bool)+0x3bd) [0x6d97d9d]"] [thread_id=117399]
[2024/07/04 10:32:08.472 +08:00] [ERROR] [<unknown>] ["BaseDaemon:13. bin/tiflash/tiflash(DB::IProfilingBlockInputStream::read()+0x17) [0x6d97f87]"] [thread_id=117399]
[2024/07/04 10:32:08.472 +08:00] [ERROR] [<unknown>] ["BaseDaemon:14. bin/tiflash/tiflash(DB::ParallelInputsProcessor<DB::UnionBlockInputStream<(DB::StreamUnionMode)0, false>::Handler, (DB::StreamUnionMode)0>::loop(unsigned long)+0x101) [0x74f0101]"] [thread_id=117399]
[2024/07/04 10:32:08.472 +08:00] [ERROR] [<unknown>] ["BaseDaemon:15. bin/tiflash/tiflash(DB::ParallelInputsProcessor<DB::UnionBlockInputStream<(DB::StreamUnionMode)0, false>::Handler, (DB::StreamUnionMode)0>::thread(unsigned long)+0x1f0) [0x74f09d0]"] [thread_id=117399]
[2024/07/04 10:32:08.472 +08:00] [ERROR] [<unknown>] ["BaseDaemon:16. bin/tiflash/tiflash(_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN2DB13ThreadFactory9newThreadISt8functionIFvvEEJEEES_bNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEOT_DpOT0_EUlDpOT_E_EEEEE6_M_runEv+0x63) [0x7dca0a3]"] [thread_id=117399]
[2024/07/04 10:32:08.472 +08:00] [ERROR] [<unknown>] ["BaseDaemon:17. bin/tiflash/tiflash() [0x8b11bef]"] [thread_id=117399]
[2024/07/04 10:32:08.472 +08:00] [ERROR] [<unknown>] ["BaseDaemon:18. /lib64/libpthread.so.0(+0x7ea5) [0x7fada1b9cea5]"] [thread_id=117399]

4. What is your TiFlash version? (Required)

v5.4.3

windtalker commented 3 months ago

Some details about this crash The query pipeline is

CreatingSets
 Union
  HashJoinBuildBlockInputStream × 32
   Expression
    Expression
     Squashing
      TiRemoteBlockInputStream(ExchangeReceiver)
 Union
  SharedQuery × 32
   Union
    ExchangeSender × 32
     Expression
      Expression
       Expression
        Expression
         Expression
          Squashing
           TiRemoteBlockInputStream(ExchangeReceiver)
    Union
     ExchangeSender × 32
      Expression
       Expression
        NonJoined

The task already finish build hash table, and meet error when trying to send some results to TiDB(meet grpc writes fails, I suspect this fail is because TiDB close the connection first). After the task meet error, it goes to error handling process, which is as expected

[2024/07/04 09:41:35.906 +08:00] [ERROR] [MPPTask.cpp:334] ["MPPTask:<query:450902692276731912,task:10> task running meets error: DB::Exception: has finished, tunnel10+27 grpc writes failed. Stack Trace : 0. bin/tiflash/tiflash(StackTrace::StackTrace()+0x16) [0x36d5546]\n1. bin/tiflash/tiflash(DB::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)+0x26) [0x36c9376]\n2. bin/tiflash/tiflash(DB::MPPTunnelBase<grpc_impl::ServerWriter<mpp::MPPDataPacket> >::writeDone()+0x281f) [0x7c9c16f]\n3. bin/tiflash/tiflash(DB::MPPTask::finishWrite()+0x21) [0x7c89111]\n4. bin/tiflash/tiflash(DB::MPPTask::runImpl()+0x399) [0x7c8e119]\n5. bin/tiflash/tiflash(_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN2DB13ThreadFactory9newThreadISt8functionIFvvEEJEEES_bNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEOT_DpOT0_EUlDpOT_E_EEEEE6_M_runEv+0x63) [0x7dca0a3]\n6. bin/tiflash/tiflash() [0x8b11bef]\n7. /lib64/libpthread.so.0(+0x7ea5) [0x7f22f1d73ea5]\n8. /lib64/libc.so.6(clone+0x6d) [0x7f22f179a96d]\n"] [thread_id=1345780]
[2024/07/04 09:41:35.999 +08:00] [DEBUG] [<unknown>] ["MPPTask:<query:450902692276731912,task:24> ExchangeReceiver finish read : sender_meta {\n  start_ts: 450902692276731912\n  task_id: 14\n  address: \"172.25.215.86:3930\"\n}\nreceiver_meta {\n  start_ts: 450902692276731912\n  task_id: 24\n  address: \"172.25.214.93:3930\"\n}\n"] [thread_id=1345787]
[2024/07/04 09:41:35.999 +08:00] [DEBUG] [<unknown>] ["MPPTask:<query:450902692276731912,task:24> ExchangeReceiver 14 -> 24 end! current alive connections: 1"] [thread_id=1345787]
[2024/07/04 09:41:36.012 +08:00] [INFO] [MPPTask.cpp:369] ["MPPTask:<query:450902692276731912,task:10> task ends, time cost is 28558 ms."] [thread_id=1345780]
[2024/07/04 09:41:36.012 +08:00] [DEBUG] [<unknown>] ["MPPTask:<query:450902692276731912,task:10> task unregistered"] [thread_id=1345780]
[2024/07/04 09:41:36.012 +08:00] [INFO] [<unknown>] ["MPPTask:<query:450902692276731912,task:10> finish task"] [thread_id=1345780]
[2024/07/04 09:41:36.012 +08:00] [INFO] [MPPTaskStatistics.cpp:113] ["mpp_task_tracing:<query:450902692276731912,task:10> {\"query_tso\":450902692276731912,\"task_id\":10,\"is_root\":false,\"sender_executor_id\":\"ExchangeSender_76\",\"executors\":[{\"id\":\"ExchangeSender_76\",\"type\":\"ExchangeSender\",\"children\":[\"TableFullScan_75\"],\"outbound_rows\":0,\"outbound_blocks\":0,\"outbound_bytes\":0,\"execution_time_ns\":0,\"partition_num\":7,\"sender_target_task_ids\":[22,23,24,25,26,27,28],\"exchange_type\":\"Hash\",\"connection_details\":[{\"tunnel_id\":\"tunnel10+22\",\"sender_target_task_id\":22,\"sender_target_host\":\"172.25.215.88:3930\",\"is_local\":false,\"packets\":0,\"bytes\":0},{\"tunnel_id\":\"tunnel10+23\",\"sender_target_task_id\":23,\"sender_target_host\":\"172.25.215.89:3930\",\"is_local\":false,\"packets\":0,\"bytes\":0},{\"tunnel_id\":\"tunnel10+24\",\"sender_target_task_id\":24,\"sender_target_host\":\"172.25.214.93:3930\",\"is_local\":true,\"packets\":0,\"bytes\":0},{\"tunnel_id\":\"tunnel10+25\",\"sender_target_task_id\":25,\"sender_target_host\":\"172.25.214.92:3930\",\"is_local\":false,\"packets\":0,\"bytes\":0},{\"tunnel_id\":\"tunnel10+26\",\"sender_target_task_id\":26,\"sender_target_host\":\"172.25.215.87:3930\",\"is_local\":false,\"packets\":0,\"bytes\":0},{\"tunnel_id\":\"tunnel10+27\",\"sender_target_task_id\":27,\"sender_target_host\":\"172.25.214.94:3930\",\"is_local\":false,\"packets\":0,\"bytes\":0},{\"tunnel_id\":\"tunnel10+28\",\"sender_target_task_id\":28,\"sender_target_host\":\"172.25.215.86:3930\",\"is_local\":false,\"packets\":0,\"bytes\":0}]},{\"id\":\"TableFullScan_75\",\"type\":\"TableScan\",\"children\":[],\"outbound_rows\":0,\"outbound_blocks\":0,\"outbound_bytes\":0,\"execution_time_ns\":0,\"connection_details\":[{\"is_local\":true,\"packets\":0,\"bytes\":0},{\"is_local\":false,\"packets\":0,\"bytes\":0}]}],\"host\":\"172.25.214.93:3930\",\"task_init_timestamp\":1720057267453588376,\"task_start_timestamp\":1720057267548268517,\"task_end_timestamp\":1720057296012583062,\"compile_start_timestamp\":1720057267458401888,\"compile_end_timestamp\":1720057267548266435,\"read_wait_index_start_timestamp\":1720057267484182749,\"read_wait_index_end_timestamp\":1720057267500841767,\"local_input_bytes\":0,\"remote_input_bytes\":0,\"output_bytes\":0,\"status\":\"FINISHED\",\"error_message\":\"DB::Exception: has finished, tunnel10+27 grpc writes failed.\",\"working_time\":0,\"memory_peak\":0}"] [thread_id=1345780]
[2024/07/04 09:41:36.012 +08:00] [DEBUG] [MPPTask.cpp:54] ["MPPTask:<query:450902692276731912,task:10> finish MPPTask: <query:450902692276731912,task:10>"] [thread_id=1345780]
[2024/07/04 09:41:36.092 +08:00] [ERROR] [MPPTask.cpp:334] ["MPPTask:<query:450902692276731912,task:24> task running meets error: DB::Exception: write to tunnel which is already closed,tunnel24+-1 grpc writes failed. Stack Trace : 0. bin/tiflash/tiflash(StackTrace::StackTrace()+0x16) [0x36d5546]\n1. bin/tiflash/tiflash(DB::Exception::Exception(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)+0x26) [0x36c9376]\n2. bin/tiflash/tiflash(DB::SharedQueryBlockInputStream::readImpl()+0x410) [0x3717030]\n3. bin/tiflash/tiflash(DB::IProfilingBlockInputStream::read(DB::PODArray<unsigned char, 4096ul, Allocator<false>, 15ul, 16ul>*&, bool)+0x3bd) [0x6d97d9d]\n4. bin/tiflash/tiflash(DB::IProfilingBlockInputStream::read()+0x17) [0x6d97f87]\n5. bin/tiflash/tiflash(DB::ParallelInputsProcessor<DB::UnionBlockInputStream<(DB::StreamUnionMode)0, true>::Handler, (DB::StreamUnionMode)0>::loop(unsigned long)+0x143) [0x7c6a873]\n6. bin/tiflash/tiflash(DB::ParallelInputsProcessor<DB::UnionBlockInputStream<(DB::StreamUnionMode)0, true>::Handler, (DB::StreamUnionMode)0>::thread(unsigned long)+0x1cc) [0x7c6b0dc]\n7. bin/tiflash/tiflash(_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN2DB13ThreadFactory9newThreadISt8functionIFvvEEJEEES_bNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEOT_DpOT0_EUlDpOT_E_EEEEE6_M_runEv+0x63) [0x7dca0a3]\n8. bin/tiflash/tiflash() [0x8b11bef]\n9. /lib64/libpthread.so.0(+0x7ea5) [0x7f22f1d73ea5]\n10. /lib64/libc.so.6(clone+0x6d) [0x7f22f179a96d]\n"] [thread_id=1345778]
[2024/07/04 09:41:36.105 +08:00] [INFO] [MPPTask.cpp:369] ["MPPTask:<query:450902692276731912,task:24> task ends, time cost is 28651 ms."] [thread_id=1345778]
[2024/07/04 09:41:36.105 +08:00] [DEBUG] [<unknown>] ["MPPTask:<query:450902692276731912,task:24> task unregistered"] [thread_id=1345778]
[2024/07/04 09:41:36.105 +08:00] [INFO] [<unknown>] ["MPPTask:<query:450902692276731912,task:24> finish task"] [thread_id=1345778]
[2024/07/04 09:41:36.105 +08:00] [INFO] [MPPTaskStatistics.cpp:113] ["mpp_task_tracing:<query:450902692276731912,task:24> {\"query_tso\":450902692276731912,\"task_id\":24,\"is_root\":true,\"sender_executor_id\":\"ExchangeSender_81\",\"executors\":[{\"id\":\"ExchangeReceiver_77\",\"type\":\"ExchangeReceiver\",\"children\":[],\"outbound_rows\":0,\"outbound_blocks\":0,\"outbound_bytes\":0,\"execution_time_ns\":0,\"partition_num\":7,\"receiver_source_task_ids\":[8,9,10,11,12,13,14],\"connection_details\":[{\"receiver_source_task_id\":8,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":9,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":10,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":11,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":12,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":13,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":14,\"packets\":0,\"bytes\":0}]},{\"id\":\"ExchangeReceiver_80\",\"type\":\"ExchangeReceiver\",\"children\":[],\"outbound_rows\":0,\"outbound_blocks\":0,\"outbound_bytes\":0,\"execution_time_ns\":0,\"partition_num\":7,\"receiver_source_task_ids\":[15,16,17,18,19,20,21],\"connection_details\":[{\"receiver_source_task_id\":15,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":16,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":17,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":18,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":19,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":20,\"packets\":0,\"bytes\":0},{\"receiver_source_task_id\":21,\"packets\":0,\"bytes\":0}]},{\"id\":\"ExchangeSender_81\",\"type\":\"ExchangeSender\",\"children\":[\"HashJoin_72\"],\"outbound_rows\":0,\"outbound_blocks\":0,\"outbound_bytes\":0,\"execution_time_ns\":0,\"partition_num\":1,\"sender_target_task_ids\":[-1],\"exchange_type\":\"PassThrough\",\"connection_details\":[{\"tunnel_id\":\"tunnel24+-1\",\"sender_target_task_id\":-1,\"sender_target_host\":\"172.25.215.15:33324\",\"is_local\":false,\"packets\":0,\"bytes\":0}]},{\"id\":\"HashJoin_72\",\"type\":\"Join\",\"children\":[\"ExchangeReceiver_77\",\"ExchangeReceiver_80\"],\"outbound_rows\":0,\"outbound_blocks\":0,\"outbound_bytes\":0,\"execution_time_ns\":0,\"hash_table_bytes\":0,\"build_side_child\":\"\",\"non_joined_outbound_rows\":0,\"non_joined_outbound_blocks\":0,\"non_joined_outbound_bytes\":0,\"non_joined_execution_time_ns\":0}],\"host\":\"172.25.214.93:3930\",\"task_init_timestamp\":1720057267453395626,\"task_start_timestamp\":1720057267636056235,\"task_end_timestamp\":1720057296105609150,\"compile_start_timestamp\":1720057267453968200,\"compile_end_timestamp\":1720057267636048852,\"read_wait_index_start_timestamp\":0,\"read_wait_index_end_timestamp\":0,\"local_input_bytes\":0,\"remote_input_bytes\":0,\"output_bytes\":0,\"status\":\"FINISHED\",\"error_message\":\"DB::Exception: write to tunnel which is already closed,tunnel24+-1 grpc writes failed.\",\"working_time\":0,\"memory_peak\":0}"] [thread_id=1345778]
[2024/07/04 09:41:36.105 +08:00] [DEBUG] [MPPTask.cpp:54] ["MPPTask:<query:450902692276731912,task:24> finish MPPTask: <query:450902692276731912,task:24>"] [thread_id=1345778]
[2024/07/04 09:41:36.188 +08:00] [DEBUG] [<unknown>] ["MPPTask:<query:450902692276731912,task:24> ExchangeReceiver 13 -> 24 end! current alive connections: 0"] [thread_id=1345786]
[2024/07/04 09:41:36.193 +08:00] [DEBUG] [<unknown>] ["MPPTask:<query:450902692276731912,task:24> ExchangeReceiver All threads end in ExchangeReceiver"] [thread_id=1345786]
[2024/07/04 09:41:36.105 +08:00] [DEBUG] [MPPTask.cpp:54] ["MPPTask:<query:450902692276731912,task:24> finish MPPTask: <query:450902692276731912,task:24>"] [thread_id=1345778]

Shows the MPPTask is begin to destruct itself, and the last two log shows ExchangeReciver is ready to be destructed. The crash happens after this

[2024/07/04 09:41:36.722 +08:00] [ERROR] [<unknown>] [BaseDaemon:########################################] [thread_id=1346777]

My guess is NonJoinedBlockInputStream only holds a reference of Join, and will use it during fillColumns. Join is actually hold in HashJoinBuildBlockInputStream and ExpressionBlockInputStream at build side, since MPPTask is in destruct process, it is possible that the InputStreams that actually holds Join is already destructed while NonJoinedBlockInputStream still calling fillColumns, so NonJoinedBlockInputStream meet NULL pointer exception.

seiya-annie commented 2 months ago

/report customer