vesoft-inc / nebula

A distributed, fast open-source graph database featuring horizontal scalability and high availability
https://nebula-graph.io
Apache License 2.0

Graphd server coredump when nebula-java time-out #5750

Open flymysql opened 11 months ago

flymysql commented 11 months ago

I have the same problem as in the links below. I am using the nebula-java client rather than nebula-go, but the call stack of the graphd crash is the same.

Describe the bug (required)

https://discuss.nebula-graph.com.cn/t/topic/10101/13
https://github.com/vesoft-inc/nebula/issues/4635

Thread 79 "graph-netio25" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 393143]
0x000000000673b4c8 in apache::thrift::transport::THeader::getSequenceNumber() const ()
(gdb) bt
#0  0x000000000673b4c8 in apache::thrift::transport::THeader::getSequenceNumber() const ()
#1  0x000000000673ba84 in apache::thrift::HeaderServerChannel::HeaderRequest::isOneway() const ()
#2  0x000000000673c002 in apache::thrift::Cpp2Connection::Cpp2Request::isOneway() const ()
#3  0x0000000006735785 in apache::thrift::Cpp2Connection::stop() ()
#4  0x0000000006738dfb in ?? ()
#5  0x000000000673afc5 in ?? ()
#6  0x000000000673a91e in ?? ()
#7  0x0000000006738fb9 in apache::thrift::Cpp2Connection::channelClosed(folly::exception_wrapper&&) ()
#8  0x000000000675c3a7 in apache::thrift::HeaderServerChannel::messageChannelEOF() ()
#9  0x00000000066a4f9e in apache::thrift::Cpp2Channel::processReadEOF() ()
#10 0x00000000066a4966 in apache::thrift::Cpp2Channel::readEOF(wangle::HandlerContext<int, std::pair<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, apache::thrift::transport::THeader*> >*) ()
#11 0x00000000066b91b0 in wangle::ContextImpl<apache::thrift::Cpp2Channel>::readEOF() ()
#12 0x00000000066b78f6 in wangle::ContextImpl<apache::thrift::FramingHandler>::fireReadEOF() ()
#13 0x00000000066bb047 in wangle::Handler<folly::IOBufQueue&, std::pair<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, std::unique_ptr<apache::thrift::transport::THeader, std::default_delete<apache::thrift::transport::THeader> > >, std::pair<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, apache::thrift::transport::THeader*>, std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> > >::readEOF(wangle::HandlerContext<std::pair<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, std::unique_ptr<apache::thrift::transport::THeader, std::default_delete<apache::thrift::transport::THeader> > >, std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> > >*) ()
#14 0x00000000066b80bc in wangle::ContextImpl<apache::thrift::FramingHandler>::readEOF() ()
#15 0x00000000066b5e18 in wangle::ContextImpl<apache::thrift::TAsyncTransportHandler>::fireReadEOF() ()
#16 0x00000000066a7355 in apache::thrift::TAsyncTransportHandler::readEOF() ()
#17 0x00000000069d30a9 in folly::AsyncSocket::handleRead() ()
#18 0x00000000069c7e30 in folly::AsyncSocket::ioReady(unsigned short) ()
#19 0x0000000006a9f3e4 in ?? ()
#20 0x0000000006a9fc9f in event_base_loop ()
#21 0x00000000069e1d95 in folly::EventBase::loopBody(int, bool) ()
#22 0x00000000069e267e in folly::EventBase::loop() ()
#23 0x00000000069e4f08 in folly::EventBase::loopForever() ()
#24 0x000000000696c799 in folly::IOThreadPoolExecutor::threadRun(std::shared_ptr<folly::ThreadPoolExecutor::Thread>) ()
#25 0x000000000697b0c5 in void folly::detail::function::FunctionTraits<void ()>::callSmall<std::_Bind<void (folly::ThreadPoolExecutor::*(folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)> >(folly::detail::function::Data&) ()
#26 0x0000000004618188 in folly::detail::function::FunctionTraits<void ()>::operator()() (this=0x7f34168112c0)
    at ../../../third_party_build/install/include/folly/Function.h:400
#27 0x00000000046a0446 in folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}::operator()() (__closure=0x7f34168112c0)
    at ../../../../third_party_build/install/include/folly/executors/thread_factory/NamedThreadFactory.h:40
#28 0x00000000046dffc7 in std::__invoke_impl<void, folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}>(std::__invoke_other, folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}&&) (__f=...)
    at /opt/buildtools/gcc-10.3.0/include/c++/10.3.0/bits/invoke.h:60
#29 0x00000000046dfa91 in std::__invoke<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}>(std::__invoke_result&&, (folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}&&)...) (__fn=...)
    at /opt/buildtools/gcc-10.3.0/include/c++/10.3.0/bits/invoke.h:95
#30 0x00000000046df78c in std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x7f34168112c0) at /opt/buildtools/gcc-10.3.0/include/c++/10.3.0/thread:264
#31 0x00000000046df3c8 in std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> >::operator()() (this=0x7f34168112c0) at /opt/buildtools/gcc-10.3.0/include/c++/10.3.0/thread:271
#32 0x00000000046ded0a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> > >::_M_run() (this=0x7f34168112b0) at /opt/buildtools/gcc-10.3.0/include/c++/10.3.0/thread:215
#33 0x0000000006fd915e in std::execute_native_thread_routine (__p=0x7f34168112b0) at ../../../.././libstdc++-v3/src/c++11/thread.cc:78
#34 0x00007f3416fb9f3b in ?? () from /usr/lib64/libpthread.so.0
#35 0x00007f3416ef1840 in clone () from /usr/lib64/libc.so.6

Your Environments (required)

* Commit id (e.g. `a3ffc7d8`)

version v3.2.1 nebula-graphd version Git: bb2e684


flymysql commented 11 months ago

The cause of this problem is that the apache::thrift::HeaderServerChannel::HeaderRequest::isOneway() function in the thrift library dereferences header_ without checking it for a null pointer.

It needs to be changed to:

    // Note: this function should perform a null-pointer check
    bool isOneway() const override {
      if (header_.get() == nullptr) {
        return true;
      }
      return header_->getSequenceNumber() == ONEWAY_REQUEST_ID;
    }
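
To illustrate the guard outside of fbthrift, here is a minimal, self-contained sketch. `Header`, `Request`, and `releaseHeader()` are hypothetical stand-ins, not the real fbthrift classes, and the value of `ONEWAY_REQUEST_ID` is an assumption. The sketch only shows that once the `std::unique_ptr` holding the header is empty, the guarded `isOneway()` returns instead of dereferencing null. It does not explain why `header_` ends up null in graphd, which this thread does not establish.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>

// Assumed sentinel value; check the fbthrift sources for the real constant.
constexpr std::uint32_t ONEWAY_REQUEST_ID = 0xffffffffu;

// Hypothetical stand-in for apache::thrift::transport::THeader.
struct Header {
  std::uint32_t seqId = 0;
  std::uint32_t getSequenceNumber() const { return seqId; }
};

// Hypothetical stand-in for HeaderServerChannel::HeaderRequest.
class Request {
 public:
  explicit Request(std::unique_ptr<Header> header)
      : header_(std::move(header)) {}

  // Guarded version: never dereferences an empty header_.
  bool isOneway() const {
    if (header_ == nullptr) {
      // Same fallback as the proposed change: treat the request as one-way,
      // so the caller does not try to send a reply for it.
      return true;
    }
    return header_->getSequenceNumber() == ONEWAY_REQUEST_ID;
  }

  // Simulates the header being handed off elsewhere, leaving header_ empty.
  std::unique_ptr<Header> releaseHeader() { return std::move(header_); }

 private:
  std::unique_ptr<Header> header_;
};
```

Returning `true` for a missing header is a design choice carried over from the snippet above: a request without a header cannot be answered anyway, so classifying it as one-way lets the connection teardown path skip it.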
flymysql commented 11 months ago

This is my patch:

diff -ur a/thrift/lib/cpp2/async/HeaderServerChannel.h b/thrift/lib/cpp2/async/HeaderServerChannel.h
--- a/thrift/lib/cpp2/async/HeaderServerChannel.h  2022-08-18 16:27:45.353299307 +0800
+++ b/thrift/lib/cpp2/async/HeaderServerChannel.h  2022-08-18 16:27:28.453299912 +0800
@@ -108,6 +108,10 @@
     }

     bool isOneway() const override {
+   if (header_.get() == nullptr) {
+     LOG(ERROR) << "header request is null";
+     return true;
+   }
       return header_->getSequenceNumber() == ONEWAY_REQUEST_ID;
     }
wey-gu commented 11 months ago

Amazing, @flymysql! Would you mind opening a PR against https://github.com/vesoft-inc/nebula-third-party/tree/master/project/patches ?

@dutor could you please take a look at this?

THANKS!

wey-gu commented 11 months ago

Seems already addressed by https://github.com/vesoft-inc/nebula-third-party/blob/release-3.3/project/patches/fbthrift-2021-11-29.patch ?

flymysql commented 11 months ago

> Seems already addressed by https://github.com/vesoft-inc/nebula-third-party/blob/release-3.3/project/patches/fbthrift-2021-11-29.patch ?

Oh, that's great. Since I'm still using v3.0, I hadn't noticed that it was already fixed.

dutor commented 11 months ago

This is a known issue, and we have fixed it with an ad hoc patch.