Closed santazhang closed 10 years ago
lldb disable interrupt handling:
process handle --pass true --stop false --notify true SIGINT
Decided to debug using GDB instead.
https://sourceware.org/gdb/onlinedocs/gdb/Signals.html
gdb --args ./build/rpcbench -s 0.0.0.0:8848 -e 1 -w 1
handle SIGINT noignore nostop
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff67d5700 (LWP 16838)]
0x000000000040a4d5 in _M_find_before_node (__code=140737085706976, __k=<synthetic pointer>, __n=0, this=0x618430)
at /usr/include/c++/4.8/bits/hashtable.h:1162
1162 __node_type* __p = static_cast<__node_type*>(__prev_p->_M_nxt);
(gdb) bt
#0 0x000000000040a4d5 in _M_find_before_node (__code=140737085706976, __k=<synthetic pointer>, __n=0, this=0x618430)
at /usr/include/c++/4.8/bits/hashtable.h:1162
#1 _M_find_node (__c=140737085706976, __key=<synthetic pointer>, __bkt=0, this=0x618430)
at /usr/include/c++/4.8/bits/hashtable.h:604
#2 find (__k=<synthetic pointer>, this=0x618430) at /usr/include/c++/4.8/bits/hashtable.h:1025
#3 find (__x=<synthetic pointer>, this=0x618430) at /usr/include/c++/4.8/bits/unordered_set.h:517
#4 rpc::PollMgr::PollThread::update_mode (this=0x618340, poll=poll@entry=0x7fffe80012e0, new_mode=new_mode@entry=3)
at ../rpc/polling.cc:291
#5 0x000000000040a66e in rpc::PollMgr::update_mode (this=<optimized out>, poll=poll@entry=0x7fffe80012e0,
new_mode=new_mode@entry=3) at ../rpc/polling.cc:385
#6 0x000000000040c017 in rpc::ServerConnection::end_reply (this=0x7fffe80012e0) at ../rpc/server.cc:49
#7 0x0000000000405cf7 in operator() (__closure=0x7fffec107230) at ../test/benchmark_service.h:198
#8 std::_Function_handler<void (), benchmark::BenchmarkService::__nop__wrapper__(rpc::Request*, rpc::ServerConnection*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /usr/include/c++/4.8/functional:2071
#9 0x0000000000410078 in operator() (this=0x7fffec107200) at /usr/include/c++/4.8/functional:2464
#10 base::ThreadPool::run_thread (this=0x618230, id_in_pool=0) at ../base/threading.cc:167
#11 0x0000000000410317 in base::ThreadPool::start_thread_pool (args=0x6182b0) at ../base/threading.cc:52
#12 0x00007ffff7bc4f6e in start_thread (arg=0x7ffff67d5700) at pthread_create.c:311
#13 0x00007ffff70d19cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
Pinpointed the cause: ServerConnection does not have a reference to PollMgr. When handle_write is called, ServerConnection need to use PollMgr to update epoll mode on its file descriptor.
Consider this situation: Server has a reference of ThreadPool (ref_count = 2) and PollMgr (ref_count = 1). When shutting down the server (~Server), the Server tries to:
threadpool_->release(); // 1
pollmgr_->release(); // 2
For 1, the threadpool_
will not be collected because it now will have ref_count=1
. But for 2, the pollmgr_
will have ref_count=0
and will be collected. Now when a ServerConnection finishes handling RPC and tries to update its file descriptor, it will crash due to accessing destroyed pollmgr_
.
So the solution would be:
Now about the Client, we need to check if it also has similar issues.
Client does not have reference on PollMgr (to prevent circular reference). We depend on users to carefully collect PollMgr instance.
I don't think Client will have the corruption issue, because the callback of Client does not need to send RPC (access PollMgr). ServerConnection crashes because it tries to have delayed access to PollMgr, which could be a stale/dead pointer.
A bit of update:
We cannot add ref copy of PollMgr into ServerConnection. This will cause:
And we cannot easily revert the reference between ServerConnection (Pollable) and PollMgr. This is because PollMgr might have delayed access to Pollable -- when it is being removed from the poll loop. PollMgr just marks the pollable as "being destroyed", and only actually destroy it later after READ/WRITE operations in an IO loop finishes. So it has to hold a reference copy of Pollable.
So now it seems that the sane solution is to use some sort of semaphore to explicitly wait till ServerConnection is destroyed in ~Server().
Crash happened on rpcbench.
Modify the nop rpc to do some sleeping. When running the server, press Ctrl-C when client is sending messages. The rpcbench might crash due to SIGSEGV.
My suspicion is that the server side is not properly shutdown. The worker thread finishes one job and come back to Server data structure, but found it destroyed.
It's difficult to debug, because lldb is intercepting my Ctrl-C interrupts.