santazhang / simple-rpc

Simple RPC in C++, with Python binding
http://www.yzhang.net/simple-rpc/
BSD 3-Clause "New" or "Revised" License
27 stars 6 forks source link

Server crash on Ctrl-C #7

Closed santazhang closed 10 years ago

santazhang commented 10 years ago

Crash happened on rpcbench.

Modify the nop rpc to do some sleeping. When running the server, press Ctrl-C when client is sending messages. The rpcbench might crash due to SIGSEGV.

My suspicion is that the server side is not properly shutdown. The worker thread finishes one job and come back to Server data structure, but found it destroyed.

It's difficult to debug, because lldb is intercepting my Ctrl-C interrupts.

santazhang commented 10 years ago

lldb disable interrupt handling:

process handle --pass true --stop false --notify true SIGINT

santazhang commented 10 years ago

Decided to debug using GDB instead.

https://sourceware.org/gdb/onlinedocs/gdb/Signals.html

gdb --args ./build/rpcbench -s 0.0.0.0:8848 -e 1 -w 1

handle SIGINT noignore nostop
santazhang commented 10 years ago
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff67d5700 (LWP 16838)]
0x000000000040a4d5 in _M_find_before_node (__code=140737085706976, __k=<synthetic pointer>, __n=0, this=0x618430)
    at /usr/include/c++/4.8/bits/hashtable.h:1162
1162          __node_type* __p = static_cast<__node_type*>(__prev_p->_M_nxt);
(gdb) bt
#0  0x000000000040a4d5 in _M_find_before_node (__code=140737085706976, __k=<synthetic pointer>, __n=0, this=0x618430)
    at /usr/include/c++/4.8/bits/hashtable.h:1162
#1  _M_find_node (__c=140737085706976, __key=<synthetic pointer>, __bkt=0, this=0x618430)
    at /usr/include/c++/4.8/bits/hashtable.h:604
#2  find (__k=<synthetic pointer>, this=0x618430) at /usr/include/c++/4.8/bits/hashtable.h:1025
#3  find (__x=<synthetic pointer>, this=0x618430) at /usr/include/c++/4.8/bits/unordered_set.h:517
#4  rpc::PollMgr::PollThread::update_mode (this=0x618340, poll=poll@entry=0x7fffe80012e0, new_mode=new_mode@entry=3)
    at ../rpc/polling.cc:291
#5  0x000000000040a66e in rpc::PollMgr::update_mode (this=<optimized out>, poll=poll@entry=0x7fffe80012e0, 
    new_mode=new_mode@entry=3) at ../rpc/polling.cc:385
#6  0x000000000040c017 in rpc::ServerConnection::end_reply (this=0x7fffe80012e0) at ../rpc/server.cc:49
#7  0x0000000000405cf7 in operator() (__closure=0x7fffec107230) at ../test/benchmark_service.h:198
#8  std::_Function_handler<void (), benchmark::BenchmarkService::__nop__wrapper__(rpc::Request*, rpc::ServerConnection*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /usr/include/c++/4.8/functional:2071
#9  0x0000000000410078 in operator() (this=0x7fffec107200) at /usr/include/c++/4.8/functional:2464
#10 base::ThreadPool::run_thread (this=0x618230, id_in_pool=0) at ../base/threading.cc:167
#11 0x0000000000410317 in base::ThreadPool::start_thread_pool (args=0x6182b0) at ../base/threading.cc:52
#12 0x00007ffff7bc4f6e in start_thread (arg=0x7ffff67d5700) at pthread_create.c:311
#13 0x00007ffff70d19cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
santazhang commented 10 years ago

Pinpointed the cause: ServerConnection does not have a reference to PollMgr. When handle_write is called, ServerConnection need to use PollMgr to update epoll mode on its file descriptor.

Consider this situation: Server has a reference of ThreadPool (ref_count = 2) and PollMgr (ref_count = 1). When shutting down the server (~Server), the Server tries to:

threadpool_->release(); // 1
pollmgr_->release(); // 2

For 1, the threadpool_ will not be collected because it now will have ref_count=1. But for 2, the pollmgr_ will have ref_count=0 and will be collected. Now when a ServerConnection finishes handling RPC and tries to update its file descriptor, it will crash due to accessing destroyed pollmgr_.

So the solution would be:

  1. Add ref copy of PollMgr to ServerConnection.
  2. In ~Server(), explicitly wait till ServerConnection is destroyed.

Now about the Client, we need to check if it also has similar issues.

Client does not have reference on PollMgr (to prevent circular reference). We depend on users to carefully collect PollMgr instance.

I don't think Client will have the corruption issue, because the callback of Client does not need to send RPC (access PollMgr). ServerConnection crashes because it tries to have delayed access to PollMgr, which could be a stale/dead pointer.

santazhang commented 10 years ago

A bit of update:

We cannot add ref copy of PollMgr into ServerConnection. This will cause:

  1. circular reference
  2. deadlock on pthread_join when destroying ServerConnection: In IO loop, the ServerConnection being destroyed tries to collect PollMgr, which in turn tries to join the IO loop. This is causing the IO loop to join itself, which is a deadlock.

And we cannot easily revert the reference between ServerConnection (Pollable) and PollMgr. This is because PollMgr might have delayed access to Pollable -- when it is being removed from the poll loop. PollMgr just marks the pollable as "being destroyed", and only actually destroy it later after READ/WRITE operations in an IO loop finishes. So it has to hold a reference copy of Pollable.

So now it seems that the sane solution is to use some sort of semaphore to explicitly wait till ServerConnection is destroyed in ~Server().