Closed rjpower closed 10 years ago
I'm seeing consistent deadlock issues which seem to be the result of a data-race between ServerConnection::handle_write and ::begin_reply. If I comment out this unlock/lock pair:
Marshal::read_barrier barrier = out_.get_read_barrier(); // out_l_.unlock(); out_.write_to_fd(socket_, barrier, rate); // out_l_.lock();
Everything works perfectly.
Helgrind reports this:
==13723== Possible data race during read of size 8 at 0xA3627F8 by thread #12 ==13723== Locks held: none ==13723== at 0x6CA10C5: rpc::FastMarshal::chunk::write_to_fd(int, unsigned long) (marshal.h:170) ==13723== by 0x6CA0108: rpc::FastMarshal::write_to_fd(int, rpc::FastMarshal::read_barrier const&, rpc::io_ratelimit const&) (marshal.cc:297) ==13723== by 0x6CA66A7: rpc::ServerConnection::handle_write(rpc::io_ratelimit const&) (server.cc:122) ==13723== by 0x6CA1912: rpc::PollMgr::PollThread::poll_loop() (polling.cc:152) ==13723== by 0x6CA2529: rpc::PollMgr::PollThread::start_poll_loop(void*) (polling.cc:45) ==13723== by 0x4C2EF9C: mythread_wrapper (hg_intercepts.c:219) ==13723== by 0x4E3DF8D: start_thread (pthread_create.c:311) ==13723== by 0x5A6FE1C: clone (clone.S:113) ==13723== ==13723== This conflicts with a previous write of size 8 by thread #14 ==13723== Locks held: 1, at address 0xA331D88 ==13723== at 0x6CA09D0: rpc::FastMarshal::chunk::set_bookmark() (marshal.h:91) ==13723== by 0x6CA031A: rpc::FastMarshal::set_bookmark(unsigned long) (marshal.cc:333) ==13723== by 0x6CA6113: rpc::ServerConnection::begin_reply(rpc::Request*, int) (server.cc:25) ==13723== by 0x6A3184F: spartan::WorkerService::__run_kernel__wrapper__(rpc::Request*, rpc::ServerConnection*)::{lambda()#1}::operator()() const (spartan_service.h:438) ==13723== by 0x6A355F9: std::_Function_handler<void (), spartan::WorkerService::__run_kernel__wrapper__(rpc::Request*, rpc::ServerConnection*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (functional:1925) ==13723== by 0x6CACA4B: std::function<void ()>::operator()() const (functional:2310) ==13723== by 0x6CAB957: rpc::ThreadPool::run_thread(int) (utils.cc:70) ==13723== by 0x6CAB3DF: rpc::ThreadPool::start_thread_pool(void*) (utils.cc:25) ==13723== ==13723== Address 0xA3627F8 is 24 bytes inside a block of size 48 alloc'd ==13723== at 0x4C2C2D1: operator new(unsigned long) (vg_replace_malloc.c:298) ==13723== by 0x6CA0288: rpc::FastMarshal::set_bookmark(unsigned long) (marshal.cc:327) ==13723== by 0x6CA6113: rpc::ServerConnection::begin_reply(rpc::Request*, int) (server.cc:25) ==13723== by 0x6A31183: spartan::WorkerService::__initialize__wrapper__(rpc::Request*, rpc::ServerConnection*) (spartan_service.h:390) ==13723== by 0x6A42CBC: void rpc::Server::reg<spartan::WorkerService>(int, spartan::WorkerService*, void (spartan::WorkerService::*)(rpc::Request*, rpc::ServerConnection*))::H::handle(rpc::Request*, rpc::ServerConnection*) (server.h:201) ==13723== by 0x6CA6543: rpc::ServerConnection::handle_read() (server.cc:103) ==13723== by 0x6CA18B8: rpc::PollMgr::PollThread::poll_loop() (polling.cc:149) ==13723== by 0x6CA2529: rpc::PollMgr::PollThread::start_poll_loop(void*) (polling.cc:45) ==13723== by 0x4C2EF9C: mythread_wrapper (hg_intercepts.c:219) ==13723== by 0x4E3DF8D: start_thread (pthread_create.c:311) ==13723== by 0x5A6FE1C: clone (clone.S:113)
Thanks Russell. I just saw your issue report because Github doesn't send me notifications...
I've located the problem and will fix it quickly. Thanks! :D
I'm seeing consistent deadlock issues which seem to be the result of a data-race between ServerConnection::handle_write and ::begin_reply. If I comment out this unlock/lock pair:
Everything works perfectly.
Helgrind reports this: