sirikata / sirikata

Sirikata is a BSD-licensed platform for networked 3d environments
http://www.sirikata.com/

Infrequent hang in SST during destruction #514

Closed ewencp closed 11 years ago

ewencp commented 11 years ago

I sometimes get SST hanging trying to clean up. Here's the stack trace where it's hanging:

(gdb) bt
#0  0x00007fbc0ea2b848 in std::_Rb_tree_rebalance_for_erase(std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) () from /usr/lib/libstdc++.so.6
#1  0x00000000005c8541 in std::_Rb_tree<Sirikata::SST::EndPoint<Sirikata::SpaceObjectReference>, std::pair<Sirikata::SST::EndPoint<Sirikata::SpaceObjectReference> const, std::tr1::shared_ptr<Sirikata::SST::Connection<Sirikata::SpaceObjectReference> > >, std::_Select1st<std::pair<Sirikata::SST::EndPoint<Sirikata::SpaceObjectReference> const, std::tr1::shared_ptr<Sirikata::SST::Connection<Sirikata::SpaceObjectReference> > > >, std::less<Sirikata::SST::EndPoint<Sirikata::SpaceObjectReference> >, std::allocator<std::pair<Sirikata::SST::EndPoint<Sirikata::SpaceObjectReference> const, std::tr1::shared_ptr<Sirikata::SST::Connection<Sirikata::SpaceObjectReference> > > > >::erase (this=0x25d22c8, __position=...) at /usr/include/c++/4.4/bits/stl_tree.h:1347
#2  0x0000000000606697 in std::map<Sirikata::SST::EndPoint<Sirikata::SpaceObjectReference>, std::tr1::shared_ptr<Sirikata::SST::Connection<Sirikata::SpaceObjectReference> >, std::less<Sirikata::SST::EndPoint<Sirikata::SpaceObjectReference> >, std::allocator<std::pair<Sirikata::SST::EndPoint<Sirikata::SpaceObjectReference> const, std::tr1::shared_ptr<Sirikata::SST::Connection<Sirikata::SpaceObjectReference> > > > >::erase (this=0x25d22c8, __position=...)
    at /usr/include/c++/4.4/bits/stl_map.h:567
#3  0x000000000060451d in Sirikata::SST::Connection<Sirikata::SpaceObjectReference>::closeConnections (sstConnVars=0x25d2268)
    at /home/ewencp/sirikata.sirikata/libcore/include/sirikata/core/network/SSTImpl.hpp:1166
#4  0x0000000000608bfa in Sirikata::SST::ConnectionManager<Sirikata::SpaceObjectReference>::stop (this=0x25d2260)
    at /home/ewencp/sirikata.sirikata/libcore/include/sirikata/core/network/SSTImpl.hpp:2560
#5  0x00007fbc105f5ac7 in Sirikata::Context::shutdown (this=0x25daf60) at /home/ewencp/sirikata.sirikata/libcore/src/service/Context.cpp:141
#6  0x00007fbc105fbd4f in std::tr1::_Mem_fn<void (Sirikata::Context::*)()>::operator() (this=0x7fbbf4355f60, __object=0x25daf60)
    at /usr/include/c++/4.4/tr1_impl/functional:552
#7  0x00007fbc105fb4aa in std::tr1::result_of<std::tr1::_Mem_fn<void (Sirikata::Context::*)()> ()(std::tr1::result_of<std::tr1::_Mu<Sirikata::Context*, false, false> ()(Sirikata::Context*, std::tr1::tuple<>)>::type)>::type std::tr1::_Bind<std::tr1::_Mem_fn<void (Sirikata::Context::*)()> ()(Sirikata::Context*)>::__call<, 0>(std::tr1::_Mu<Sirikata::Context*, false, false> ( const&)(Sirikata::Context*, std::tr1::tuple<>), std::tr1::_Index_tuple<0>) (
    this=0x7fbbf4355f60, __args=...) at /usr/include/c++/4.4/tr1_impl/functional:1137
#8  0x00007fbc105fa645 in std::tr1::result_of<std::tr1::_Mem_fn<void (Sirikata::Context::*)()> ()(std::tr1::result_of<std::tr1::_Mu<Sirikata::Context*, false, false> ()(Sirikata::Context*, std::tr1::tuple<>)>::type)>::type std::tr1::_Bind<std::tr1::_Mem_fn<void (Sirikata::Context::*)()> ()(Sirikata::Context*)>::operator()<>() (this=0x7fbbf4355f60) at /usr/include/c++/4.4/tr1_impl/functional:1191
#9  0x00007fbc105f9391 in std::tr1::_Function_handler<void ()(), std::tr1::_Bind<std::tr1::_Mem_fn<void (Sirikata::Context::*)()> ()(Sirikata::Context*)> >::_M_invoke(std::tr1::_Any_data const&) (__functor=...) at /usr/include/c++/4.4/tr1_impl/functional:1668
#10 0x00007fbc104bca69 in std::tr1::function<void ()()>::operator()() const (this=0x7fffd64a1940) at /usr/include/c++/4.4/tr1_impl/functional:2024
#11 0x00007fbc105d8fce in void boost::asio::asio_handler_invoke<std::tr1::function<void ()()> >(std::tr1::function<void ()()>, ...) (function=...)
    at /home/ewencp/sirikata.sirikata/dependencies/installed-boost/include/boost/asio/handler_invoke_hook.hpp:62
#12 0x00007fbc105d8a32 in void boost_asio_handler_invoke_helpers::invoke<std::tr1::function<void ()()>, std::tr1::function<void ()()> >(std::tr1::function<void ()()> const&, std::tr1::function<void ()()>&) (function=..., context=...)
    at /home/ewencp/sirikata.sirikata/dependencies/installed-boost/include/boost/asio/detail/handler_invoke_helpers.hpp:41
#13 0x00007fbc105d9a85 in boost::asio::detail::handler_queue::handler_wrapper<std::tr1::function<void ()()> >::do_call(boost::asio::detail::handler_queue::handler*) (base=0x7fbbf4a0a7d0) at /home/ewencp/sirikata.sirikata/dependencies/installed-boost/include/boost/asio/detail/handler_queue.hpp:192
#14 0x00007fbc1055abe3 in boost::asio::detail::handler_queue::handler::invoke (this=0x7fbbf4a0a7d0)
    at /home/ewencp/sirikata.sirikata/dependencies/installed-boost/include/boost/asio/detail/handler_queue.hpp:39
#15 0x00007fbc10561249 in boost::asio::detail::task_io_service<boost::asio::detail::epoll_reactor<false> >::do_one (this=0x25d1bf0, lock=..., 
    this_idle_thread=0x7fffd64a1b20, ec=...)
    at /home/ewencp/sirikata.sirikata/dependencies/installed-boost/include/boost/asio/detail/task_io_service.hpp:268
#16 0x00007fbc1055dc21 in boost::asio::detail::task_io_service<boost::asio::detail::epoll_reactor<false> >::run (this=0x25d1bf0, ec=...)
    at /home/ewencp/sirikata.sirikata/dependencies/installed-boost/include/boost/asio/detail/task_io_service.hpp:103
#17 0x00007fbc1055b103 in boost::asio::io_service::run (this=0x25d64d0)
    at /home/ewencp/sirikata.sirikata/dependencies/installed-boost/include/boost/asio/impl/io_service.ipp:68
#18 0x00007fbc105d7047 in Sirikata::Network::IOService::run (this=0x25d2210) at /home/ewencp/sirikata.sirikata/libcore/src/network/IOService.cpp:118
#19 0x00007fbc105f52fd in Sirikata::Context::run (this=0x25daf60, nthreads=3, exthreads=Sirikata::Context::IncludeOriginal)
    at /home/ewencp/sirikata.sirikata/libcore/src/service/Context.cpp:106
#20 0x00000000005fd77a in main (argc=14, argv=0x7fffd64a2678) at /home/ewencp/sirikata.sirikata/space/src/main.cpp:372

There's nothing obviously wrong with the code or the stack trace. It's hung looping in the rebalancing method for the map. I think this is due to memory corruption because of different threads trying to operate on the map of connections at the same time.

If you look at the comment in Connection::closeConnections, we don't acquire a lock because this should only be happening during shutdown. It used to be called only in the ConnectionManager destructor, but we added a call in ConnectionManager::stop() because otherwise SST keeps everything from shutting down. As a result, it now runs during normal operation, with multiple threads active and possibly accessing SST streams/connections.

The obvious solution would be to just acquire the lock mentioned in the comment, but the comment also suggests we'd deadlock if we did. If it's safe to do so, we could just convert it to a recursive_mutex.
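
For illustration, here is a minimal sketch of the recursive_mutex idea. It is not the SST code: `Registry`, `Conn`, `noteClosed` and `closeAll` are hypothetical stand-ins, and it uses C++11 `std::recursive_mutex` rather than whatever mutex type SSTImpl.hpp actually uses. It assumes the deadlock comes from a connection's destructor re-entering code that takes the same lock on the same thread.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>

// Hypothetical stand-ins for the SST types; not the real Sirikata classes.
class Registry;

struct Conn {
    Registry* owner;
    explicit Conn(Registry* o) : owner(o) { assert(o != nullptr); }
    ~Conn();  // re-enters the registry, which takes the registry lock again
};

class Registry {
public:
    void add(int id) {
        std::lock_guard<std::recursive_mutex> lock(mMutex);
        mConns[id] = std::make_shared<Conn>(this);
    }

    // Reached from ~Conn while closeAll() already holds mMutex on this thread.
    // With a plain mutex this second acquisition would deadlock.
    void noteClosed() {
        std::lock_guard<std::recursive_mutex> lock(mMutex);
        ++mClosed;
    }

    // Every erase happens under the lock, so no other thread can touch the
    // tree while it is being rebalanced.
    void closeAll() {
        std::lock_guard<std::recursive_mutex> lock(mMutex);
        while (!mConns.empty()) {
            mConns.erase(mConns.begin());  // may run ~Conn, which re-locks
        }
    }

    std::size_t size() {
        std::lock_guard<std::recursive_mutex> lock(mMutex);
        return mConns.size();
    }

    int closed() {
        std::lock_guard<std::recursive_mutex> lock(mMutex);
        return mClosed;
    }

private:
    // Declared first so it outlives the map during destruction.
    std::recursive_mutex mMutex;
    int mClosed = 0;
    std::map<int, std::shared_ptr<Conn> > mConns;
};

inline Conn::~Conn() { owner->noteClosed(); }
```

A recursive mutex only removes the same-thread re-entry deadlock. It helps with the corruption only if every other code path that reads or writes the map takes the same lock.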

This happens fairly frequently with some code I've been testing, but the test isn't small and has some other issues. If there's a patch with a possible fix I can test it before it gets committed.

danielrh commented 11 years ago

You could try http://graphics.stanford.edu/~danielrh/microLockErase.patch or if that's too messy/complex the simpler

http://graphics.stanford.edu/~danielrh/scopedErase.patch

Basically, they release the lock during the reset period. Anything wrong/risky with that?
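
One way to read "release the lock during the reset period" is sketched below. This is a guess at the general pattern, not the contents of either patch: `ConnTable`, `Link`, `noteClosed` and `closeAll` are hypothetical names, and the sketch uses C++11 `std::mutex`. The map is modified only while the lock is held, and the `shared_ptr` is reset only after the lock is released, so a destructor that needs the lock cannot deadlock.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>

// Hypothetical stand-ins; not the real SST classes or the linked patches.
class ConnTable;

struct Link {
    ConnTable* owner;
    explicit Link(ConnTable* o) : owner(o) {}
    ~Link();  // takes the table's (non-recursive) lock
};

class ConnTable {
public:
    void add(int id) {
        std::shared_ptr<Link> fresh = std::make_shared<Link>(this);
        std::lock_guard<std::mutex> lock(mMutex);
        // Swap instead of assign: any previous Link ends up in 'fresh' and is
        // destroyed after 'lock' is released (locals die in reverse order).
        mLinks[id].swap(fresh);
    }

    void noteClosed() {
        std::lock_guard<std::mutex> lock(mMutex);
        ++mClosed;
    }

    void closeAll() {
        for (;;) {
            std::shared_ptr<Link> doomed;
            {
                std::lock_guard<std::mutex> lock(mMutex);
                if (mLinks.empty()) return;
                std::map<int, std::shared_ptr<Link> >::iterator it = mLinks.begin();
                doomed.swap(it->second);  // take ownership out of the map
                mLinks.erase(it);         // tree is modified only under the lock
            }
            doomed.reset();  // ~Link runs here, with the lock released
        }
    }

    std::size_t size() {
        std::lock_guard<std::mutex> lock(mMutex);
        return mLinks.size();
    }

    int closed() {
        std::lock_guard<std::mutex> lock(mMutex);
        return mClosed;
    }

private:
    // Declared first so it outlives the map during destruction.
    std::mutex mMutex;
    int mClosed = 0;
    std::map<int, std::shared_ptr<Link> > mLinks;
};

inline Link::~Link() { owner->noteClosed(); }
```

The trade-off is that other threads can observe the map between iterations, so entries added while `closeAll` is running are also closed, and the loop only ends once it sees the map empty.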

ewencp commented 11 years ago

@danielrh Thanks, I went with a very slightly different solution, but the gist is the same.