Fix occasional crash during shutdown when explicitly calling ros::start but not ros::shutdown

dgossow commented 1 year ago

Summary

In the case that user code calls ros::start explicitly, the ros::NodeHandle class will not call ros::shutdown when the last instance is destroyed.

In order to make sure that the program can exit error-free nonetheless, currently ros::init registers the ros::atexitCallback callback to be executed after the user's main has finished, which performs clean-up of resources that were set up in init.

However, atexitCallback is registered before the various ROS singletons are instantiated, which means that it is called after the singletons have been destroyed.

This PR addressed most of the problem, however with one remaining issue: ros::start creates a thread, which will continue to execute after singletons have been destroyed and until ros::atexitCallback is called. So, in rare circumstances, this thread can access singleton objects after they have been destroyed.

This PR fixes the issue by ensuring that ros::shutdown is called after main has returned, but before singletons are destroyed.

Detailed Description

We came across a quite rare crash of ROS nodes during shutdown at my company with the following backtrace:

main thread:

#0  0x00007f0076baead3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x29708d8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  0x00007f0076baead3 in __pthread_cond_wait_common (abstime=0x0, mutex=0x2970888, cond=0x29708b0) at pthread_cond_wait.c:502
#2  0x00007f0076baead3 in __pthread_cond_wait (cond=0x29708b0, mutex=0x2970888) at pthread_cond_wait.c:655
#3  0x00007f00773f1b5d in boost::condition_variable::wait(boost::unique_lock<boost::mutex>&) () at /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.65.1
#4  0x00007f00773e9d14 in boost::thread::join_noexcept() () at /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.65.1
#5  0x00007f00ae60784d in ros::shutdown() () at /opt/ros/melodic/lib/libroscpp.so
#6  0x00007f00ae607a49 in ros::atexitCallback() () at /opt/ros/melodic/lib/libroscpp.so
#7  0x00007f007362f031 in __run_exit_handlers (status=0, listp=0x7f00739d7718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#8  0x00007f007362f12a in __GI_exit (status=<optimized out>) at exit.c:139
#9  0x00007f007360dc8e in __libc_start_main (main=0x419a90 <main(int, char**)>, argc=27, argv=0x7ffdf94ccdc8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffdf94ccdb8) at ../csu/libc-start.c:344
#10 0x00000000004176ca in _start () at /usr/include/boost/asio/ssl/impl/error.ipp:89

second thread:

terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::lock_error> >'
  what():  boost: mutex lock failed in pthread_mutex_lock: Invalid argument
Stack trace (most recent call last) in thread 245:
#18   Object "", at 0xffffffffffffffff, in 
#17   Object "/lib/x86_64-linux-gnu/libc.so.6", at 0x7fae78dc561e, in clone
#16   Object "/lib/x86_64-linux-gnu/libpthread.so.0", at 0x7fae7c25f6da, in start_thread
#15   Object "/usr/lib/x86_64-linux-gnu/libboost_thread.so.1.65.1", at 0x7fae7caa0bcc, in boost::this_thread::interruption_requested()
#14   Object "/opt/ros/melodic/lib/libroscpp.so", at 0x7faeb3cbdef0, in ros::internalCallbackQueueThreadFunc()
#13   Object "/opt/ros/melodic/lib/libroscpp.so", at 0x7faeb3c7e2fa, in ros::CallbackQueue::callAvailable(ros::WallDuration)
#12   Object "/opt/ros/melodic/lib/libroscpp.so", at 0x7faeb3c7c558, in ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*)
#11   Object "/opt/ros/melodic/lib/libroscpp.so", at 0x7faeb3cb267a, in ros::TimerManager<ros::SteadyTime, ros::WallDuration, ros::SteadyTimerEvent>::TimerQueueCallback::call()
#10   Object "/opt/ros/melodic/lib/libroscpp.so", at 0x7faeb3cad644, in ros::TransportPublisherLink::onRetryTimer(ros::SteadyTimerEvent const&)
#9    Object "/opt/ros/melodic/lib/libroscpp.so", at 0x7faeb3ca32c2, in ros::TransportTCP::connect(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)
#8    Object "/opt/ros/melodic/lib/libroscpp.so", at 0x7faeb3ca1e34, in ros::TransportTCP::initializeSocket()
#7    Object "/opt/ros/melodic/lib/libroscpp.so", at 0x7faeb3cdbb31, in ros::PollSet::addSocket(int, boost::function<void (int)> const&, boost::shared_ptr<ros::Transport> const&)
#6    Object "/opt/ros/melodic/lib/libroscpp.so", at 0x7faeb3c2a368, in void boost::throw_exception<boost::lock_error>(boost::lock_error const&)
#5    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", at 0x7fae796ddd53, in __cxa_throw
#4    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", at 0x7fae796ddb20, in std::terminate()
#3    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", at 0x7fae796ddae5, in std::rethrow_exception(std::__exception_ptr::exception_ptr)
#2    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", at 0x7fae796d7956, in 
#1    Object "/lib/x86_64-linux-gnu/libc.so.6", at 0x7fae78ce47f0, in abort
#0    Object "/lib/x86_64-linux-gnu/libc.so.6", at 0x7fae78ce2e87, in gsignal
Aborted (Signal sent by tkill() 231 0)

After digging through the source code, I found that this is indeed a bug in ROS itself, which happens in the following circumstances:

The user code calls ros::start, but not ros::shutdown.
There is some TCP transport publisher link which is dropped and happens to try to reconnect after the program's main() has ended

In this case, the sequence that leads to the crash is the following:

In ros::init(), the ros::atexitCallback is registered to be called after main() has ended.
In ros::start(), Singletons are created and a thread which calls internalCallbackQueueThreadFunc is started, which calls callbacks from ros::g_internal_callback_queue.
TransportPublisherLink creates a timer which calls TransportPublisherLink::onRetryTimer and pushes this onto ros::g_internal_callback_queue, accessed via ros::getInternalCallbackQueue().

Once the user's main() function exits:

g_internal_queue_thread is destructed, which results in the underlying thread being detached, i.e. internalCallbackQueueThreadFunc continues to run
all singletons are destructed, including PollManager.
atexitCallback is called and calls ros::shutdown, which sets g_shutting_down to true, signaling the internalCallbackQueueThreadFunc to exit, and waits for the thread to join.
The internalCallbackQueueThreadFunc thread however has already entered the call to queue->callAvailable, which in turn calls TransportPublisherLink::onRetryTimer.
TransportPublisherLink::onRetryTimer calls PollManager::instance(), which returns the address of the destructed singleton object
The following call to TransportTCP::connect results in an exception, since it tries to lock an already-destructed mutex.

The fix in this PR works around this problem by making sure that the internalCallbackQueueThreadFunc thread is ended before singletons are destroyed.

It does this by stopping the thread in the destructor of a InternalQueueJoiningThread instance which is declared as a static variable inside of the initInternalQueueJoiningThread function.

This function is called after the singleton instances (which are also static function-scope variables) are created, which means that the compiler must generate code that destroys the InternalQueueJoiningThread object, and thus joins the thread, before the other Singletons are destroyed.

Note that, as mentioned above, init.cpp registers atexitCallback specifically for the case that ros::shutdown was not called by the user code. However, it fails to address the issue described above since it is called after singletons have been destroyed but before the internalCallbackQueueThreadFunc thread has ended.

This was meant to be addressed by this PR,, but the fix was incomplete.

dgossow commented 1 year ago

@mjcarroll could you PTAL?

dgossow commented 1 year ago

@rhaschke could you take another look?

dgossow commented 1 year ago

The destruction order ensured by the compiler should already guarantee that the thread finishes before singleton destruction.

I guess because of this?

d) For each local object obj with static storage duration, obj is destroyed as if a function calling the destructor of obj were registered with std::atexit at the completion of the constructor of obj.

c) If the completion of the initialization of a static object A was sequenced-before the call to std::atexit for some function F, the call to F during termination is sequenced-before the start of the destruction of A

Order of de-initialization used to be:

singleton destructors
atexitCallback
g_internal_queue_thread destructor

Now it is

ShutdownCaller destructor
singleton destructors
g_internal_queue_thread destructor

rhaschke commented 1 year ago

I'm afraid we're talking past each other. My point is that the shutdown handler isn't registered in ros::init() at all anymore. I would expect that you replace the atexit registration with ShutdownCaller instantiation, i.e. calling initShutdownCaller(), but you just removed the atexit registration w/o replacing it with something else.

The ShutdownCaller is only initialized in shutdownCallback(), which is only called on an external shutdown request: https://github.com/ros/ros_comm/blob/842f0f026924323f605495da0b80493f15f0bdce/clients/roscpp/src/libros/init.cpp#L322

Actually, I would expect the call to initShutdownCaller() only in ros::init() and not in the latter callback.

dgossow commented 1 year ago

The ShutdownCaller is only initialized in shutdownCallback(), which is only called on an external shutdown request:

The GitHub diff view is a little misleading here. initShutdownCaller is now called in void start() and not shutdownCallback`

I'm afraid we're talking past each other. My point is that the shutdown handler isn't registered in ros::init() at all anymore.

Sorry, I didn't respond to that comment yet.

Yes, that is true. With this change, ros::init does not register any type of exit handler anymore. Only ros::start does.

atexitHandler used to call ros::shutdown (with g_started set to false).

My understanding was that shutdown only has to be called if start has been called. But you are right, some of the resources it frees are allocated in init().

Looking at the shutdown code in ros::shutdown:

void shutdown()
{
  boost::recursive_mutex::scoped_lock lock(g_shutting_down_mutex);
  if (g_shutting_down)
    return;
  else
    g_shutting_down = true;

  ros::console::shutdown(); // ROSCONSOLE_AUTOINIT is called in ros::init

  g_global_queue->disable(); // created in init()
  g_global_queue->clear();

  if (g_internal_queue_thread.get_id() != boost::this_thread::get_id())
  {
    g_internal_queue_thread.join();
  } // created in start()

  //ros::console::deregister_appender(g_rosout_appender);
  delete g_rosout_appender; // created in start()
  g_rosout_appender = 0;

  if (g_started)
  {
    TopicManager::instance()->shutdown();
    ServiceManager::instance()->shutdown();
    PollManager::instance()->shutdown();
    ConnectionManager::instance()->shutdown();
    XMLRPCManager::instance()->shutdown();
  }

  g_started = false;
  g_ok = false;
  Time::shutdown(); // ros::Time::init(); is called in start()
}

dgossow commented 1 year ago

If you take the effort to write a test, why don't you make a true unit test from it, i.e. adding some assertions?

the assertions are made in the atexit callback, this is why gtest is not used, if this was your question.

the missing assignment of hasError was a bug, it's fixed now

dgossow commented 1 year ago

I'll see if I can fix the mismatch between init, start and shutdown

dgossow commented 1 year ago

Switching from boolean variables to static objects is probably the only way to correctly fix that issue.

Yes, it's a mess and much more of a rabbit hole than I signed up for.

I pushed another fix.

dgossow commented 1 year ago

oh what have I gotten myself into. This code is such a f-ing mess

dgossow commented 1 year ago

Alright.

It seems like the current implementation preserves the previous behavior, except for joining the thread before singletons are destructed.

I also tested that introducing various bugs does indeed fail the tests.

dgossow commented 1 year ago

@rhaschke Thanks a lot for the quick review cycle. would you mind taking another look?

rhaschke commented 1 year ago

would you mind taking another look?

I'm in a meeting now and then driving home. I will have a look into that later in the evening again.

dgossow commented 1 year ago

@rhaschke happy to also jump on a call if you want to talk through the changes here.

I opened this PR on a whim, but now that I've spent a full day on it, I'm kind of invested in it.

dgossow commented 1 year ago

@rhaschke I also added a more readable explanation to the PR description.

dgossow commented 1 year ago

@mjcarroll could you merge this?

dgossow commented 1 year ago

@ros/ros_comm-maintainers could someone give this a final review and merge?

mjcarroll commented 1 year ago

This looks good to me, @sloretz do you want to take a look before we merge?

sloretz commented 1 year ago

Thank you for the PR and clear description!

ros / ros_comm

Fix occasional crash during shutdown when explicitly calling ros::start but not ros::shutdown #2355

Summary

Detailed Description