Closed dgossow closed 1 year ago
@mjcarroll could you PTAL?
@rhaschke could you take another look?
The destruction order ensured by the compiler should already guarantee that the thread finishes before singleton destruction.
I guess because of this?
d) For each local object obj with static storage duration, obj is destroyed as if a function calling the destructor of obj were registered with std::atexit at the completion of the constructor of obj.
c) If the completion of the initialization of a static object A was sequenced-before the call to std::atexit for some function F, the call to F during termination is sequenced-before the start of the destruction of A
Order of de-initialization used to be:
Now it is
I'm afraid we're talking past each other. My point is that the shutdown handler isn't registered in ros::init()
at all anymore.
I would expect that you replace the atexit
registration with ShutdownCaller
instantiation, i.e. calling initShutdownCaller()
, but you just removed the atexit
registration w/o replacing it with something else.
The ShutdownCaller
is only initialized in shutdownCallback()
, which is only called on an external shutdown
request:
https://github.com/ros/ros_comm/blob/842f0f026924323f605495da0b80493f15f0bdce/clients/roscpp/src/libros/init.cpp#L322
Actually, I would expect the call to initShutdownCaller()
only in ros::init()
and not in the latter callback.
The ShutdownCaller is only initialized in shutdownCallback(), which is only called on an external shutdown request:
The GitHub diff view is a little misleading here. initShutdownCaller
is now called in void start()
and not shutdownCallback`
I'm afraid we're talking past each other. My point is that the shutdown handler isn't registered in ros::init() at all anymore.
Sorry, I didn't respond to that comment yet.
Yes, that is true. With this change, ros::init
does not register any type of exit handler anymore. Only ros::start
does.
atexitHandler
used to call ros::shutdown
(with g_started
set to false
).
My understanding was that shutdown
only has to be called if start
has been called. But you are right, some of the resources it frees are allocated in init()
.
Looking at the shutdown code in ros::shutdown
:
void shutdown()
{
boost::recursive_mutex::scoped_lock lock(g_shutting_down_mutex);
if (g_shutting_down)
return;
else
g_shutting_down = true;
ros::console::shutdown(); // ROSCONSOLE_AUTOINIT is called in ros::init
g_global_queue->disable(); // created in init()
g_global_queue->clear();
if (g_internal_queue_thread.get_id() != boost::this_thread::get_id())
{
g_internal_queue_thread.join();
} // created in start()
//ros::console::deregister_appender(g_rosout_appender);
delete g_rosout_appender; // created in start()
g_rosout_appender = 0;
if (g_started)
{
TopicManager::instance()->shutdown();
ServiceManager::instance()->shutdown();
PollManager::instance()->shutdown();
ConnectionManager::instance()->shutdown();
XMLRPCManager::instance()->shutdown();
}
g_started = false;
g_ok = false;
Time::shutdown(); // ros::Time::init(); is called in start()
}
If you take the effort to write a test, why don't you make a true unit test from it, i.e. adding some assertions?
the assertions are made in the atexit
callback, this is why gtest is not used, if this was your question.
the missing assignment of hasError
was a bug, it's fixed now
I'll see if I can fix the mismatch between init
, start
and shutdown
Switching from boolean variables to static objects is probably the only way to correctly fix that issue.
Yes, it's a mess and much more of a rabbit hole than I signed up for.
I pushed another fix.
oh what have I gotten myself into. This code is such a f-ing mess
Alright.
It seems like the current implementation preserves the previous behavior, except for joining the thread before singletons are destructed.
I also tested that introducing various bugs does indeed fail the tests.
@rhaschke Thanks a lot for the quick review cycle. would you mind taking another look?
would you mind taking another look?
I'm in a meeting now and then driving home. I will have a look into that later in the evening again.
@rhaschke happy to also jump on a call if you want to talk through the changes here.
I opened this PR on a whim, but now that I've spent a full day on it, I'm kind of invested in it.
@rhaschke I also added a more readable explanation to the PR description.
@mjcarroll could you merge this?
@ros/ros_comm-maintainers could someone give this a final review and merge?
This looks good to me, @sloretz do you want to take a look before we merge?
Thank you for the PR and clear description!
Summary
In the case that user code calls
ros::start
explicitly, theros::NodeHandle
class will not callros::shutdown
when the last instance is destroyed.In order to make sure that the program can exit error-free nonetheless, currently
ros::init
registers theros::atexitCallback
callback to be executed after the user'smain
has finished, which performs clean-up of resources that were set up ininit
.However,
atexitCallback
is registered before the various ROS singletons are instantiated, which means that it is called after the singletons have been destroyed.This PR addressed most of the problem, however with one remaining issue:
ros::start
creates a thread, which will continue to execute after singletons have been destroyed and untilros::atexitCallback
is called. So, in rare circumstances, this thread can access singleton objects after they have been destroyed.This PR fixes the issue by ensuring that
ros::shutdown
is called aftermain
has returned, but before singletons are destroyed.Detailed Description
We came across a quite rare crash of ROS nodes during shutdown at my company with the following backtrace:
main thread:
second thread:
After digging through the source code, I found that this is indeed a bug in ROS itself, which happens in the following circumstances:
ros::start
, but notros::shutdown
.main()
has endedIn this case, the sequence that leads to the crash is the following:
ros::init()
, theros::atexitCallback
is registered to be called aftermain()
has ended.ros::start()
, Singletons are created and a thread which callsinternalCallbackQueueThreadFunc
is started, which calls callbacks fromros::g_internal_callback_queue
.TransportPublisherLink::onRetryTimer
and pushes this ontoros::g_internal_callback_queue
, accessed via ros::getInternalCallbackQueue().Once the user's
main()
function exits:internalCallbackQueueThreadFunc
continues to runatexitCallback
is called and callsros::shutdown
, which setsg_shutting_down
to true, signaling theinternalCallbackQueueThreadFunc
to exit, and waits for the thread to join.internalCallbackQueueThreadFunc
thread however has already entered the call toqueue->callAvailable
, which in turn callsTransportPublisherLink::onRetryTimer
.TransportPublisherLink::onRetryTimer
callsPollManager::instance()
, which returns the address of the destructed singleton objectTransportTCP::connect
results in an exception, since it tries to lock an already-destructed mutex.The fix in this PR works around this problem by making sure that the
internalCallbackQueueThreadFunc
thread is ended before singletons are destroyed.It does this by stopping the thread in the destructor of a
InternalQueueJoiningThread
instance which is declared as a static variable inside of theinitInternalQueueJoiningThread
function.This function is called after the singleton instances (which are also static function-scope variables) are created, which means that the compiler must generate code that destroys the
InternalQueueJoiningThread
object, and thus joins the thread, before the other Singletons are destroyed.Note that, as mentioned above, init.cpp registers
atexitCallback
specifically for the case thatros::shutdown
was not called by the user code. However, it fails to address the issue described above since it is called after singletons have been destroyed but before theinternalCallbackQueueThreadFunc
thread has ended.This was meant to be addressed by this PR,, but the fix was incomplete.