robosoft-ai / SMACC

An Event-Driven, Asynchronous, Behavioral State Machine Library for ROS (Robotic Operating System) applications written in C++
https://smacc.dev
BSD 3-Clause "New" or "Revised" License
295 stars 50 forks source link

Segfault when transitioning from state with a high frequency callback client behaviour #86

Open yassiezar opened 1 year ago

yassiezar commented 1 year ago

I've come across what might be a bug.

In my system, I have a CB subscribed to a high-frequency teleop topic with the standard SMACC signal system, which then publishes messages onto another topic via another SMACC client. However, I've experienced segfaults occurring during state transitions when publishing a message from within the subscriber CB callback. GDB confirmed my suspicion that I'm referencing a nullptr when trying to publish a message via the client pointer. I suspect the CB's onExit() has been called and all the class members have been deallocated before the client pointer is dereferenced and the message is published, triggering a segfault.

The event timeline looks something like this:

- topic data published  -> onMessageReceived() triggered -> dereference class pointer to client -> SEGFAULT
-        transition triggered -> call onExit() -> destroy CB

Does this make sense? I've added an example node here that hopefully make things clearer. The issue seems to be that there's no way to tell when a pointer to a client has been deallocated inside a CB callback, except for maybe doing a check for nullptr before attempting to deallocate e.g.

if(cl_pub_ == nullptr) return;
cl_pub_->publish(msg)

but this is effectively a race condition (the pointer can still be deallocated between these instructions). This issue only seems to become apparent with high frequency callback triggers - lower frequency CB's manage to destroy themselves more predictably

The obvious solution to me would be to replace the naked pointers with std::shared_ptr<TClient> instead and return std::weak_ptr<TClient> to the CBs via the requiresClient() calls. On other words, a typical CB would go from

class CbExample : public smacc::ISmaccClientBehavior {
  public:
    void onEntry() override {
      if(!this->requiresClient(cl_pub_)) {
        ROS_ERROR("Client doesn't exist");
        return;
      }
      std_msgs::Empty msg;
      cl_pub_->publish(msg);
    }
  private:
    ClPub* cl_pub_;
};

to

class CbExample : public smacc::ISmaccClientBehavior {
  public:
    void onEntry() override {
      cl_pub_ = this->requiresClient();
      if(auto lock = cl_pub_.lock()) {
        std_msgs::Empty msg;
        lock->publish(msg);
      }
      else {
        ROS_WARN("ClPub went out of scope");
      }
    }
  private:
    std::weak_ptr<ClPub> cl_pub_;
};

In my opinion, this has significant safety benefits and will prevent silly errors like the one I described above. I'm happy to look at this and open a PR. However, this refactor will need significant changes to SMACC's API. Thoughts?

┆Issue is synchronized with this Jira Task by Unito

pabloinigoblasco commented 1 year ago

Hello yassierzar,

Thank you for bringing this issue to our attention and providing an example. Ensuring pointer safety in high-speed applications is a something important for us, and we appreciate your help in identifying any potential race conditions.

Not sure if I understand totally the issue. I believe that such situations should have been considered and prevented already in our current code. I will reproduce and debug the example you provided to learn more about it.

Before making any decisions about breaking changes in the code, let's carefully analyze and discuss this matter in more detail. Thank you again for bringing this to our attention, we will find a solution for this case.

pabloinigoblasco commented 1 year ago

I am reproducing your example from your fork repo and branch. It is not crashing for me, Do you think there is anything I could do to make it crash?

yassiezar commented 1 year ago

Not sure if I understand totally the issue. I believe that such situations should have been considered and prevented already in our current code. I will reproduce and debug the example you provided to learn more about it.

Yes, its a weird issue and its quite difficult to explain it clearly in text. I think its a case of a ROS callback executed on a high frequency will sometimes reference a class member pointer to a client while the owning class (e.g. a client behaviour) is being destroyed. With low frequency callbacks, I think it won't be as much of a problem because your signalling system will prevent callbacks from being triggered while the CB is being destroyed. I don't mind setting up a call with you and Brett to explain this if its not clear.

Before making any decisions about breaking changes in the code, let's carefully analyze and discuss this matter in more detail. Thank you again for bringing this to our attention, we will find a solution for this case.

Of course, I was mostly thinking out loud and making notes for future discussions (I often forget about ideas). However, I think there is a discussion to be had (perhaps at another venue) around using naked pointers as opposed to smart pointers that could perhaps be more robust against these kinds of errors.

I am reproducing your example from your fork repo and branch. It is not crashing for me, Do you think there is anything I could do to make it crash?

Interesting, I just ran it and it crashes roughly 5 seconds after start as soon as the state transitions (see log below). Are you running the sm_atomic_early_exit node? I.e. rosrun sm_atomic_early_exit sm_atomic_early_exit_node

[ INFO] [1676629749.311695474]: [SmaccStateMachine] initiate_impl
[ WARN] [1676629749.311827340]: State machine base creation:SmAtomicEarlyExit
[ INFO] [1676629749.313078562]: [SmaccStateMachine] Initializing ROS communication mechanisms
[ INFO] [1676629749.313092726]: [SmaccStateMachine] Initializing state machine
[ WARN] [1676629749.313100908]: [sm_atomic_early_exit::State1] creating 
[ INFO] [1676629749.313140385]: [sm_atomic_early_exit::State1] Creating ros NodeHandle for this state: SmAtomicEarlyExit
[ INFO] [1676629749.313171077]: [sm_atomic_early_exit::State1] State object created. Initializating...
[ INFO] [1676629749.313688321]: [sm_atomic_early_exit::State1] -- STATIC STATE DESCRIPTION --
[ INFO] [1676629749.313699753]: [sm_atomic_early_exit::State1] Creating static client behavior: cl_ros_timer::CbTimerCountdownLoop
[ INFO] [1676629749.313707296]: [sm_atomic_early_exit::State1] Configuring orthogonal: sm_atomic_early_exit::OrTimer
[ INFO] [1676629749.313720177]: [Orthogonal sm_atomic_early_exit::OrTimer] adding client behavior: cl_ros_timer::CbTimerCountdownLoop
[ INFO] [1676629749.313726979]: [sm_atomic_early_exit::State1] Creating static client behavior: cl_ros_timer::CbTimerCountdownOnce
[ INFO] [1676629749.313734835]: [sm_atomic_early_exit::State1] Configuring orthogonal: sm_atomic_early_exit::OrTimer
[ INFO] [1676629749.313742767]: [Orthogonal sm_atomic_early_exit::OrTimer] adding client behavior: cl_ros_timer::CbTimerCountdownOnce
[ INFO] [1676629749.313750481]: [sm_atomic_early_exit::State1] ---- END STATIC DESCRIPTION
[ INFO] [1676629749.313757522]: [sm_atomic_early_exit::State1] State runtime configuration
[ INFO] [1676629749.313765311]: [Orthogonal sm_atomic_early_exit::OrTimer] runtimeConfigure, current Behavior: cl_ros_timer::CbTimerCountdownLoop
[ INFO] [1676629749.313772765]: [Orthogonal sm_atomic_early_exit::OrTimer] runtimeConfigure, current Behavior: cl_ros_timer::CbTimerCountdownOnce
[ WARN] [1676629749.313780869]: [StateMachine] setting state active : sm_atomic_early_exit::State1
[ INFO] [1676629749.313793460]: [sm_atomic_early_exit::State1] State OnEntry
[ INFO] [1676629749.313799884]: On Entry!
[ INFO] [1676629749.313806883]: [sm_atomic_early_exit::State1] State OnEntry code finished
[ INFO] [1676629749.313813699]: [Orthogonal sm_atomic_early_exit::OrExample] OnEntry
[ INFO] [1676629749.313820767]: [Orthogonal sm_atomic_early_exit::OrTimer] OnEntry, current Behavior: cl_ros_timer::CbTimerCountdownLoop
[ INFO] [1676629749.313838542]: [StateMachine] life-time constrained smacc signal subscription created. Subscriber is cl_ros_timer::CbTimerCountdownLoop
[ INFO] [1676629749.313846669]: [Orthogonal sm_atomic_early_exit::OrTimer] OnEntry, current Behavior: cl_ros_timer::CbTimerCountdownOnce
[ INFO] [1676629749.313855695]: [StateMachine] life-time constrained smacc signal subscription created. Subscriber is cl_ros_timer::CbTimerCountdownOnce
[ WARN] [1676629749.313862700]: [StateMachine] setting state active : sm_atomic_early_exit::State1
[ WARN] [1676629749.357850471]: Signal detector frequency (ros param ~signal_detector_loop_freq): 20.000000
[ INFO] [1676629749.359349223]: [SignalDetector] loop rate hz:20
[ INFO] [1676629749.359424795]: [SignalDetector] running in single threaded mode
[ INFO] [1676629749.359476609]: [SignalDetector] heartbeat
[ INFO] [1676629754.309856895]: NOTIFY TRANSITION: smacc::Transition<cl_ros_timer::EvTimer<cl_ros_timer::CbTimerCountdownOnce, sm_atomic_early_exit::OrTimer>, sm_atomic_early_exit::State2, smacc::default_transition_tags::SUCCESS, boost::statechart::detail::no_context<cl_ros_timer::EvTimer<cl_ros_timer::CbTimerCountdownOnce, sm_atomic_early_exit::OrTimer> >, &boost::statechart::detail::no_context<cl_ros_timer::EvTimer<cl_ros_timer::CbTimerCountdownOnce, sm_atomic_early_exit::OrTimer> >::no_function>
[ WARN] [1676629754.309993305]: exiting state: sm_atomic_early_exit::State1
[ INFO] [1676629754.310076794]: Notification State Exit: leaving state0x7f9b500088f0
[ INFO] [1676629754.310190692]: [Orthogonal sm_atomic_early_exit::OrExample] OnExit
[ INFO] [1676629754.310248891]: [Orthogonal sm_atomic_early_exit::OrTimer] OnExit, current Behavior: cl_ros_timer::CbTimerCountdownLoop
[ INFO] [1676629754.310293966]: [Orthogonal sm_atomic_early_exit::OrTimer] OnExit, current Behavior: cl_ros_timer::CbTimerCountdownOnce
[ WARN] [1676629754.310369701]: Client behavior deallocated.
[ WARN] [1676629754.310438975]: Client behavior deallocated.
[ WARN] [1676629754.310496246]: [StateMachine] Disconnecting scoped-lifetime SmaccSignal subscription
[ WARN] [1676629754.310545275]: [StateMachine] Disconnecting scoped-lifetime SmaccSignal subscription
[ INFO] [1676629754.310585711]: On Exit!
[ WARN] [1676629754.310625609]: state exit: sm_atomic_early_exit::State1
[ WARN] [1676629754.310674146]: [sm_atomic_early_exit::State2] creating 
[ INFO] [1676629754.310771453]: [sm_atomic_early_exit::State2] Creating ros NodeHandle for this state: SmAtomicEarlyExit
[ INFO] [1676629754.310833681]: [sm_atomic_early_exit::State2] State object created. Initializating...
[ INFO] [1676629754.319295376]: [sm_atomic_early_exit::State2] -- STATIC STATE DESCRIPTION --
[ INFO] [1676629754.319378369]: [sm_atomic_early_exit::State2] Creating static client behavior: cl_ros_timer::CbTimerCountdownOnce
[ INFO] [1676629754.319425234]: [sm_atomic_early_exit::State2] Configuring orthogonal: sm_atomic_early_exit::OrTimer
[ INFO] [1676629754.319477522]: [Orthogonal sm_atomic_early_exit::OrTimer] adding client behavior: cl_ros_timer::CbTimerCountdownOnce
[ INFO] [1676629754.319534497]: [sm_atomic_early_exit::State2] Creating static client behavior: sm_atomic_early_exit::CbExamplePub
[ INFO] [1676629754.319596191]: [sm_atomic_early_exit::State2] Configuring orthogonal: sm_atomic_early_exit::OrExample
[ INFO] [1676629754.319667641]: [Orthogonal sm_atomic_early_exit::OrExample] adding client behavior: sm_atomic_early_exit::CbExamplePub
[ INFO] [1676629754.319720762]: [sm_atomic_early_exit::State2] Creating static client behavior: sm_atomic_early_exit::CbExampleSub
[ INFO] [1676629754.319770833]: [sm_atomic_early_exit::State2] Configuring orthogonal: sm_atomic_early_exit::OrExample
[ INFO] [1676629754.319823136]: [Orthogonal sm_atomic_early_exit::OrExample] adding client behavior: sm_atomic_early_exit::CbExampleSub
[ INFO] [1676629754.319873761]: [sm_atomic_early_exit::State2] ---- END STATIC DESCRIPTION
[ INFO] [1676629754.319925516]: [sm_atomic_early_exit::State2] State runtime configuration
[ INFO] [1676629754.319961619]: Entering State2
[ INFO] [1676629754.319997934]: [Orthogonal sm_atomic_early_exit::OrExample] runtimeConfigure, current Behavior: sm_atomic_early_exit::CbExamplePub
[ INFO] [1676629754.320039657]: [Orthogonal sm_atomic_early_exit::OrExample] runtimeConfigure, current Behavior: sm_atomic_early_exit::CbExampleSub
[ INFO] [1676629754.320078436]: [Orthogonal sm_atomic_early_exit::OrTimer] runtimeConfigure, current Behavior: cl_ros_timer::CbTimerCountdownOnce
[ WARN] [1676629754.320116627]: [StateMachine] setting state active : sm_atomic_early_exit::State2
[ INFO] [1676629754.320167367]: [sm_atomic_early_exit::State2] State OnEntry
[ INFO] [1676629754.320203706]: On Entry!
[ INFO] [1676629754.320239907]: [sm_atomic_early_exit::State2] State OnEntry code finished
[ INFO] [1676629754.320279298]: [Orthogonal sm_atomic_early_exit::OrExample] OnEntry, current Behavior: sm_atomic_early_exit::CbExamplePub
[ INFO] [1676629754.320320714]: [Orthogonal sm_atomic_early_exit::OrExample] OnEntry, current Behavior: sm_atomic_early_exit::CbExampleSub
[ INFO] [1676629754.320377293]: [StateMachine] life-time constrained smacc signal subscription created. Subscriber is smacc::CbSubscriptionCallbackBase<std_msgs::Empty_<std::allocator<void> > >
[ INFO] [1676629754.320421849]: [Orthogonal sm_atomic_early_exit::OrTimer] OnEntry, current Behavior: cl_ros_timer::CbTimerCountdownOnce
[ INFO] [1676629754.320472790]: [StateMachine] life-time constrained smacc signal subscription created. Subscriber is cl_ros_timer::CbTimerCountdownOnce
[ WARN] [1676629754.320511390]: [StateMachine] setting state active : sm_atomic_early_exit::State2
fish: “rosrun sm_atomic_early_exit sm_…” terminated by signal SIGSEGV (Address boundary error)
pabloinigoblasco commented 1 year ago

Indeed, I was in some incorrect branch, now I was able to reproduce the error. I tested, and debugged.

During my debugging the segfault happens here: https://github.com/yassiezar/SMACC/blob/19dec7998e55c48b87337343af1d92cc080b7841/smacc_sm_reference_library/sm_atomic_early_exit/include/sm_atomic_early_exit/clients/client_behaviors/cb_example_sub.h#L9

But, why shouldnt that raise a segmentation fault? cl_pub is nullptr.

Where is the requiresClient call?

yassiezar commented 1 year ago

That's a good point and a silly mistake on my part. I've edited the example to add the missing requresClient() call and now it doesn't crash. I'll do some more digging to try and diagnose

yassiezar commented 1 year ago

We've been running into this problem more consistently now and seems to be problem with any transition, not only high-frequency ones. The problem seems centred around long-lived objects (components, clients), where their pointers are being dereferenced or made to point at invalid memory upon state transitions.

For example, I print out the address of a particularly troublesome component after a requiresComponent call and get the address 0xfffe7e193010. When the SM crashes on a segfault during a transition however, the address the component points at is 0xffff00000000, i.e. its seems to not be pointing at the right memory anymore. However, inspecting the memory in GDB gives

0xfffe7e193010:

p *(fox_rt::CpTeleop*)0xfffe7e193010
$3 = {<smacc::ISmaccComponent> = {_vptr.ISmaccComponent = 0xffff00000000, stateMachine_ = 0x141, 
    owner_ = 0xfffe7fa03900}, kMaxLinearVelScale = 1.390639178345914e-309, kMaxAngularVelScale = 0, 
  kNormalLinearVelScale = 0, kNormalAngularVelScale = 0, button_map_ = std::map with 0 elements, 
  mapping_enabled_ = {_M_base = {static _S_alignment = 1, _M_i = false}}, 
  mtx_ = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, 
          __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, 
        __size = '\000' <repeats 40 times>, "\001\000\000\000\001\000\000", 
        __align = 0}}, <No data fields>}}

0xffff00000000:

p *(fox_rt::CpTeleop*)0xffff00000000
$4 = {<smacc::ISmaccComponent> = {_vptr.ISmaccComponent = 0xffff00000020, stateMachine_ = 0x0, 
    owner_ = 0x21000}, kMaxLinearVelScale = 6.6781865217069613e-319, 
  kMaxAngularVelScale = 4.2439915819305446e-314, kNormalLinearVelScale = 4.9406564584124654e-324, 
  kNormalAngularVelScale = 0, 
  button_map_ = std::map with 281470681751280 elements<error reading variable: Cannot access memory at address 0x18>, mapping_enabled_ = {_M_base = {static _S_alignment = 1, _M_i = false}}, 
  mtx_ = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, 
          __nusers = 0, __kind = 9072, __spins = 65535, __list = {__prev = 0x0, 
            __next = 0xffff00000080}}, 
        __size = '\000' <repeats 16 times>, "p#\000\000\377\377\000\000\000\000\000\000\000\000\000\000\200\000\000\000\377\377\000\000\200\000\000\000\377\377\000", __align = 0}}, <No data fields>}}

We can see the component at 0xfffe7e193010 is valid and contains the data we'd expect, while 0xffff00000000 contains nonsense. This happens elsewhere with clients as well, but this case above happens most often.

Components and clients are long-lived objects and I can see that their memory is still valid. However, the CB's member variables, including the pointers to long-lived objects, seem to become invalid or change after/during transitions, which causes the segfaults we've been experiencing. The pointers also don't reset to nullptr so I there's no way to determine whether they're invalid during runtime. I'm at a loss of what might be happening or how I can prevent/detect this pointer reallocation. Any ideas?

yassiezar commented 1 year ago

Could a potential workaround for checking CB deallocation be to see if this == nullptr, like its set here? Also, why is only the zero'th CB being set to nullptr here?

It also looks like the CBs' onExit() calls are done before the signals are disconnected here. Would it make sense to move the disconnection step before the onExit() calls? onExit implies that the CB is done and ready to be destroyed, but this isn't the case because the signal lives on and a CB's callback can still be triggered, at which point it will call deallocated member variables and cause segfaults.

pabloinigoblasco commented 1 year ago

I have been having a look to the code, it is right there is a race condition for multithreaded application that could cause a callback on a destroyed object. Hence, that is a bug that it is being shown in your application.

Disconnection of callbacks for life-time objects must be (by design) just before they are disposed.

This is the first solution I propose: https://github.com/robosoft-ai/SMACC/pull/91