robosoft-ai / SMACC2

An Event-Driven, Asynchronous, Behavioral State Machine Library for ROS2 (Robotic Operating System) applications written in C++
https://smacc.dev
Apache License 2.0
244 stars 40 forks source link

Update from 0.4.0 to 2.3.18 on 22.04 Humble not transitioning to next inner state #530

Closed Crowdedlight closed 8 months ago

Crowdedlight commented 1 year ago

Describe the bug After updating smacc2 to 2.3.18 my state machine no longer transitions into the next inner state in a superstate. The transition rule is triggered and the RTA also confirm the trigger is executed, but the next state is never created and state machine just comes to a halt with no further output.

Any suggestions as to how I debug what is causing it?

I have tried to comment out all orthogonals in the next state, but it has no change. I tried to find a changelog for the update 0.4.0 to 2.3.18 and see if something changed for inner-states in a super-state but didn't find any. This problem is not present in 0.4.0.

The state machine goes through 2-3 "normal" states before it gets to the superstate, and those work as usual.

Added image of the super-state and events from RTA. image

Environment (please complete the following information):

ROS DETAILS:

REPO DETAILS: smacc2 2.3.18 installed through ubuntu PPA apt install ros-humble-smacc2

Error Details

[navigator_node-2] [INFO] [1697204323.675913356] [StiNextPosState]: TRANSITION RULE TRIGGERED: smacc2::Transition<smacc2::EvCbSuccess<navigator::cl_drone_interface::CbTargetPosChecker, navigator::OrDroneInterface>, navigator::mission_executor_super_state::navigator::inner_states::StiCaptureState, smacc2::default_transition_tags::CONTINUELOOP, boost::statechart::detail::no_context<smacc2::EvCbSuccess<navigator::cl_drone_interface::CbTargetPosChecker, navigator::OrDroneInterface> >, &boost::statechart::detail::no_context<smacc2::EvCbSuccess<navigator::cl_drone_interface::CbTargetPosChecker, navigator::OrDroneInterface> >::no_function>
[navigator_node-2] [WARN] [1697204323.675985855] [navigator]: exiting state: navigator::mission_executor_super_state::navigator::inner_states::StiNextPosState
[navigator_node-2] [INFO] [1697204323.676001259] [navigator]: Notification State Exit: leaving state 0x7f15742a3f80
[navigator_node-2] [INFO] [1697204323.676011963] [navigator]: [Orthogonal navigator::OrCameraInterface] OnExit
[navigator_node-2] [INFO] [1697204323.676034527] [navigator]: [Orthogonal navigator::OrDroneInterface] OnExit, current Behavior: navigator::cl_drone_interface::CbUpdateDroneData
[navigator_node-2] [INFO] [1697204323.676045738] [navigator]: [navigator::cl_drone_interface::CbUpdateDroneData] onExit - join async onEntry thread
[navigator_node-2] [INFO] [1697204323.676059316] [navigator]: [navigator::cl_drone_interface::CbUpdateDroneData] onExit - Creating asynchronous onExit thread
[navigator_node-2] [INFO] [1697204323.676096318] [navigator]: [Orthogonal navigator::OrDroneInterface] OnExit, current Behavior: navigator::cl_drone_interface::CbTargetPosChecker
[navigator_node-2] [INFO] [1697204323.676106365] [navigator]: [navigator::cl_drone_interface::CbTargetPosChecker] onExit - join async onEntry thread
[navigator_node-2] [INFO] [1697204323.676116317] [navigator]: [navigator::cl_drone_interface::CbTargetPosChecker] onExit - Creating asynchronous onExit thread
[navigator_node-2] [INFO] [1697204323.676134755] [navigator]: [Orthogonal navigator::OrDroneInterface] OnExit, current Behavior: navigator::cl_drone_interface::CbSendTargetPos
[navigator_node-2] [INFO] [1697204323.676227302] [navigator]: [navigator::cl_drone_interface::CbSendTargetPos] onExit - join async onEntry thread
[navigator_node-2] [INFO] [1697204323.676138942] [navigator]:  [CbTargetPosChecker] target_pos_checker OnExit()
[navigator_node-2] [INFO] [1697204323.676252770] [navigator]: [navigator::cl_drone_interface::CbSendTargetPos] onExit - Creating asynchronous onExit thread
[navigator_node-2] [INFO] [1697204323.676360224] [navigator]: [Orthogonal navigator::OrGlobalPlanner] OnExit
[navigator_node-2] [INFO] [1697204323.676369523] [navigator]: [Orthogonal navigator::OrHasControl] OnExit
[navigator_node-2] [INFO] [1697204323.676379803] [navigator]: [Orthogonal navigator::OrImageAnalyzer] OnExit
[navigator_node-2] [INFO] [1697204323.676390479] [navigator]: [Orthogonal navigator::OrPilotInterface] OnExit
[navigator_node-2] [WARN] [1697204323.676397824] [navigator]: exiting state: navigator::mission_executor_super_state::navigator::inner_states::StiNextPosState
[navigator_node-2] [INFO] [1697204323.676446241] [navigator]: Notification State Exit: leaving state 0x7f15742a3f80
[navigator_node-2] [INFO] [1697204323.676476765] [navigator]: [Orthogonal navigator::OrCameraInterface] OnExit
[navigator_node-2] [INFO] [1697204323.676504750] [navigator]: [Orthogonal navigator::OrDroneInterface] OnExit, current Behavior: navigator::cl_drone_interface::CbUpdateDroneData
[navigator_node-2] [INFO] [1697204323.676543594] [navigator]: [navigator::cl_drone_interface::CbUpdateDroneData] onExit - join async onEntry thread
[navigator_node-2] [INFO] [1697204323.676564944] [navigator]: [navigator::cl_drone_interface::CbUpdateDroneData] waiting future onEntryThread. It was not even created. Skipping wait.
[navigator_node-2] [INFO] [1697204323.676584911] [navigator]: [navigator::cl_drone_interface::CbUpdateDroneData] onExit - Creating asynchronous onExit thread
[navigator_node-2] [INFO] [1697204323.677270999] [navigator]:  [CbSendTargetPos] send_target_pos OnExit()
[navigator_node-2] [INFO] [1697204323.677305503] [navigator]: [navigator::cl_drone_interface::CbSendTargetPos] asynchronous onExit done.
[navigator_node-2] [INFO] [1697204323.677533443] [navigator]:  [CbUpdateDroneData] updatedrone OnExit()
[navigator_node-2] [INFO] [1697204323.677595558] [navigator]: [navigator::cl_drone_interface::CbUpdateDroneData] asynchronous onExit done.
[navigator_node-2] [INFO] [1697204323.677658614] [navigator]:  [CbUpdateDroneData] updatedrone OnExit()
[navigator_node-2] [INFO] [1697204323.677659283] [navigator]: [Orthogonal navigator::OrDroneInterface] OnExit, current Behavior: navigator::cl_drone_interface::CbTargetPosChecker
[navigator_node-2] [INFO] [1697204323.677703578] [navigator]: [navigator::cl_drone_interface::CbUpdateDroneData] asynchronous onExit done.
[navigator_node-2] [INFO] [1697204323.677725294] [navigator]: [navigator::cl_drone_interface::CbTargetPosChecker] onExit - join async onEntry thread
[navigator_node-2] [INFO] [1697204323.677782423] [navigator]: [navigator::cl_drone_interface::CbTargetPosChecker] waiting future onEntryThread. It was not even created. Skipping wait.
[navigator_node-2] [INFO] [1697204323.677788713] [navigator]: [navigator::cl_drone_interface::CbTargetPosChecker] onExit - Creating asynchronous onExit thread
[navigator_node-2] [INFO] [1697204323.677835327] [navigator]:  [CbTargetPosChecker] target_pos_checker OnExit()
[navigator_node-2] [INFO] [1697204323.875820690] [navigator]: [navigator::cl_drone_interface::CbTargetPosChecker] asynchronous onExit done.
[navigator_node-2] [INFO] [1697204323.875963703] [navigator]: [Orthogonal navigator::OrDroneInterface] OnExit, current Behavior: navigator::cl_drone_interface::CbSendTargetPos
[navigator_node-2] [INFO] [1697204323.876013385] [navigator]: [navigator::cl_drone_interface::CbSendTargetPos] onExit - join async onEntry thread
[navigator_node-2] [INFO] [1697204323.876031066] [navigator]: [navigator::cl_drone_interface::CbSendTargetPos] waiting future onEntryThread. It was not even created. Skipping wait.
[navigator_node-2] [INFO] [1697204323.876043312] [navigator]: [navigator::cl_drone_interface::CbSendTargetPos] onExit - Creating asynchronous onExit thread
[navigator_node-2] [INFO] [1697204323.876083438] [navigator]: [Orthogonal navigator::OrGlobalPlanner] OnExit
[navigator_node-2] [INFO] [1697204323.876087882] [navigator]:  [CbSendTargetPos] send_target_pos OnExit()
[navigator_node-2] [INFO] [1697204323.876094715] [navigator]: [Orthogonal navigator::OrHasControl] OnExit
[navigator_node-2] [INFO] [1697204323.876140073] [navigator]: [Orthogonal navigator::OrImageAnalyzer] OnExit
[navigator_node-2] [INFO] [1697204323.876149662] [navigator]: [Orthogonal navigator::OrPilotInterface] OnExit
[navigator_node-2] [INFO] [1697204323.876132445] [navigator]: [navigator::cl_drone_interface::CbSendTargetPos] asynchronous onExit done.
[navigator_node-2] [INFO] [1697204323.876161235] [StiNextPosState]: [NEXT_POS] onEXIT() next_pos inner state
[navigator_node-2] [WARN] [1697204323.876179130] [navigator]: exiting state: navigator::mission_executor_super_state::navigator::inner_states::StiNextPosState
[navigator_node-2] [INFO] [1697204323.876190837] [navigator]: Notification State Disposing: leaving state0x7f15742a3f80
[navigator_node-2] [INFO] [1697204323.876204131] [navigator]: [SmaccSignals] object signal disconnecting 139730120130400
[navigator_node-2] [INFO] [1697204323.876213868] [navigator]: [SmaccSignals] no signals found 139730120130400

┆Issue is synchronized with this Jira Task by Unito

yassiezar commented 1 year ago

Just a couple of questions: how many SmaccSignals do you have running while you're in the state you're transitioning from (including the outer state)? E.g., ROS topic subscriber/publisher callbacks, etc. Also, does the state machine get stuck in the same state on every run? I.e. is it a 'reliable' failure?

It looks like the orthogonals have exited, but its not progressing to where it actually triggers a state exit. Basically, you seem stuck at this line. Are you able to download SMACC2's source and compile it yourself? You can then add a breakpoint to the region where the state machine is getting stuck at and get a clearer picture of what's happening in the inside and exactly where its hanging up.

You can also try temporarily moving the troublesome states out of the outer state (i.e. remove StiNextPosState and StiCaptureState from mission_executor_state) and see if the transition takes place successfully? Its to check if an inner state transition bug we've observed and fixed before is making another appearance.

Crowdedlight commented 1 year ago

I have: 6 subscribers 2 publishers So a total of: 8

They are all in the StiNextPosState I got no ros-signals in the mission_executor_state.

It is the same state that gets halted every time. Although it's also the very first inner-state transition. So, I can't say if the following two inner-state transitions would also be stuck. It's been reproduced on 2 different computers running 22.04, ros2 Humble.

I will try and download the source and compile it myself today and see if I can figure out where it's getting stuck. And otherwise, also try to move the states out to see if it's an inner_state bug as you suggested.

Crowdedlight commented 1 year ago

Tried to move the states outside the super-state. So now it looks like this: image

I get same result.

It's using the cloned and self-compiled smacc2 on tag 2.3.18.

I tried to set a breakpoint and follow the flow to see where it halts. I didn't manage to find anything useful. I tried to follow the exit of the state from onExit() function and walk from there, but seemed like it followed the "normal exit" code parts and sent the signals it should.

I did manage to locate that the problem exists in one of my behaviours. If configure_orthogonal<OrDroneInterface, CbTargetPosChecker>(); is enabled in the state then it fails on triggering. Even when manually triggering the transition.

It's a behaviour I should have suspected earlier as it spawns a thread. However the exit and disposal works in the earlier version.

When I have a transition like this and manually publish the message it expects to trigger a transition it works fine, even with the behaviour mentioned above, enabled.

Transition<EvCbFailure<CbCheckControl, OrHasControl>, StiCaptureState, CONTINUELOOP>

However if I have the original transition that automatically gets triggered it fails and hangs as original

Transition<EvCbSuccess<CbTargetPosChecker, OrDroneInterface>, StiCaptureState, CONTINUELOOP>

I assume the update somehow means the way I have been checking and force-closing the thread no longer works. But at least I got it narrowed down where it happens. Now to figure out why. I'll continue looking into this, but seems like it's more a problem in my own code than smacc2, and the update to smacc2 just put it in focus. Although not sure if there have been changes to isShutdownRequested() in the update.

CbTargetPosChecker


#pragma once

#include <smacc2/smacc_asynchronous_client_behavior.hpp>
#include <navigator/clients/cl_drone_interface/cl_drone_interface.hpp>
#include <condition_variable>
#include <std_msgs/msg/float32.hpp>
#include "inspection_interfaces/msg/mission_info.hpp"
#include <navigator/navigator.hpp>
#include <rclcpp/rclcpp.hpp>
#include <navigator/LogHelper.hpp>
#include <navigator/Helpers.hpp>

namespace navigator {
namespace cl_drone_interface {

class CbTargetPosChecker : public smacc2::SmaccAsyncClientBehavior, LogHelper {
public:
  CbTargetPosChecker() = default;

  ~CbTargetPosChecker() override = default;

  // Log Function configs
  rclcpp::Logger getNodeLogger() { return this->getLogger(); }

  std::string getLogName() { return "CbTargetPosChecker"; }

  void onEntry() override {

    log_info("OnEntry, Checking pos with 5hz");
    // start thread that runs our function, as otherwise we never return and won't be able to change state...
    threadRun = std::thread(&CbTargetPosChecker::runThread, this);
  }

  void onExit() override {
    // close thread, so we don't hang behaviour
      log_info("target_pos_checker OnExit()");
    force_shutdown = true;
    if (threadRun.joinable())
      threadRun.join();
  }

private:
  void runThread() {
    rclcpp::Rate rate(5); // 10Hz
    while (!isShutdownRequested() && !force_shutdown) {
//      log_info("Is shutdown requested: " + std::to_string(isShutdownRequested()));
      this->checkDistanceToTarget();
      rate.sleep();
    }
  }

  void checkDistanceToTarget() {
    try {
        // check if gimbal reached gimbal pose. Threshold is 1deg
        auto at_orientation = PositionHelper::GimbalWithinThreshold(this->getLogger(),
                                                                    _gimbalPose, _targetGimbal.quaternion, 1);
        if (at_orientation) {
          force_shutdown = true;
          postSuccessEvent();
          log_info("Within threshold. Moving to Capture Image State");
        }
      }
    }
    catch (const std::exception &e) {
      log_info("TargetPosChecker experienced an exception and exited:");
      log_info(e.what());
      return;
    }
  }

//  bool triggered = false;
//  std::mutex mutex;
  bool force_shutdown = false;
  std::thread threadRun;
};
}
}
yassiezar commented 1 year ago

One of the changes that was brought in was improved handling of SmaccSignal disconnections to make sure that a state transition doesn't accidentally occur while the callback is running (useful for preventing a CB that owns a long-running processes on separate thread from being deleted too early). This might be related to the issue here, but there are a couple of other things to try first...

Can you confirm that the thread is being joined? I.e. is the if() statement in the onExit() being triggered? Btw you should probably make the force_shutdown flag atomic or lock it in some way, since you're accessing it from multiple threads. My gut feeling is that the thread isn't being shut down correctly which prevents the onExit() from completing. And since you're using an Async CB, this will prevent transitions from being completed. You can try swapping the Async CB for a normal CB - if it still hangs, then the thread seems to be blocking, otherwise there's something else blocking the Async CB from exiting and being deallocated.

Finally, if the GimbalWithinThreshold() isn't a long-running operation, you could consider is using the built-in Updateable client behaviours that SMACC2 supports. These clients let you define some function that's executed periodically and cleans itself up at exit, which seems like what you're trying to achieve here? You can see an example here.

To summarise:

  1. Make force_shutdown atomic
  2. Verify thread is being joined in onExit()
  3. Try sync CB instead of Async and see if CB exit is completing
  4. Consider using Updateable CB if it meets your needs
Crowdedlight commented 1 year ago

Thanks for the suggestions. You are absolutely right that I should have used atomic for the bool, was an oversight.

I changed bool to be atomic.

I tried adding debug prints in the onExit() to verify if thread is joined like this:

  void onExit() override {
    // close thread, so we don't hang behaviour
    force_shutdown = true;
    if (threadRun.joinable()) {
        log_info("onExit: Trying to join thread...");
        threadRun.join();
        log_info("onExit: Thread have been joined");
    }
  }

But as the image show then the statemachine still halts as originally.
image

I also added a debug print in the thread:

  void runThread() {
    rclcpp::Rate rate(5); // 10Hz
    while (!isShutdownRequested() && !force_shutdown) {
      this->checkDistanceToTarget();
      rate.sleep();
      log_info("runThread: Thread is running...");
    }
  }

I can confirm that it prints in the console as it's running up until the transition is triggered, then it stops and we get the output as in the image showing it joins the thread.

I tried to change the CB to sync by changing the following:

class CbTargetPosChecker : public smacc2::SmaccClientBehavior, LogHelper {
...
// event to trigger the transition
auto ev = new smacc2::EvCbSuccess<CbTargetPosChecker, OrDroneInterface>();
this->postEvent(ev);

I get the same output as in the earlier image. prints show thread has joined and transition is triggered. Halts and never goes into the next state. Transition table looks like:

  typedef mpl::list<
          Transition<EvCbSuccess<CbTargetPosChecker, OrDroneInterface>, StiCaptureState, CONTINUELOOP>
  > reactions;

  // STATE FUNCTIONS
  static void staticConfigure() {
    configure_orthogonal<OrHasControl, CbCheckControl>();
    configure_orthogonal<OrDroneInterface, CbUpdateDroneData>();
    configure_orthogonal<OrDroneInterface, CbTargetPosChecker>();
    configure_orthogonal<OrDroneInterface, CbSendTargetPos>();
  }

I did not know about the Updatebleclient behaviours. I will look into that. As the function is just one that checks the current position with the goal position, and triggers state transition if at the new position. So sounds like it would fit the usecase for the Updateable.

Crowdedlight commented 1 year ago

Trying to get Updatable to work. My Client behaviour looks like this. But when running it, it seems like the update() is never run. I never get any prints in the console, nor does it trigger transition as it did when running it in its own thread. Am I missing something specific to get the update function called?

class CbTargetPosChecker : public smacc2::SmaccAsyncClientBehavior, smacc2::ISmaccUpdatable, LogHelper {
public:
  CbTargetPosChecker() = default;

  ~CbTargetPosChecker() override = default;

  void onEntry() override {

    log_info("OnEntry, Checking pos with 5hz");
    // set period for 5hz, so 1s/5, as its less than 1s and we can only give in int second and int nano-second, we divide 1s in ns by 5
    this->setUpdatePeriod(rclcpp::Duration(0, 1000000000/5));
//    this->executeUpdate(this->getNode());
  }

  void onExit() override {}

  void update() override
  {
      log_info("Update: Running updatable function at 5hz...");
      this->checkDistanceToTarget();
  }
yassiezar commented 1 year ago

I get the same output as in the earlier image. prints show thread has joined and transition is triggered. Halts and never goes into the next state. Transition table looks like

I don't have access to my machine atm, but I'll look into this a bit more later today.

Regarding the Updatable() issue, have you tried using it with a syncronous CB instead of the Async one? Alternatively, can you add the Updatable constructor to your CB's constructor so that the run period is initialised during the CB construction process, rather than during transition and the async onEntry() execution

Crowdedlight commented 1 year ago

Regarding the Updatable() issue, have you tried using it with a syncronous CB instead of the Async one? Alternatively, can you add the Updatable constructor to your CB's constructor so that the run period is initialised during the CB construction process, rather than during transition and the async onEntry() execution

I tried both synchronous and async. I did however not try adding the updatable constructor to the CB constructor. I will try that on Monday when I am back in the office. (Got a field day tomorrow)

Crowdedlight commented 1 year ago

I tried with the duration in the constructor as you suggested, but without any luck:

class CbTargetPosChecker : public smacc2::SmaccClientBehavior, smacc2::ISmaccUpdatable, LogHelper {
public:
    CbTargetPosChecker() : ISmaccUpdatable(rclcpp::Duration(1,0)) {}

  void update() override
  {
      log_info("Update: Running updatable function at 5hz...");
      this->checkDistanceToTarget();
  }

During the OnEntry() I added a breakpoint to see if it had set the duration correctly. And it seems like it did. Set to 1 second. image

But I get no execution of the update function. No prints in console, and a breakpoint set in the update function is never hit.

Its like Updatable never get discovered.

When following the trace I can see that the update client is never found and added to the statemachines list of updatables. https://github.com/robosoft-ai/SMACC2/blob/cc5904016f7da6d1f174f387341bf76b5756c068/smacc2/src/smacc2/signal_detector.cpp#L88

But I am unsure if I am doing something wrong. Is my problem here that the client should also inherit from Updatable and not just the client behaviour? Although looking in reference library it appears that just adding Updatableon the client behaviour should be enough. image

My client is currently:

namespace navigator
{
namespace cl_drone_interface
{

class ClDroneInterface : public smacc2::ISmaccClient
{
public:
  ClDroneInterface()= default;

  ~ClDroneInterface() override= default;;
};
}  // namespace cl_image_analyzer
}  // namespace navigator

And the orthogonal

namespace navigator
{
    using namespace navigator::cl_drone_interface;

    class OrDroneInterface : public smacc2::Orthogonal<OrDroneInterface>
    {
    public:
        void onInitialize() override
        {
            // This needs client for talking with imaging node, and client to talk with operator
            auto drone_interface_client = this->createClient<ClDroneInterface>();
        }
    };
}  // namespace

With it configured in that state as:

  static void staticConfigure() {
    configure_orthogonal<OrDroneInterface, CbUpdateDroneData>();
    configure_orthogonal<OrDroneInterface, CbTargetPosChecker>();
  }
Crowdedlight commented 1 year ago

I managed to get it to run update by adding public to the inheritance of Updatable

class CbTargetPosChecker : public smacc2::SmaccClientBehavior, public smacc2::ISmaccUpdatable, LogHelper {

However I still get the halt on transition where it doesn't make it further as originally. Doesn't seem like there is any difference from transition problem from running my own thread or running the updatable.

image

Current behaviour:

namespace navigator {
namespace cl_drone_interface {

class CbTargetPosChecker : public smacc2::SmaccClientBehavior, public smacc2::ISmaccUpdatable, LogHelper {
public:
    CbTargetPosChecker() : ISmaccUpdatable(rclcpp::Duration(0,1000)) {}

    ~CbTargetPosChecker() override = default;

  void onEntry() override {

    log_info("OnEntry, Checking pos with 5hz");
  }

  void onExit() override { }

  void update() override
  {
      log_info("Update: Running updatable function at 5hz...");
      this->checkDistanceToTarget();
  }

private:

  void checkDistanceToTarget() {
    try {
        auto at_orientation = PositionHelper::GimbalWithinThreshold();
        if (at_orientation) {
          auto ev = new smacc2::EvCbSuccess<CbTargetPosChecker, OrDroneInterface>();
          this->postEvent(ev);
          log_info("Within position threshold. Moving to Capture Image State");
        }
      }
    }
    catch (const std::exception &e) {
      log_info("TargetPosChecker experienced an exception and exited:");
      log_info(e.what());
      return;
    }
  }
};
}
}
Crowdedlight commented 1 year ago

I don't know if it helps, but I found out that if I add a breakpoint here: https://github.com/robosoft-ai/SMACC2/blob/cc5904016f7da6d1f174f387341bf76b5756c068/smacc2/include/smacc2/impl/smacc_state_machine_impl.hpp#L660

And hit "resume" every time its triggered, then the transition into the next state works as expected. But if I don't "slow down the transition" by doing that and just letting it run, then it errors as originally and never make it into the next state. This was done with the exact same compiled code.

I am not knowledgeable enough yet to figure out where in SMACC2 this is caused. But it seems like there is some sort of race condition on the transition? I just can't figure out why it depends on if you are running a CB with thread/updateable...

EDIT:

Adding this to the behaviours OnExit() method is enough to let it keep going.

  void onExit() override {
      log_info("Exiting Target Pos Checker...");
      sleep(1);
  }
yassiezar commented 1 year ago

Sorry for the delay, I've finally had some time to look at this on my machine.

I originally suspected there might be some race condition or deallocation blockage when I saw you spawning a thread inside an async CB. However, you've done away with both now and you're still getting this issue

Looking at your code, I can't see an obvious reason why it won't work. Indeed, I put together a small example state machine that's very similar to your code that switches between 2 states with a timer and updatable and they switch as you'd expect (see my fork here).

Can you confirm that the CbTargetPosChecker client behaviour is the only CB active in the state that won't exit? I ask because in your original diagram you have 3 CB's active and it looks like there's something running in the background that's preventing the state from exiting.

Crowdedlight commented 1 year ago

It seems like it happens if I have a CB with a thread in it, and at the same time its started, it gets told to change state, then it have a hard time exiting properly.

You were right that I forgot I also had the CB SendTargetPos initially starting an async-thread like TargetPosChecker before I changed it to updatable, the thread does get stopped after a few seconds by itself as it just waits for a confirmation reply. But seems to be enough to halt it if you transition as soon as you enter the state.

Currently as soon as it goes into sti_next_pos the first time it's already at the initial target position, so it fulfils the transition conditions and tries to change state immediately. Which seems to be too fast for it to clean up properly.

When I make it fly to an initial point a bit further away, there is 2-4s between entering sti_next_pos and it hitting the state transition, which has the same function as the sleep, and I avoid the error.

I did the following test.
I have the TargetPosChecker CB with this in OnEntry():

  void onEntry() override {
    log_info("OnEntry, Checking pos every 5hz");
    auto ev = new smacc2::EvCbSuccess<CbTargetPosChecker, OrDroneInterface>();
    this->postEvent(ev);
  }

I then have another CB SendTargetPos (simplified):

namespace navigator {
namespace cl_drone_interface {

class CbSendTargetPos : public smacc2::SmaccAsyncClientBehavior, LogHelper {
public:
  CbSendTargetPos() = default;
  ~CbSendTargetPos() override = default;

  // Log Function configs
  rclcpp::Logger getNodeLogger() { return this->getLogger(); }
  std::string getLogName() { return "CbSendTargetPos"; }

  void onExit() override {
    // close thread, so we don't hang behaviour
    force_shutdown = true;
    if (threadRun.joinable())
      threadRun.join();

    // unsub subs/pubs
    _sub_mavros_setpoint.reset();
    _pub_position_target.reset();
    _pub_gimbal.reset();
  }

  void onEntry() override {
    // get current data from datastore
    DataStore *_ds;
    this->getStateMachine()->getGlobalSMData("datastore", _ds);

    auto _plan = _ds->getCurrentMissionPlan();
    auto _index = _ds->getCurrentMissionIndex();

    // get target pos
    auto _targetPos = _plan.plan[_index];
    _msg.position = _targetPos.pose.pose.position;

    // start thread that runs our function, as otherwise we never return and won't be able to change state...
    threadRun = std::thread(&CbSendTargetPos::runThread, this);
  }

private:
  void runThread() {
    rclcpp::Rate rate(1); // 1Hz
    // remember to check the shutdown request, so we don't block state transitions
    while (!isShutdownRequested() && !force_shutdown) {
      _pub_position_target->publish(_msg);
      rate.sleep();
    }
  }
  void onMessageSetpointRaw(const mavros_msgs::msg::PositionTarget::SharedPtr msg) {
    // validate if the current published setpoint is == to our requested setpoint.
    //  A hacky way to ensure this node gets closed, but not before the setpoint has been accepted
    if (msg->position == _msg.position) {
      force_shutdown = true;
      _sub_mavros_setpoint.reset();
    }
  }

  bool force_shutdown = false;
  std::thread threadRun;
  interfaces::msg::LarkePositionTarget _msg;
  rclcpp::Subscription<mavros_msgs::msg::PositionTarget>::SharedPtr _sub_mavros_setpoint;
  rclcpp::Publisher<interfaces::msg::LarkePositionTarget>::SharedPtr _pub_position_target;
  rclcpp::Publisher<interfaces::msg::GimbalAttitude>::SharedPtr _pub_gimbal;
};
}
}

This does not work, as it seems to get hung up on the second CB here not being properly disposed. The spawned thread is joined and stopped, as breakpoints inside it doesn't get triggered after the transition is triggered, but the state change still halts as originally.

However if I add a sleep in first CB on exit:

  void onEntry() override {
    log_info("OnEntry, Checking pos every 5hz");
    auto ev = new smacc2::EvCbSuccess<CbTargetPosChecker, OrDroneInterface>();
    this->postEvent(ev);
  }
    void onExit() override {
      log_info("Exiting Target Pos Checker...");
      sleep(1);
  }

Then it works fine. So I think part of the problem is an instant state-change causing some kind of race condition between the CBs.

Crowdedlight commented 1 year ago

Further testing.

Having the second CB look like this will halt transition, as long as the first CB trigger transition before this OnEntry() has returned:

class CbSendTargetPos : public smacc2::SmaccClientBehavior, public smacc2::ISmaccUpdatable, LogHelper {
public:
    CbSendTargetPos() : ISmaccUpdatable(rclcpp::Duration(0,UPDATE_RATE)) {}
  ~CbSendTargetPos() override = default;

  void onExit() override { }

  void onEntry() override {
    sleep(10);
    return;
  }

However removing the sleep and it works fine. So I think what happens is that the transition breaks if the transition is called before the onEntry() has finished. As I do some light work in OnEntry() of creating a ros_message and saving it to be published on each update. The creation fetches some data from my datastore, and while it takes much less than a second to execute, it might be slow enough that an instant state-change will happen before it returns.

I can try to see if I can make a reproducable sample on the example you linked with two simple CB where one trigger transition before the other returns its OnEntry() function. I will have a look on Monday when I am back at my computer.

Crowdedlight commented 1 year ago

But it solved my issue here and now. As my SendTargetPos CB is subscribing to a message that would make it exit itself anyway as soon as the position it transmitted was received, so having this on the topic subscription callback to stop the Updatable function:

  void onMessageSetpointRaw(const mavros_msgs::msg::PositionTarget::SharedPtr msg) {
    // validate if the current published setpoint is == to our requested setpoint.
    if (msg->position == _msg.position) {
      this->dispose();
    }
  }

Meant that even in the case where it would halt and not get disposed because a state change happens before it exists onEntry() it would receive this message at the same time and dispose itself, and then the state-change would continue.

So it solves my problem here and now, but as the original problem came with the version update, I think there is still something messing with disposing states, if they haven't exited their onEntry() at the time, which is worth looking at.

Also Thanks for the all the help and getting me on the right track! Made a big difference trying to figure out how to debug what was going on! :)

pabloinigoblasco commented 1 year ago

Hello. Would it be possible to simplify the code and share it as an small example? In that way I could try to reproduce that error around and diagnose what it is happening.

Crowdedlight commented 1 year ago

Hello Pablo

I tried to make a very simple version of inner states where one behaviour would sleep in onEntry while the other would finish. And that still seemed to work as it would change state as soon as the sleep was over. So it seems like I might do something in the behaviour that influences this error, although I am not quite sure what yet.

I will try to see if I can make an example closer to the original code that can reproduce it next week. Or see if I can narrow down if there is a particular subscriber/publisher/datastore that is causing this.

brettpac commented 8 months ago

Hi @Crowdedlight, I'm closing the ticket. Lemme know if you're still having issues like this and we can reopen it. Thank you,

Crowdedlight commented 8 months ago

Hi @Crowdedlight, I'm closing the ticket. Lemme know if you're still having issues like this and we can reopen it. Thank you,

Sorry for not getting back. I ended up moving the inner states out of the super-state and that seemed to be more robust. I spent a day trying to make a minimal reproduction example, but couldn't get the error to appear reliably on that.

So in the end assumed we must have done something wrong or some kind of very odd circumstance that caused the initial error and some of the developments and refactors done since having helped remove the issue.

Keep the issue closed. If I revisit my earlier attempt and manage to get a reproducible example at some point I will get back to you, but for now, it seems to work fine for us. :)

And thanks for the help in diagnosing and troubleshooting it!