Closed Crowdedlight closed 8 months ago
Just a couple of questions: how many SmaccSignals
do you have running while you're in the state you're transitioning from (including the outer state)? E.g., ROS topic subscriber/publisher callbacks, etc. Also, does the state machine get stuck in the same state on every run? I.e. is it a 'reliable' failure?
It looks like the orthogonals have exited, but its not progressing to where it actually triggers a state exit. Basically, you seem stuck at this line. Are you able to download SMACC2's source and compile it yourself? You can then add a breakpoint to the region where the state machine is getting stuck at and get a clearer picture of what's happening in the inside and exactly where its hanging up.
You can also try temporarily moving the troublesome states out of the outer state (i.e. remove StiNextPosState
and StiCaptureState
from mission_executor_state
) and see if the transition takes place successfully? Its to check if an inner state transition bug we've observed and fixed before is making another appearance.
I have: 6 subscribers 2 publishers So a total of: 8
They are all in the StiNextPosState
I got no ros-signals in the mission_executor_state
.
It is the same state that gets halted every time. Although it's also the very first inner-state transition. So, I can't say if the following two inner-state transitions would also be stuck. It's been reproduced on 2 different computers running 22.04, ros2 Humble.
I will try and download the source and compile it myself today and see if I can figure out where it's getting stuck. And otherwise, also try to move the states out to see if it's an inner_state bug as you suggested.
Tried to move the states outside the super-state. So now it looks like this:
I get same result.
It's using the cloned and self-compiled smacc2 on tag 2.3.18.
I tried to set a breakpoint and follow the flow to see where it halts. I didn't manage to find anything useful. I tried to follow the exit of the state from onExit()
function and walk from there, but seemed like it followed the "normal exit" code parts and sent the signals it should.
I did manage to locate that the problem exists in one of my behaviours. If configure_orthogonal<OrDroneInterface, CbTargetPosChecker>();
is enabled in the state then it fails on triggering. Even when manually triggering the transition.
It's a behaviour I should have suspected earlier as it spawns a thread. However the exit and disposal works in the earlier version.
When I have a transition like this and manually publish the message it expects to trigger a transition it works fine, even with the behaviour mentioned above, enabled.
Transition<EvCbFailure<CbCheckControl, OrHasControl>, StiCaptureState, CONTINUELOOP>
However if I have the original transition that automatically gets triggered it fails and hangs as original
Transition<EvCbSuccess<CbTargetPosChecker, OrDroneInterface>, StiCaptureState, CONTINUELOOP>
I assume the update somehow means the way I have been checking and force-closing the thread no longer works. But at least I got it narrowed down where it happens. Now to figure out why. I'll continue looking into this, but seems like it's more a problem in my own code than smacc2, and the update to smacc2 just put it in focus. Although not sure if there have been changes to isShutdownRequested()
in the update.
CbTargetPosChecker
#pragma once
#include <smacc2/smacc_asynchronous_client_behavior.hpp>
#include <navigator/clients/cl_drone_interface/cl_drone_interface.hpp>
#include <condition_variable>
#include <std_msgs/msg/float32.hpp>
#include "inspection_interfaces/msg/mission_info.hpp"
#include <navigator/navigator.hpp>
#include <rclcpp/rclcpp.hpp>
#include <navigator/LogHelper.hpp>
#include <navigator/Helpers.hpp>
namespace navigator {
namespace cl_drone_interface {
class CbTargetPosChecker : public smacc2::SmaccAsyncClientBehavior, LogHelper {
public:
CbTargetPosChecker() = default;
~CbTargetPosChecker() override = default;
// Log Function configs
rclcpp::Logger getNodeLogger() { return this->getLogger(); }
std::string getLogName() { return "CbTargetPosChecker"; }
void onEntry() override {
log_info("OnEntry, Checking pos with 5hz");
// start thread that runs our function, as otherwise we never return and won't be able to change state...
threadRun = std::thread(&CbTargetPosChecker::runThread, this);
}
void onExit() override {
// close thread, so we don't hang behaviour
log_info("target_pos_checker OnExit()");
force_shutdown = true;
if (threadRun.joinable())
threadRun.join();
}
private:
void runThread() {
rclcpp::Rate rate(5); // 10Hz
while (!isShutdownRequested() && !force_shutdown) {
// log_info("Is shutdown requested: " + std::to_string(isShutdownRequested()));
this->checkDistanceToTarget();
rate.sleep();
}
}
void checkDistanceToTarget() {
try {
// check if gimbal reached gimbal pose. Threshold is 1deg
auto at_orientation = PositionHelper::GimbalWithinThreshold(this->getLogger(),
_gimbalPose, _targetGimbal.quaternion, 1);
if (at_orientation) {
force_shutdown = true;
postSuccessEvent();
log_info("Within threshold. Moving to Capture Image State");
}
}
}
catch (const std::exception &e) {
log_info("TargetPosChecker experienced an exception and exited:");
log_info(e.what());
return;
}
}
// bool triggered = false;
// std::mutex mutex;
bool force_shutdown = false;
std::thread threadRun;
};
}
}
One of the changes that was brought in was improved handling of SmaccSignal disconnections to make sure that a state transition doesn't accidentally occur while the callback is running (useful for preventing a CB that owns a long-running processes on separate thread from being deleted too early). This might be related to the issue here, but there are a couple of other things to try first...
Can you confirm that the thread is being joined? I.e. is the if()
statement in the onExit()
being triggered? Btw you should probably make the force_shutdown
flag atomic or lock it in some way, since you're accessing it from multiple threads. My gut feeling is that the thread isn't being shut down correctly which prevents the onExit()
from completing. And since you're using an Async CB, this will prevent transitions from being completed. You can try swapping the Async CB for a normal CB - if it still hangs, then the thread seems to be blocking, otherwise there's something else blocking the Async CB from exiting and being deallocated.
Finally, if the GimbalWithinThreshold()
isn't a long-running operation, you could consider is using the built-in Updateable
client behaviours that SMACC2 supports. These clients let you define some function that's executed periodically and cleans itself up at exit, which seems like what you're trying to achieve here? You can see an example here.
To summarise:
force_shutdown
atomiconExit()
Updateable
CB if it meets your needsThanks for the suggestions. You are absolutely right that I should have used atomic for the bool, was an oversight.
I changed bool to be atomic.
I tried adding debug prints in the onExit()
to verify if thread is joined like this:
void onExit() override {
// close thread, so we don't hang behaviour
force_shutdown = true;
if (threadRun.joinable()) {
log_info("onExit: Trying to join thread...");
threadRun.join();
log_info("onExit: Thread have been joined");
}
}
But as the image show then the statemachine still halts as originally.
I also added a debug print in the thread:
void runThread() {
rclcpp::Rate rate(5); // 10Hz
while (!isShutdownRequested() && !force_shutdown) {
this->checkDistanceToTarget();
rate.sleep();
log_info("runThread: Thread is running...");
}
}
I can confirm that it prints in the console as it's running up until the transition is triggered, then it stops and we get the output as in the image showing it joins the thread.
I tried to change the CB to sync by changing the following:
class CbTargetPosChecker : public smacc2::SmaccClientBehavior, LogHelper {
...
// event to trigger the transition
auto ev = new smacc2::EvCbSuccess<CbTargetPosChecker, OrDroneInterface>();
this->postEvent(ev);
I get the same output as in the earlier image. prints show thread has joined and transition is triggered. Halts and never goes into the next state. Transition table looks like:
typedef mpl::list<
Transition<EvCbSuccess<CbTargetPosChecker, OrDroneInterface>, StiCaptureState, CONTINUELOOP>
> reactions;
// STATE FUNCTIONS
static void staticConfigure() {
configure_orthogonal<OrHasControl, CbCheckControl>();
configure_orthogonal<OrDroneInterface, CbUpdateDroneData>();
configure_orthogonal<OrDroneInterface, CbTargetPosChecker>();
configure_orthogonal<OrDroneInterface, CbSendTargetPos>();
}
I did not know about the Updateble
client behaviours. I will look into that. As the function is just one that checks the current position with the goal position, and triggers state transition if at the new position. So sounds like it would fit the usecase for the Updateable
.
Trying to get Updatable
to work. My Client behaviour looks like this. But when running it, it seems like the update()
is never run. I never get any prints in the console, nor does it trigger transition as it did when running it in its own thread. Am I missing something specific to get the update function called?
class CbTargetPosChecker : public smacc2::SmaccAsyncClientBehavior, smacc2::ISmaccUpdatable, LogHelper {
public:
CbTargetPosChecker() = default;
~CbTargetPosChecker() override = default;
void onEntry() override {
log_info("OnEntry, Checking pos with 5hz");
// set period for 5hz, so 1s/5, as its less than 1s and we can only give in int second and int nano-second, we divide 1s in ns by 5
this->setUpdatePeriod(rclcpp::Duration(0, 1000000000/5));
// this->executeUpdate(this->getNode());
}
void onExit() override {}
void update() override
{
log_info("Update: Running updatable function at 5hz...");
this->checkDistanceToTarget();
}
I get the same output as in the earlier image. prints show thread has joined and transition is triggered. Halts and never goes into the next state. Transition table looks like
I don't have access to my machine atm, but I'll look into this a bit more later today.
Regarding the Updatable()
issue, have you tried using it with a syncronous CB instead of the Async one? Alternatively, can you add the Updatable
constructor to your CB's constructor so that the run period is initialised during the CB construction process, rather than during transition and the async onEntry()
execution
Regarding the
Updatable()
issue, have you tried using it with a syncronous CB instead of the Async one? Alternatively, can you add theUpdatable
constructor to your CB's constructor so that the run period is initialised during the CB construction process, rather than during transition and the asynconEntry()
execution
I tried both synchronous and async. I did however not try adding the updatable constructor to the CB constructor. I will try that on Monday when I am back in the office. (Got a field day tomorrow)
I tried with the duration in the constructor as you suggested, but without any luck:
class CbTargetPosChecker : public smacc2::SmaccClientBehavior, smacc2::ISmaccUpdatable, LogHelper {
public:
CbTargetPosChecker() : ISmaccUpdatable(rclcpp::Duration(1,0)) {}
void update() override
{
log_info("Update: Running updatable function at 5hz...");
this->checkDistanceToTarget();
}
During the OnEntry()
I added a breakpoint to see if it had set the duration correctly. And it seems like it did. Set to 1 second.
But I get no execution of the update
function. No prints in console, and a breakpoint set in the update function is never hit.
Its like Updatable
never get discovered.
When following the trace I can see that the update client is never found and added to the statemachines list of updatables. https://github.com/robosoft-ai/SMACC2/blob/cc5904016f7da6d1f174f387341bf76b5756c068/smacc2/src/smacc2/signal_detector.cpp#L88
But I am unsure if I am doing something wrong. Is my problem here that the client should also inherit from Updatable and not just the client behaviour? Although looking in reference library it appears that just adding Updatable
on the client behaviour should be enough.
My client is currently:
namespace navigator
{
namespace cl_drone_interface
{
class ClDroneInterface : public smacc2::ISmaccClient
{
public:
ClDroneInterface()= default;
~ClDroneInterface() override= default;;
};
} // namespace cl_image_analyzer
} // namespace navigator
And the orthogonal
namespace navigator
{
using namespace navigator::cl_drone_interface;
class OrDroneInterface : public smacc2::Orthogonal<OrDroneInterface>
{
public:
void onInitialize() override
{
// This needs client for talking with imaging node, and client to talk with operator
auto drone_interface_client = this->createClient<ClDroneInterface>();
}
};
} // namespace
With it configured in that state as:
static void staticConfigure() {
configure_orthogonal<OrDroneInterface, CbUpdateDroneData>();
configure_orthogonal<OrDroneInterface, CbTargetPosChecker>();
}
I managed to get it to run update by adding public to the inheritance of Updatable
class CbTargetPosChecker : public smacc2::SmaccClientBehavior, public smacc2::ISmaccUpdatable, LogHelper {
However I still get the halt on transition where it doesn't make it further as originally. Doesn't seem like there is any difference from transition problem from running my own thread or running the updatable
.
Current behaviour:
namespace navigator {
namespace cl_drone_interface {
class CbTargetPosChecker : public smacc2::SmaccClientBehavior, public smacc2::ISmaccUpdatable, LogHelper {
public:
CbTargetPosChecker() : ISmaccUpdatable(rclcpp::Duration(0,1000)) {}
~CbTargetPosChecker() override = default;
void onEntry() override {
log_info("OnEntry, Checking pos with 5hz");
}
void onExit() override { }
void update() override
{
log_info("Update: Running updatable function at 5hz...");
this->checkDistanceToTarget();
}
private:
void checkDistanceToTarget() {
try {
auto at_orientation = PositionHelper::GimbalWithinThreshold();
if (at_orientation) {
auto ev = new smacc2::EvCbSuccess<CbTargetPosChecker, OrDroneInterface>();
this->postEvent(ev);
log_info("Within position threshold. Moving to Capture Image State");
}
}
}
catch (const std::exception &e) {
log_info("TargetPosChecker experienced an exception and exited:");
log_info(e.what());
return;
}
}
};
}
}
I don't know if it helps, but I found out that if I add a breakpoint here: https://github.com/robosoft-ai/SMACC2/blob/cc5904016f7da6d1f174f387341bf76b5756c068/smacc2/include/smacc2/impl/smacc_state_machine_impl.hpp#L660
And hit "resume" every time its triggered, then the transition into the next state works as expected. But if I don't "slow down the transition" by doing that and just letting it run, then it errors as originally and never make it into the next state. This was done with the exact same compiled code.
I am not knowledgeable enough yet to figure out where in SMACC2 this is caused. But it seems like there is some sort of race condition on the transition? I just can't figure out why it depends on if you are running a CB with thread/updateable...
EDIT:
Adding this to the behaviours OnExit()
method is enough to let it keep going.
void onExit() override {
log_info("Exiting Target Pos Checker...");
sleep(1);
}
Sorry for the delay, I've finally had some time to look at this on my machine.
I originally suspected there might be some race condition or deallocation blockage when I saw you spawning a thread inside an async CB. However, you've done away with both now and you're still getting this issue
Looking at your code, I can't see an obvious reason why it won't work. Indeed, I put together a small example state machine that's very similar to your code that switches between 2 states with a timer and updatable and they switch as you'd expect (see my fork here).
Can you confirm that the CbTargetPosChecker
client behaviour is the only CB active in the state that won't exit? I ask because in your original diagram you have 3 CB's active and it looks like there's something running in the background that's preventing the state from exiting.
It seems like it happens if I have a CB with a thread in it, and at the same time its started, it gets told to change state, then it have a hard time exiting properly.
You were right that I forgot I also had the CB SendTargetPos
initially starting an async-thread like TargetPosChecker
before I changed it to updatable
, the thread does get stopped after a few seconds by itself as it just waits for a confirmation reply. But seems to be enough to halt it if you transition as soon as you enter the state.
Currently as soon as it goes into sti_next_pos
the first time it's already at the initial target position, so it fulfils the transition conditions and tries to change state immediately. Which seems to be too fast for it to clean up properly.
When I make it fly to an initial point a bit further away, there is 2-4s between entering sti_next_pos
and it hitting the state transition, which has the same function as the sleep, and I avoid the error.
I did the following test.
I have the TargetPosChecker
CB with this in OnEntry()
:
void onEntry() override {
log_info("OnEntry, Checking pos every 5hz");
auto ev = new smacc2::EvCbSuccess<CbTargetPosChecker, OrDroneInterface>();
this->postEvent(ev);
}
I then have another CB SendTargetPos
(simplified):
namespace navigator {
namespace cl_drone_interface {
class CbSendTargetPos : public smacc2::SmaccAsyncClientBehavior, LogHelper {
public:
CbSendTargetPos() = default;
~CbSendTargetPos() override = default;
// Log Function configs
rclcpp::Logger getNodeLogger() { return this->getLogger(); }
std::string getLogName() { return "CbSendTargetPos"; }
void onExit() override {
// close thread, so we don't hang behaviour
force_shutdown = true;
if (threadRun.joinable())
threadRun.join();
// unsub subs/pubs
_sub_mavros_setpoint.reset();
_pub_position_target.reset();
_pub_gimbal.reset();
}
void onEntry() override {
// get current data from datastore
DataStore *_ds;
this->getStateMachine()->getGlobalSMData("datastore", _ds);
auto _plan = _ds->getCurrentMissionPlan();
auto _index = _ds->getCurrentMissionIndex();
// get target pos
auto _targetPos = _plan.plan[_index];
_msg.position = _targetPos.pose.pose.position;
// start thread that runs our function, as otherwise we never return and won't be able to change state...
threadRun = std::thread(&CbSendTargetPos::runThread, this);
}
private:
void runThread() {
rclcpp::Rate rate(1); // 1Hz
// remember to check the shutdown request, so we don't block state transitions
while (!isShutdownRequested() && !force_shutdown) {
_pub_position_target->publish(_msg);
rate.sleep();
}
}
void onMessageSetpointRaw(const mavros_msgs::msg::PositionTarget::SharedPtr msg) {
// validate if the current published setpoint is == to our requested setpoint.
// A hacky way to ensure this node gets closed, but not before the setpoint has been accepted
if (msg->position == _msg.position) {
force_shutdown = true;
_sub_mavros_setpoint.reset();
}
}
bool force_shutdown = false;
std::thread threadRun;
interfaces::msg::LarkePositionTarget _msg;
rclcpp::Subscription<mavros_msgs::msg::PositionTarget>::SharedPtr _sub_mavros_setpoint;
rclcpp::Publisher<interfaces::msg::LarkePositionTarget>::SharedPtr _pub_position_target;
rclcpp::Publisher<interfaces::msg::GimbalAttitude>::SharedPtr _pub_gimbal;
};
}
}
This does not work, as it seems to get hung up on the second CB here not being properly disposed. The spawned thread is joined and stopped, as breakpoints inside it doesn't get triggered after the transition is triggered, but the state change still halts as originally.
However if I add a sleep in first CB on exit:
void onEntry() override {
log_info("OnEntry, Checking pos every 5hz");
auto ev = new smacc2::EvCbSuccess<CbTargetPosChecker, OrDroneInterface>();
this->postEvent(ev);
}
void onExit() override {
log_info("Exiting Target Pos Checker...");
sleep(1);
}
Then it works fine. So I think part of the problem is an instant state-change causing some kind of race condition between the CBs.
Further testing.
Having the second CB look like this will halt transition, as long as the first CB trigger transition before this OnEntry()
has returned:
class CbSendTargetPos : public smacc2::SmaccClientBehavior, public smacc2::ISmaccUpdatable, LogHelper {
public:
CbSendTargetPos() : ISmaccUpdatable(rclcpp::Duration(0,UPDATE_RATE)) {}
~CbSendTargetPos() override = default;
void onExit() override { }
void onEntry() override {
sleep(10);
return;
}
However removing the sleep and it works fine. So I think what happens is that the transition breaks if the transition is called before the onEntry()
has finished. As I do some light work in OnEntry()
of creating a ros_message and saving it to be published on each update. The creation fetches some data from my datastore, and while it takes much less than a second to execute, it might be slow enough that an instant state-change will happen before it returns.
I can try to see if I can make a reproducable sample on the example you linked with two simple CB where one trigger transition before the other returns its OnEntry()
function. I will have a look on Monday when I am back at my computer.
But it solved my issue here and now. As my SendTargetPos
CB is subscribing to a message that would make it exit itself anyway as soon as the position it transmitted was received, so having this on the topic subscription callback to stop the Updatable
function:
void onMessageSetpointRaw(const mavros_msgs::msg::PositionTarget::SharedPtr msg) {
// validate if the current published setpoint is == to our requested setpoint.
if (msg->position == _msg.position) {
this->dispose();
}
}
Meant that even in the case where it would halt and not get disposed because a state change happens before it exists onEntry()
it would receive this message at the same time and dispose itself, and then the state-change would continue.
So it solves my problem here and now, but as the original problem came with the version update, I think there is still something messing with disposing states, if they haven't exited their onEntry()
at the time, which is worth looking at.
Also Thanks for the all the help and getting me on the right track! Made a big difference trying to figure out how to debug what was going on! :)
Hello. Would it be possible to simplify the code and share it as an small example? In that way I could try to reproduce that error around and diagnose what it is happening.
Hello Pablo
I tried to make a very simple version of inner states where one behaviour would sleep in onEntry
while the other would finish. And that still seemed to work as it would change state as soon as the sleep was over. So it seems like I might do something in the behaviour that influences this error, although I am not quite sure what yet.
I will try to see if I can make an example closer to the original code that can reproduce it next week. Or see if I can narrow down if there is a particular subscriber/publisher/datastore that is causing this.
Hi @Crowdedlight, I'm closing the ticket. Lemme know if you're still having issues like this and we can reopen it. Thank you,
Hi @Crowdedlight, I'm closing the ticket. Lemme know if you're still having issues like this and we can reopen it. Thank you,
Sorry for not getting back. I ended up moving the inner states out of the super-state and that seemed to be more robust. I spent a day trying to make a minimal reproduction example, but couldn't get the error to appear reliably on that.
So in the end assumed we must have done something wrong or some kind of very odd circumstance that caused the initial error and some of the developments and refactors done since having helped remove the issue.
Keep the issue closed. If I revisit my earlier attempt and manage to get a reproducible example at some point I will get back to you, but for now, it seems to work fine for us. :)
And thanks for the help in diagnosing and troubleshooting it!
Describe the bug After updating smacc2 to 2.3.18 my state machine no longer transitions into the next inner state in a superstate. The transition rule is triggered and the RTA also confirm the trigger is executed, but the next state is never created and state machine just comes to a halt with no further output.
Any suggestions as to how I debug what is causing it?
I have tried to comment out all orthogonals in the next state, but it has no change. I tried to find a changelog for the update 0.4.0 to 2.3.18 and see if something changed for inner-states in a super-state but didn't find any. This problem is not present in 0.4.0.
The state machine goes through 2-3 "normal" states before it gets to the superstate, and those work as usual.
Added image of the super-state and events from RTA.
Environment (please complete the following information):
ROS DETAILS:
REPO DETAILS: smacc2 2.3.18 installed through ubuntu PPA
apt install ros-humble-smacc2
Error Details
┆Issue is synchronized with this Jira Task by Unito