Joint state controller throws exception when start time is less than update rate

01binary commented 6 months ago

The joint state controller in joint_state_controller.cpp makes an assumption in its JointStateController::starting(const ros::Time& time) method that the time value passed to this function is always greater than its publish interval, which is set in Hz.

last_publish_time_ = time - ros::Duration(1.001/publish_rate_);

If the time passed into this function is less than 0.02 seconds for a 50 Hz publish rate, for example, this will result in last_publish_time being negative after the subtraction, which would throw a std::runtime_error exception causing the joint state controller to die:

terminate called after throwing an instance of 'std::runtime_error'
what():  Time is out of dual 32-bit range

The exception is thrown because the time in ROS is represented as an unsigned value, which would cause it wrap to the maximum possible value of the number type if a negative value was set.

Proposed fix

If we calculate the publish interval...

auto interval = ros::Duration(1.001/publish_rate_);

...and compare to time before subtracting...

if (time.toSec() > interval.toSec())

Then we can set the last_publish_time to 0 if the interval is larger than time, or set to their difference as before if the time is greater than interval. I see that Dave Coleman also recommends setting last update time to this in ROS Controllers Boilerplate:

// Get current time for use with first update
 clock_gettime(CLOCK_MONOTONIC, &last_time_);

I don't know what this means yet, will investigate if I should do this here instead of simply setting to 0.

Testing

I will submit a pull request for this issue, and test the joint state controller with Panda arm used in MoveIt tutorials to ensure everything is working as expected, and test results are not polluted by my own robot configuration which could be broken.

Hopefully setting last publish time to zero will not break anything.

You may have unit tests as well, I will see if I can add one.

mjs973 commented 6 months ago

This bug in the joint_state_controller was introduced by commit bba197e4 made on 2023-May-29.

see that Dave Coleman also recommends setting last update time to this in ROS Controllers Boilerplate: // Get current time for use with first update clock_gettime(CLOCK_MONOTONIC, &lasttime);

This is bad advice. In a simulation, the current time does not necessarily advance at the same rate as this clock.

mjs973 commented 6 months ago

@01binary Thank you for taking a look at this. It's unclear to me what commit bba197e was attempting to accomplish (since there's no discussion, and the commit message doesn't say.)

In your patch, I don't see that it's useful to check for last_publish_time_.isZero() as this has a virtually zero chance of ever being true, even in simulation.

01binary commented 6 months ago

Hello, let's address your questions.

It's unclear to me what commit https://github.com/ros-controls/ros_controllers/commit/bba197e449b99fe43e8b3f0903f471f24820da17 was attempting to accomplish

The commit you are referring to is linked to the following pull request:

https://github.com/ros-controls/ros_controllers/pull/617

There is not much discussion in pull request comments, however that pull request in turn was trying to fix the following bug:

https://github.com/ros-controls/ros_controllers/issues/616

This does have a detailed description and a discussion thread. The author asserts that the starting() method of joint state controller gets called every time the controller starts or restarts, such as in a scenario where an emergency stop was triggered. The existing logic was resetting the last publish time to the current time because it made a wrong assumption about when the starting() method can be called (assuming it will only be called once after loading and initialization).

The author goes on to say that when an E-stop condition occurs and the last publish time is reset to the current time, then the rate-limiting logic in the update method will cause the controller to be never updated again, because starting() method will be called repeatedly (every single loop until E-stop is cleared). I never used E-stops, maybe you know if that's true.

last_publish_time_ + ros::Duration(1.0/publish_rate_) < time)

If last_publish_time_ was continuously being reset to the current ROS time, then it will be equal to time every time the update() method executes. Therefore it would never be less than time, especially if we add the publish interval to it.

I don't see that it's useful to check for last_publishtime.isZero() as this has a virtually zero chance of ever being true, even in simulation.

I built & tested my change, loading joint state controller for my robot (unfortunately I could not do this for Panda arm because it defines its own joint state controller and doesn't use this one).

I confirmed the update() loop executes, and added two logging statements: one printed only when last_publish_time_.isZero() evaluates to true, and one printed on every update(). Both statements got printed, proving that the change works in practice.

jsc-updates

As for theory, you'll notice last_publish_time_ is a class member variable declared in the header without an initial value:

https://github.com/ros-controls/ros_controllers/blob/noetic-devel/joint_state_controller/include/joint_state_controller/joint_state_controller.h#L93

ros::Time last_publish_time_;

The ros::Time class stores time like this:

class TimeBase
{
  public:
    uint32_t sec, nsec;

The publish_rate_ member variable of the joint state controller is initialized like this in the inline constructor:

JointStateController() : publish_rate_(0.0) {}

Here's the ros::Time constructor this is using:

explicit TimeBase(double t) { fromSec(t); }

The implementation looks like this:

T& TimeBase<T, D>::fromSec(double t) {
      if (t < 0)
        throw std::runtime_error("Time cannot be negative.");
      if (!std::isfinite(t))
        throw std::runtime_error("Time has to be finite.");
      constexpr double maxInt64AsDouble = static_cast<double>(std::numeric_limits<int64_t>::max());
      if (t >= maxInt64AsDouble)
        throw std::runtime_error("Time is out of 64-bit integer range");
      int64_t sec64 = static_cast<int64_t>(floor(t));
      if (sec64 > std::numeric_limits<uint32_t>::max())
        throw std::runtime_error("Time is out of dual 32-bit range");
      sec = static_cast<uint32_t>(sec64);
      nsec = static_cast<uint32_t>(boost::math::round((t-sec) * 1e9));
      // avoid rounding errors
      sec += (nsec / 1000000000ul);
      nsec %= 1000000000ul;
      return *static_cast<T*>(this);
  }

Since the underlying representation is just "seconds" and "nanoseconds", first it calculates the second by taking a floor of the double value and converting to a signed 64-bit integer. That will be 0 if you pass 0.0.

Next it subtracts the seconds (which, let's say gets something like 0.0000000000000001 due to limitations of floating point values) and multiplies by a billion to convert the remainder to nanoseconds.

Finally it does this bit of magic:

// avoid rounding errors
...
nsec %= 1000000000ul;

You'll notice that std::numeric_limits<double>::epsilon() is something like 2.22045e-16 while this clamps between zero and 1e-9. So if you think of the epsilon as the "noise" level in the "time nanoseconds" signal, we're doing a low pass to 9 digits of precision, and the noise is at 16 digits. That will make sure we ignore values that are close to zero but less than 1 nanosecond.

In other words, the joint state controller constructor will cause the last_publish_time_ to be initialized to zero seconds and zero nanoseconds. Finally, here's the isZero() method I am using:

inline bool isZero() const { return sec == 0 && nsec == 0; }

So here's the proof that my changes work, besides executing them and verifying it all works when running:

Joint state controller initializes with 0 publish time, treated as a "special value" indicating it has not published even once
The update method will run the update if one of the following is true:
- The controller has never published, because last publish time is zero
- It is time to publish again, because the difference between current and last publish time is greater than publish interval

I know this is long, but you asked for it! :)

I need help with unit tests to complete this PR. The failing test is called publishOk. I don't have experience with ROS tests yet and the logic in that test is very messy. The gist of what it's trying to do is verify the controller publishes a specific number of times within a second that should reflect the publish rate in Hz, and that the delta of each update is equal to the publish rate. No idea how to fix yet, but will probably figure this out within a couple of days - help welcome.

01binary commented 5 months ago

Had a busy few weeks, sorry. So the failing test, as I mentioned is called publishOK. Here's how I run the test suite for joint state controller:

catkin_make run_tests_joint_state_controller

I get the following output:

[joint_state_controller.rosunit-joint_state_controller_test/publishOk][FAILURE]-
ros_controllers/joint_state_controller/test/joint_state_controller_test.cpp:140

The difference between real_pub_rate and pub_rate is 7.7832986269866815, which exceeds 0.05 * pub_rate, where
real_pub_rate evaluates to 17.783298626986682,

pub_rate evaluates to 10, and
0.05 * pub_rate evaluates to 0.5.

01binary commented 4 months ago

This was addressed by #630 which was merged.

ros-controls / ros_controllers

Joint state controller throws exception when start time is less than update rate #628