NodeHandle_::now() jumps back 49.7 days after power on

rgreid commented 4 years ago

During a safety review we found that the NodeHandle_::now() time wraps after 49.7 days of up time. This will cause now() to jump backwards 49.7 days into the past. The glitch will last for a few seconds until the next requestSyncTime() and setNow(Time & new_now) occurs.

On our robots, this will trigger an emergency power off!

The bug is in the NodeHandle_ member functions in node_handle.h:

  Time now()
  {
    uint32_t ms = hardware_.time();
    Time current_time;
    current_time.sec = ms / 1000 + sec_offset;
    current_time.nsec = (ms % 1000) * 1000000UL + nsec_offset;
    normalizeSecNSec(current_time.sec, current_time.nsec);
    return current_time;
  }

  void setNow(Time & new_now)
  {
    uint32_t ms = hardware_.time();
    sec_offset = new_now.sec - ms / 1000 - 1;
    nsec_offset = new_now.nsec - (ms % 1000) * 1000000UL + 1000000000UL;
    normalizeSecNSec(sec_offset, nsec_offset);
  }

Here the low-level hardware_.time() is an uint32_t in milliseconds and it wraps after 2^32/1000/3600/24 = 49.7 days. This is typical, and is not the cause of the problem.

In setNow() the sec_offset and nsec_offset values establish an offset "epoch" for when the rosserialclient was powered on, i.e. `hardware.time() == 0`.

In now() this epoch is used to calculate and return the ros::Time current_time. It makes the assumption that hardware_.time() increases monotonically, however, it's not; when the uint32_t wraps, the current_time jumps back to the epoch.

There's a few ways to fix this. We'd be happy to implement and do a PR, however since it's core functionality, it'd be good to get the community's input. E.g. I'm looking at this issue, and wondering if we should be avoiding large offsets.

iamrajee commented 4 years ago

Wow! Nice observation,

rgreid commented 4 years ago

@mikepurvis tagging you since this bug will affect quite a few Clearpath robots. We can fix it, just looking for some guidance to ensure the PR will be accepted.

mikepurvis commented 4 years ago

@rgreid Thanks for pointing this out! I would welcome a PR which addresses the issue. Note that we have some existing gtests which cover time/duration math, so it should be fairly easy to add one which manually "ticks" whatever the new system is through the wraparound scenario in order to prove that it does the right thing:

https://github.com/ros-drivers/rosserial/blob/noetic-devel/rosserial_client/test/

Also note that if you are working in this area, it may also be of interest to examine the ETH Zurich branch which slews the microcontroller's time to match ongoing updates from the host, rather than just setting the epoch once on startup:

https://github.com/ethz-asl/rosserial/blob/9136b35e15482d0384848613ab96959224397f86/rosserial_client/src/ros_lib/ros/node_handle.h#L351-L362

I don't think it's in a mergeable state as-is— the control loop variables would need to be namespaced in a class or something, but if this is of interest to others, I would definitely entertain including it here.

ros-drivers / rosserial

NodeHandle_::now() jumps back 49.7 days after power on #502