Open · rgreid opened this issue 4 years ago
Wow! Nice observation.
@mikepurvis tagging you since this bug will affect quite a few Clearpath robots. We can fix it, just looking for some guidance to ensure the PR will be accepted.
@rgreid Thanks for pointing this out! I would welcome a PR which addresses the issue. Note that we have some existing gtests which cover time/duration math, so it should be fairly easy to add one which manually "ticks" whatever the new system is through the wraparound scenario in order to prove that it does the right thing:
https://github.com/ros-drivers/rosserial/blob/noetic-devel/rosserial_client/test/
Also note that if you are working in this area, it may be of interest to examine the ETH Zurich branch, which slews the microcontroller's time to match ongoing updates from the host rather than just setting the epoch once on startup:
I don't think it's in a mergeable state as-is; the control loop variables would need to be namespaced in a class or something. But if this is of interest to others, I would definitely entertain including it here.
During a safety review we found that the `NodeHandle_::now()` time wraps after 49.7 days of uptime. This will cause `now()` to jump backwards 49.7 days into the past. The glitch will last for a few seconds until the next `requestSyncTime()` and `setNow(Time & new_now)` occurs. On our robots, this will trigger an emergency power off!
The bug is in the `NodeHandle_` member functions in node_handle.h:

Here the low-level `hardware_.time()` is a `uint32_t` in milliseconds, and it wraps after 2^32/1000/3600/24 = 49.7 days. This is typical, and is not the cause of the problem.

In `setNow()` the `sec_offset` and `nsec_offset` values establish an offset "epoch" for when the rosserial client was powered on, i.e. `hardware_.time() == 0`.

In `now()` this epoch is used to calculate and return the `ros::Time current_time`. It makes the assumption that `hardware_.time()` increases monotonically; however, it does not: when the `uint32_t` wraps, `current_time` jumps back to the epoch.

There are a few ways to fix this. We'd be happy to implement one and do a PR; however, since it's core functionality, it'd be good to get the community's input. E.g. I'm looking at this issue and wondering if we should be avoiding large offsets.