omnid / nuhal

Miscellaneous cross-platform (including microcontroller) C utilities
BSD 3-Clause "New" or "Revised" License

Robot communications timing is not consistent due to UART latency #1

Open ghost opened 3 years ago

ghost commented 3 years ago

We have observed a phenomenon related to latency in the serial port interface on Linux.

Reproduce example 1:

  1. Flash firmware on the robot that has if(consecutive_nones > 0) on line 187 of omni_control.c. A threshold of zero means the robot will error out if it misses even a single command (a sketch of this check is given after this list).
  2. Run the omnid_control mobile_interface node (the specific source we used is at commit 2de56cddde33e626940a63042fe1fd88e3627636 in the omnid_control repo). This node sends twist commands to the mobile robot controller.
  3. Run the test node omnid_control mobile_interface_test according to the documented procedure.
  4. Wait a while. Anywhere from a couple of minutes to a couple of hours in (usually sooner rather than later), the robot will fail this consecutive_nones check. Sometimes the test will complete, but if you make the test long enough, it will eventually fail.
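
For context on the failure mode, here is a hypothetical sketch of the kind of check described in step 1. Only the name consecutive_nones and the greater-than-zero comparison come from omni_control.c; the surrounding function, constant, and reset logic are invented for illustration.

```c
/* Hypothetical sketch of the consecutive_nones check described above.
 * Only the variable name and the > 0 comparison come from omni_control.c;
 * everything else is invented for illustration. */
#include <stdbool.h>
#include <stdint.h>

#define MISSED_COMMAND_THRESHOLD 0  /* threshold of zero: error on a single miss */

static uint32_t consecutive_nones = 0;

/* Called once per control cycle; returns true if the robot should error out. */
static bool check_command_timing(bool command_received)
{
    if (command_received) {
        consecutive_nones = 0;       /* a command arrived in time, reset */
    } else {
        ++consecutive_nones;         /* another cycle with no command */
    }
    return consecutive_nones > MISSED_COMMAND_THRESHOLD;
}
```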

Reproduce example 2:

  1. Check out branch strong/uart-latency on omnid_control, omnid_core, omnid_firmware, and nuhal.
  2. Build with catkin build. The omnid_control mobile_interface node will now be in a test configuration that does not subscribe to ROS topics; it simply sends zero twist commands to the robot continuously. This setup is simpler than example 1 and sometimes produces different errors, but the root cause of failure is suspected to be the same.
  3. Run the node omnid_control mobile_interface on the robot.
  4. Wait a while. Anywhere from a couple of minutes to a couple of hours in (usually sooner rather than later), the robot will fail this consecutive_nones check. The node output includes some timing information which may prove useful.

After a number of tests with different code (all available in the strong/uart-latency branch of the relevant repos), we have concluded that this is likely due to a delay in the return of Linux system calls. The robot appears to respond normally, but the host node does not send its response fast enough (confirmed on a scope). On the frame in which the failure occurs, the protocol_read_block call in the node takes longer than usual to return.
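
One way to pin down where the host-side time goes is to wrap the blocking serial read in monotonic-clock timestamps and log any frame that exceeds a latency budget. This is a generic POSIX sketch, not the actual protocol_read_block implementation; the device path, frame size, and threshold are placeholders.

```c
/* Sketch: measure how long a blocking serial read takes, to spot frames
 * where the read syscall returns late. The device path, frame size, and
 * latency budget are placeholders, not values from the omnid code. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double elapsed_ms(const struct timespec *start, const struct timespec *end)
{
    return (end->tv_sec - start->tv_sec) * 1e3
         + (end->tv_nsec - start->tv_nsec) * 1e-6;
}

int main(void)
{
    const double warn_threshold_ms = 2.0;   /* placeholder latency budget */
    uint8_t frame[64];                      /* placeholder frame size */
    int fd = open("/dev/ttyUSB0", O_RDONLY | O_NOCTTY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    for (;;) {
        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        ssize_t n = read(fd, frame, sizeof frame);   /* blocking read */
        clock_gettime(CLOCK_MONOTONIC, &end);
        double ms = elapsed_ms(&start, &end);
        if (n <= 0 || ms > warn_threshold_ms) {
            fprintf(stderr, "read returned %zd bytes after %.3f ms\n", n, ms);
        }
    }
}
```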

Things we tried:

Potential paths to rectify this include:

Relevant links:

- https://wiki.linuxfoundation.org/realtime/documentation/technical_basics/preemption_models
- https://nicolovaligi.com/concurrency-and-parallelism-in-ros1-and-ros2-linux-kernel-tools.html
- https://lwn.net/Articles/743740/
- http://wiki.ros.org/realtime_tools
- https://askubuntu.com/questions/656771/process-niceness-vs-priority

m-elwin commented 3 years ago

The most important link is the Linux Foundation real-time wiki (wiki.linuxfoundation.org/realtime, the first link above).

See https://lwn.net/Articles/837019/ (the slides therein also show how to plot where each process is executing vs. suspended over time).

The chrt command can help, but we can also set the scheduling policy and priority from code.
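
For reference, the in-code equivalent of running the node under something like chrt -f 80 is a sched_setscheduler call; the priority value below is only an example.

```c
/* Sketch: give the current process a real-time FIFO priority from code,
 * roughly equivalent to launching it with `chrt -f 80`. Requires root or
 * CAP_SYS_NICE; the priority value 80 is just an example. */
#include <sched.h>
#include <stdio.h>

int set_fifo_priority(int priority)
{
    struct sched_param param = { .sched_priority = priority };
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {  /* 0 = this process */
        perror("sched_setscheduler");
        return -1;
    }
    return 0;
}

int main(void)
{
    return set_fifo_priority(80) == 0 ? 0 : 1;
}
```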

We can use the linux-lowlatency kernel to do a bit better (if we also increase process priority/niceness), but what we ultimately want is to run this task under the deadline scheduler (have not quite figured out how to do this yet, but see man sched and the realtime documentation).
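
A minimal sketch of what the deadline-scheduler route might look like, based on sched_setattr(2) rather than anything tested here. glibc does not wrap sched_setattr, so the attr struct and the raw syscall are written out by hand (assuming a glibc whose headers define SYS_sched_setattr), and the runtime/deadline/period values are placeholders that would need to be sized to the actual control loop.

```c
/* Sketch: put the calling thread under SCHED_DEADLINE. glibc has no
 * sched_setattr() wrapper, so the struct and raw syscall come from
 * sched_setattr(2). All timing values are placeholders; requires root
 * or CAP_SYS_NICE. */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

/* Layout from sched_setattr(2); not provided by glibc headers. */
struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;   /* ns of CPU guaranteed per period */
    uint64_t sched_deadline;  /* ns by which the work must finish */
    uint64_t sched_period;    /* ns between activations */
};

int main(void)
{
    struct sched_attr attr = {
        .size           = sizeof(attr),
        .sched_policy   = SCHED_DEADLINE,
        .sched_runtime  =  1000000ULL,   /* 1 ms  (placeholder) */
        .sched_deadline =  5000000ULL,   /* 5 ms  (placeholder) */
        .sched_period   = 10000000ULL,   /* 10 ms (placeholder) */
    };

    /* pid 0 = calling thread; the final flags argument must be 0 */
    if (syscall(SYS_sched_setattr, 0, &attr, 0) != 0) {
        perror("sched_setattr");
        return 1;
    }

    /* ... periodic serial I/O would run here under the deadline policy ... */
    return 0;
}
```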

We can experiment with all of this scheduling and affinity tuning (including the deadline scheduler) without the RT_PREEMPT kernel; it's just that only the RT_PREEMPT kernel can give us guarantees.
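
For completeness, pinning the process to a CPU from code is a short wrapper around sched_setaffinity (the in-code equivalent of taskset); the CPU index here is only an example, and none of this requires the RT_PREEMPT kernel on its own.

```c
/* Sketch: pin the current process to one CPU with sched_setaffinity,
 * the in-code equivalent of `taskset -c 2`. CPU index 2 is just an example. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* 0 = this process */
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

int main(void)
{
    return pin_to_cpu(2) == 0 ? 0 : 1;
}
```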

The delay in sending out serial data is almost certainly caused by latency waiting for the read syscall to complete, or something along those lines. Scheduling our process with a higher priority or under the deadline scheduler should help.