omnid / nuhal

Miscellaneous cross-platform (including microcontroller) C utilities
BSD 3-Clause "New" or "Revised" License

Robot communications timing is not consistent due to UART latency #1

Open ghost opened 3 years ago

ghost commented 3 years ago

We have observed a phenomenon related to latency in the serial port interface on Linux.

Reproduce example 1:

  1. Flash firmware on the robot that has if(consecutive_nones > 0) on line 187 of omni_control.c. A threshold of zero means the robot will error out if it misses even a single command (a sketch of this check is given after this list).
  2. Run the omnid_control mobile_interface node (the specific source we used is at commit 2de56cddde33e626940a63042fe1fd88e3627636 in the omnid_control repo). This node sends twist commands to the mobile robot controller.
  3. Run the test node omnid_control mobile_interface_test according to the documented procedure.
  4. Wait a while. Anywhere from a couple of minutes to a couple of hours in (usually sooner rather than later), the robot will fail this consecutive_nones check. Sometimes the test will complete, but if you make the test long enough, it will eventually fail.
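
For context on the failure mode, here is a hypothetical sketch of the kind of check described in step 1. Only the name consecutive_nones and the greater-than-zero comparison come from omni_control.c; the surrounding function, constant, and reset logic are invented for illustration.

```c
/* Hypothetical sketch of the consecutive_nones check described above.
 * Only the variable name and the > 0 comparison come from omni_control.c;
 * everything else is invented for illustration. */
#include <stdbool.h>
#include <stdint.h>

#define MISSED_COMMAND_THRESHOLD 0  /* threshold of zero: error on a single miss */

static uint32_t consecutive_nones = 0;

/* Called once per control cycle; returns true if the robot should error out. */
static bool check_command_timing(bool command_received)
{
    if (command_received) {
        consecutive_nones = 0;       /* a command arrived in time, reset */
    } else {
        ++consecutive_nones;         /* another cycle with no command */
    }
    return consecutive_nones > MISSED_COMMAND_THRESHOLD;
}
```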

Reproduce example 2:

  1. Check out branch strong/uart-latency on omnid_control, omnid_core, omnid_firmware, and nuhal.
  2. Build with catkin build. The omnid_control mobile_interface node will now be in a test configuration that does not subscribe to ROS topics; it simply sends zero twist commands to the robot continuously. This setup is simpler than example 1 and sometimes produces different errors, but the root cause of failure is suspected to be the same.
  3. Run the node omnid_control mobile_interface on the robot.
  4. Wait a while. Anywhere from a couple of minutes to a couple of hours in (usually sooner rather than later), the robot will fail this consecutive_nones check. The node output includes some timing information which may prove useful.

After a number of tests with different code (all available in the strong/uart-latency branch of the relevant repos), we have concluded that this is likely due to a delay in the return of Linux system calls. The robot appears to respond normally, but the host node does not send its response fast enough (confirmed on a scope). On the frame in which the failure occurs, the protocol_read_block call in the node takes longer than usual to return.
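
One way to pin down where the host-side time goes is to wrap the blocking serial read in monotonic-clock timestamps and log any frame that exceeds a latency budget. This is a generic POSIX sketch, not the actual protocol_read_block implementation; the device path, frame size, and threshold are placeholders.

```c
/* Sketch: measure how long a blocking serial read takes, to spot frames
 * where the read syscall returns late. The device path, frame size, and
 * latency budget are placeholders, not values from the omnid code. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double elapsed_ms(const struct timespec *start, const struct timespec *end)
{
    return (end->tv_sec - start->tv_sec) * 1e3
         + (end->tv_nsec - start->tv_nsec) * 1e-6;
}

int main(void)
{
    const double warn_threshold_ms = 2.0;   /* placeholder latency budget */
    uint8_t frame[64];                      /* placeholder frame size */
    int fd = open("/dev/ttyUSB0", O_RDONLY | O_NOCTTY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    for (;;) {
        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        ssize_t n = read(fd, frame, sizeof frame);   /* blocking read */
        clock_gettime(CLOCK_MONOTONIC, &end);
        double ms = elapsed_ms(&start, &end);
        if (n <= 0 || ms > warn_threshold_ms) {
            fprintf(stderr, "read returned %zd bytes after %.3f ms\n", n, ms);
        }
    }
}
```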

Things we tried:

Potential paths to rectify this include:

Relevant links:

- https://wiki.linuxfoundation.org/realtime/documentation/technical_basics/preemption_models
- https://nicolovaligi.com/concurrency-and-parallelism-in-ros1-and-ros2-linux-kernel-tools.html
- https://lwn.net/Articles/743740/
- http://wiki.ros.org/realtime_tools
- https://askubuntu.com/questions/656771/process-niceness-vs-priority

m-elwin commented 3 years ago

The most important link is the Linux Foundation real-time wiki (wiki.linuxfoundation.org/realtime, the first link above).

See https://lwn.net/Articles/837019/ (the slides therein also show how to plot where each process is executing vs. suspended over time).

The chrt command can help, but we can also set the scheduling policy and priority from code.
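
For reference, the in-code equivalent of running the node under something like chrt -f 80 is a sched_setscheduler call; the priority value below is only an example.

```c
/* Sketch: give the current process a real-time FIFO priority from code,
 * roughly equivalent to launching it with `chrt -f 80`. Requires root or
 * CAP_SYS_NICE; the priority value 80 is just an example. */
#include <sched.h>
#include <stdio.h>

int set_fifo_priority(int priority)
{
    struct sched_param param = { .sched_priority = priority };
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {  /* 0 = this process */
        perror("sched_setscheduler");
        return -1;
    }
    return 0;
}

int main(void)
{
    return set_fifo_priority(80) == 0 ? 0 : 1;
}
```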

We can use the linux-lowlatency kernel to do a bit better (if we also increase process priority/niceness), but what we ultimately want is to run this task under the deadline scheduler (have not quite figured out how to do this yet, but see man sched and the realtime documentation).
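
A minimal sketch of what the deadline-scheduler route might look like, based on sched_setattr(2) rather than anything tested here. glibc does not wrap sched_setattr, so the attr struct and the raw syscall are written out by hand (assuming a glibc whose headers define SYS_sched_setattr), and the runtime/deadline/period values are placeholders that would need to be sized to the actual control loop.

```c
/* Sketch: put the calling thread under SCHED_DEADLINE. glibc has no
 * sched_setattr() wrapper, so the struct and raw syscall come from
 * sched_setattr(2). All timing values are placeholders; requires root
 * or CAP_SYS_NICE. */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

/* Layout from sched_setattr(2); not provided by glibc headers. */
struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;   /* ns of CPU guaranteed per period */
    uint64_t sched_deadline;  /* ns by which the work must finish */
    uint64_t sched_period;    /* ns between activations */
};

int main(void)
{
    struct sched_attr attr = {
        .size           = sizeof(attr),
        .sched_policy   = SCHED_DEADLINE,
        .sched_runtime  =  1000000ULL,   /* 1 ms  (placeholder) */
        .sched_deadline =  5000000ULL,   /* 5 ms  (placeholder) */
        .sched_period   = 10000000ULL,   /* 10 ms (placeholder) */
    };

    /* pid 0 = calling thread; the final flags argument must be 0 */
    if (syscall(SYS_sched_setattr, 0, &attr, 0) != 0) {
        perror("sched_setattr");
        return 1;
    }

    /* ... periodic serial I/O would run here under the deadline policy ... */
    return 0;
}
```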

We can experiment with all of this scheduling and affinity tuning (including the deadline scheduler) without the RT_PREEMPT kernel; it's just that only the RT_PREEMPT kernel can give us guarantees.
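
For completeness, pinning the process to a CPU from code is a short wrapper around sched_setaffinity (the in-code equivalent of taskset); the CPU index here is only an example, and none of this requires the RT_PREEMPT kernel on its own.

```c
/* Sketch: pin the current process to one CPU with sched_setaffinity,
 * the in-code equivalent of `taskset -c 2`. CPU index 2 is just an example. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* 0 = this process */
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

int main(void)
{
    return pin_to_cpu(2) == 0 ? 0 : 1;
}
```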

The delay in sending out serial data is almost certainly caused by latency waiting for the read syscall to complete, or something along those lines. Scheduling our process with a higher priority or under the deadline scheduler should help.