robotology / yarp

YARP - Yet Another Robot Platform
http://www.yarp.it

Discussion about real time thread priorities #1215

Open diegoferigo opened 7 years ago

diegoferigo commented 7 years ago

Real-time thread support in yarp is still experimental and hence undocumented. It could be a useful feature for critical threads that run on icub-head, and could potentially mitigate situations like https://github.com/robotology/icub-support/issues/449.

Before starting, a bit of history, at least what I have found so far.

Some years ago yarp gained experimental support for tweaking the priorities of important threads (C++11 support, later tweak). Beyond PortCore, this feature is used in yarprobotinterface and ethManager through the experimental ICUB_USE_REALTIME_LINUX CMake option (OFF by default). The latter support is Unix-specific, which could be the reason for the default value of that option. More generally, all classes that inherit from ThreadImpl now have a multiplatform setPriority() method.

I have some comments about:

  1. The real-time priorities set in this way must be selected manually, and no guidelines exist yet. I think that defining predefined ranges per task category (for the categories that could require RT) is worth discussing
  2. It would be nice to have a static int setPriority() method in ThreadImpl that can also be called in ethManager, in order to provide multiplatform support and to avoid breaking builds on non-Unix machines when ICUB_USE_REALTIME_LINUX=1
  3. Even if further testing is needed, the current (non-multiplatform) priority tweak for critical threads (like ethManager) should be enabled by default if Unix is detected

For future reference (and to clear my mind), I recap below what I understood about the policy of unix priorities.

There are two kinds of tasks, each linked to a different priority range:

| Dynamic priority | Static priority |
|---|---|
| real-time | not real-time |
| SCHED_FIFO, SCHED_RR | SCHED_OTHER, SCHED_NORMAL |
| RTPRIO=[1, 99] | NICENESS=[-20, 19] |
| High RTPRIO means high priority | High NICENESS means low priority |

One important detail worth noting is that the kernel handles the queue internally with its own logic, while userspace tools such as top show a different range. Yes, it is confusing, and you can find funny discussions on the linux-rt mailing list. This document and <include/linux/sched/prio.h> have been really helpful.

Below I'll focus on the PRI field you can obtain from ps -eo pid,rtprio,ni,pri,comm (check Note#1 at the end of the post). To my understanding, the general priority PRI=[0:139] of a thread can be defined as follows:

PRI = (19 - NICENESS) + RTPRIO

considering:

* RTPRIO=0     if sched is OTHER or NORMAL
* NICENESS=-21 if the sched is FIFO or RR  (-21 because RTPRIO=1 means PRI=41)

The range generated by this equation is the following:

   --------->                                default priority of kernel
     higher                                  tasklets and interrupt handler
    priority                                 v
                                 90 (=19+21+50)
0           39 41                v                               139      PRI field
|-------------|----------------------------------------------------|
0           39 1                                                  99
<------------><---------------------------------------------------->
   NICENESS                         RTPRIO
(scaled and inverted)

Note: PRI=40 is not allowed. See Note#1 below.
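The mapping above can be written down as a small function. This is only a sketch of the formula as stated in this post, not of how the kernel computes priorities internally; the function name `ps_pri` and the policy strings are hypothetical and chosen for illustration.

```python
# Sketch of the formula from this post: PRI = (19 - NICENESS) + RTPRIO.
# Names are illustrative assumptions, not part of any real API.
RT_POLICIES = {"SCHED_FIFO", "SCHED_RR"}
NORMAL_POLICIES = {"SCHED_OTHER", "SCHED_NORMAL"}


def ps_pri(policy: str, niceness: int = 0, rtprio: int = 0) -> int:
    """Return the PRI value described in this post for a given policy."""
    if policy in RT_POLICIES:
        if not 1 <= rtprio <= 99:
            raise ValueError("RTPRIO must be in [1, 99] for real-time policies")
        niceness = -21  # fixed offset so that RTPRIO=1 maps to PRI=41
    elif policy in NORMAL_POLICIES:
        if not -20 <= niceness <= 19:
            raise ValueError("NICENESS must be in [-20, 19]")
        rtprio = 0  # non real-time tasks carry no RTPRIO
    else:
        raise ValueError(f"unknown policy: {policy}")
    return (19 - niceness) + rtprio
```

With these definitions the non real-time tasks cover PRI 0 to 39, the real-time ones cover 41 to 139, and no input produces 40.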

Notes

Note#1

Something is still not clear, and I hope someone sooner or later can shed light on this. man ps reports at its bottom an explanation of the fields that the UNIX-style option -o can produce. In particular:

pri PRI priority of the process. Higher number means lower priority.

But, looking at the output of ps -eo pid,rtprio,ni,pri,comm, that column is ordered over [0, 139] with low values meaning low priority. You can check this yourself by grepping for the irqs, pulseaudio, etc., which have a bigger PRI. Apparently I'm not the only one with doubts.

References:

* Overview of CPU scheduling
* A complete guide to Linux process scheduling
* Real-Time Linux Kernel Scheduler
* RT Preempt HowTO
* IRQs in RT kernels

DanielePucci commented 7 years ago

@lornat75 @marcoaccame @randaz81 @pattacini @traversaro @francesco-romano

marcoaccame commented 7 years ago

Hi all, I have some more info about the previous history of real-time threads in robotInterface (around 2014). Let me search for the related redmine issues. I will post a brief summary soon.

marcoaccame commented 7 years ago

Here is some more history.

  1. There was a time (about end of 2014) when robotInterface threads had SCHED_FIFO scheduling policy and ad-hoc RT priorities. This mechanism was introduced to solve some problems we experienced such as sporadic packet loss.
    In particular: all robotInterface threads had scheduling policy SCHED_FIFO. All had priority 33, apart from thread EthSender (48) and EthReceiver (49). See Note-1 for my comments on these priorities. I recall that @apaikan and I worked with RT another time, in 2015, to reduce the CPU load of old PC104. See Appendix for details of both activities.
  2. The RT mode was enabled by means of the ICUB_REALTIME_EXPERIMENTAL CMake variable which defines the value of macro ICUB_USE_REALTIME_LINUX which adds to the code some Linux system calls. At the time we did not have YARP support and we had to use Linux system calls, making the system non-portable.
  3. Some time after YARP gained RT support but the code under macro ICUB_USE_REALTIME_LINUX was not changed to use the new APIs.
  4. For several reasons, the yarprobotinterface threads don't have this RT feature anymore. Reasons are:
    • The default value of the ICUB_REALTIME_EXPERIMENTAL variable has always been 0 and it never became 1.
    • Documentation in http://wiki.icub.org/wiki/Compilation_on_the_pc104 does not mention the use of this variable, hence new installations don't have RT mode turned on.
    • When the main.cpp of robotInterface moved to YARP, the ICUB_REALTIME_EXPERIMENTAL did not move as well with the result that the code under macro ICUB_USE_REALTIME_LINUX is never compiled.
  5. In some robots we now have a more powerful processor (new COM Express vs old PC104) with a local hard disk. These features much improve computation capabilities and system responsiveness.
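The Linux-specific calls mentioned in point 2 are not shown in this thread. As a rough illustration only, the sketch below uses Python's standard `os` wrappers around `sched_setscheduler` to request SCHED_FIFO for the calling thread. The helper name `set_fifo_priority` and the constants are hypothetical; only the values 33, 48 and 49 come from point 1. The original code is C++ and may differ.

```python
import os

# Priorities reported in point 1 for the 2014 setup (constant names are illustrative).
ROBOTINTERFACE_PRIO = 33
ETH_SENDER_PRIO = 48
ETH_RECEIVER_PRIO = 49


def set_fifo_priority(priority: int) -> bool:
    """Try to switch the calling thread to SCHED_FIFO at `priority`.

    Returns True on success and False when the platform lacks the call
    or the user is not allowed to use real-time priorities.
    """
    if not hasattr(os, "sched_setscheduler"):
        return False  # not a platform exposing the POSIX scheduler API
    lowest = os.sched_get_priority_min(os.SCHED_FIFO)
    highest = os.sched_get_priority_max(os.SCHED_FIFO)
    if not lowest <= priority <= highest:
        raise ValueError(f"priority must be in [{lowest}, {highest}]")
    try:
        # pid 0 means "the caller"
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    except PermissionError:
        return False  # e.g. no rtprio entry in limits.conf for this user
    return True


if __name__ == "__main__":
    ok = set_fifo_priority(ETH_RECEIVER_PRIO)
    print("SCHED_FIFO applied" if ok else "could not apply SCHED_FIFO")
```

Returning False instead of raising on a permission error mirrors the idea that the RT mode is optional: the process keeps running with the default policy when the system is not configured for it.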

Now, I recall that in the time of the old PC104 the RT mode was required to mitigate packet loss. So why is ICUB_USE_REALTIME_LINUX now off, and yet we don't see the problems of years ago?

I think the reason is in point (5): the system now has much more computation capability and the problem does not (statistically?) show up.

However, I think that the improved computation capabilities are not enough to guarantee a safe system.

And here is a proposal for a way forward:

  1. We should at least revert back to RT mode as tested some time ago.
  2. Afterwards, I agree with @diegoferigo, we should define and experiment with a proper assignment of priorities to the threads running inside icub-head. Note-2 contains some thoughts on priorities.

Notes

Note-1 I don't recall the reason of numbers 33, 48 and 49. I remember only that originally they were 33, 49 and 49 and later we moved one 49 into 48. Maybe @apaikan remembers why.

Note-2 If I should define priorities now I would choose values in the range [50, 99], so that they sit between the system IRQs which maintain the highest priorities and the remaining common user threads. Moreover, I would assign highest priority to the EthSender (it normally executes for shorter time, plus we really want to deliver commands to the motors ASAP). Then, as second, I would put the EthReceiver (because it parses UDP packets, unlocks blocking requests and fills all data coming from the ETH boards inside the devices - embObjStrain, embObjMotionControl, etc. -, data which is normally read at 10 ms rate). Finally, as third I would put all other threads in yarprobotinterface (because I assume they just deal with port communication ... but see my Question-1).


Questions

Question-1 What are the other threads in yarprobotinterface? Those associated with the wrappers?


Appendix

Here are details of past activities reported in redmine.

Issue about Packet loss in PC104

In https://redmine.robotology.eu/issues/87 it is first mentioned that in Oct 2014 @apaikan introduced a variable to toggle on/off the use of a realtime kernel with FIFO scheduling and suitable priorities in the threads of robotInterface.

I tested that feature for the sake of solving the problem of packet loss in PC104.

For those of you who don't have access to redmine, here are some excerpts from the issue:

Marco Accame 13 Oct 2014 07:31 PM


The result of these tests is that the packet loss happens in the PC104 and is due to 
incorrect scheduling of the receiving thread and to the small size of the buffer. 
By applying suitable small changes to the existing kernel setup and to the code of 
robotInterface, the problem seems to be solved.

The brief of all the changes which solve the problem is in the following.

In system:
1. Use of a realtime kernel with FIFO scheduling 
2. Increased max size of UDP receiving buffer to 8MB: added line "net.core.rmem_max=8388608" 
   in file /etc/sysctl.conf
3. Given to icub user the right to run high priority threads: added the two lines 
   "icub soft rtprio 99" and "icub hard rtprio 99" in file /etc/security/limits.conf

In the code of robotInterface:
4. priority of robotInterface process set to 33 (see main.cpp),
5. priority of receiving thread equal to 49 (see ethManager.cpp, EthReceiver::threadInit()),
6. priority of transmitting thread equal to 48 (see ethManager.cpp, EthSender::threadInit()),
7. buffer size of receiving thread equal to 1MB (see ethManager.cpp, EthReceiver::config())

Marco Accame 03 Nov 2014 11:55 AM

With the above changes and after continuous use on iCubGenova04 the problem has not 
shown anymore.

The above changes related to robotInterface (points 4 to 7) are already in icub-main.

The changes related to system (points 1 to 3) are issue of another task dealing with
preparation of a new live linux key.
Can anybody working on this task please add a link?

NOTE1: the iCubGenova04 of end 2014 has now become iCubNancy01.

NOTE2: The system settings of points 1 to 3 are now in use in our robots.
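To see whether a given machine matches the system settings of points 2 and 3 and the buffer request of point 7, something like the following sketch could be used. It relies only on Python's standard library; the helper names are hypothetical, and the `/proc` path and `RLIMIT_RTPRIO` are Linux-specific, so the helpers return None elsewhere.

```python
import socket

RMEM_MAX_PATH = "/proc/sys/net/core/rmem_max"  # Linux-only; mirrors net.core.rmem_max


def read_rmem_max():
    """Return the system-wide cap on UDP receive buffers, or None if unavailable."""
    try:
        with open(RMEM_MAX_PATH) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def rtprio_limit():
    """Return the (soft, hard) rtprio limits of this process, or None if unsupported."""
    try:
        import resource
    except ImportError:
        return None
    which = getattr(resource, "RLIMIT_RTPRIO", None)
    if which is None:
        return None
    return resource.getrlimit(which)


def effective_udp_rcvbuf(requested: int = 1 << 20) -> int:
    """Request a receive buffer (1 MB by default) and return what the OS granted."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    print("rmem_max:", read_rmem_max())
    print("rtprio (soft, hard):", rtprio_limit())
    print("granted SO_RCVBUF:", effective_udp_rcvbuf())
```

On Linux the granted value is normally not equal to the requested one: the kernel doubles the request for bookkeeping and caps it according to net.core.rmem_max, which is why raising that cap was part of the fix.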

Issue about CPU load in PC104

Moreover, in April 2015 I did some more work with @apaikan focused on mitigating the CPU load of the old PC104.

See https://redmine.robotology.eu/issues/561. We did many tests with the RT mode turned ON. It was decided:

  1. to use periodic threads for both EthSender and EthReceiver (earlier the receiver was not periodic and was triggered on reception of any UDP packet) and,
  2. to reduce the TX rate of the ETH boards to a reasonable value which is now configurable in runtime via xml files

We tested on old iCubGenova04, upper part of a new iCubGenova02, iCubHeidelberg, and later on we deployed on all other ETH robots.

lornat75 commented 7 years ago

I agree we should enable realtime support for the software running on the head pc and raise the priority for the ethManager, also making sure it is always on by default and in future installations of the robot.

diegoferigo commented 7 years ago

Note: This post was edited after @apaikan comment below.


Thank you @marcoaccame for this detailed report, well done. Following your Note-1 and Note-2 in particular, this is the priority logic I propose:

| 0 ------------------ 19 | 20 ------------------- 39 | 40 ------- 49 |   50   |
    Threads that should        Threads containing          Critical     Kernel
    run on top of user         functions with strict       general      IRQs
    threads but w/o            RT constraints              threads
    strict RT constraints      
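As a quick way to read the ladder above, the sketch below maps an RT priority to its band. The table and the function name `band_for` are hypothetical; the boundaries are copied from the diagram.

```python
# Bands copied from the diagram above; names are illustrative only.
BANDS = [
    (0, 19, "above user threads, no strict RT constraints"),
    (20, 39, "strict RT constraints"),
    (40, 49, "critical general threads"),
    (50, 50, "kernel IRQs"),
]


def band_for(rtprio: int):
    """Return the band description for `rtprio`, or None if it is outside the ladder."""
    for low, high, label in BANDS:
        if low <= rtprio <= high:
            return label
    return None
```

With this table the historical values land as follows: 33 (robotInterface threads) falls in the strict RT band, while 48 and 49 (EthSender and EthReceiver) fall in the critical band, just below the kernel IRQs at 50.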

Please let me know your opinion.

cc @marcoaccame @drdanz @lornat75 @randaz81 @traversaro @francesco-romano @pattacini @barbalberto

marcoaccame commented 7 years ago

Hi @diegoferigo, it seems fine by me.

apaikan commented 7 years ago

@diegoferigo Please note that the Ethernet controller thread in RT Linux has a priority level of 50. With FIFO scheduling, a higher number actually means a higher priority. So raising an application thread's priority above 50 can preempt the network thread and may result in higher latency in general! This is the main reason why the yarprobotinterface threads have a priority (33) lower than 50! Reminder from @marcoaccame:

In particular: all robotInterface threads had scheduling policy SCHED_FIFO. All had priority 33, apart from
thread EthSender (48) and EthReceiver (49). See Note-1 for my comments on these priorities.
I recall that @apaikan and I worked with RT another time, in 2015, to reduce the CPU load of old PC104.

marcoaccame commented 7 years ago

Hi @apaikan and @diegoferigo,

two things in parallel:

  1. be sure of the rule, i.e. understand whether a higher number also means higher priority or whether it is the other way round.
  2. agree on the priority ladder.

About 1: I simply don't know.

About 2: see also issue #1233. My opinion is that from highest to lowest priorities:

diegoferigo commented 7 years ago

@apaikan I see a big misconception here: apparently there is a lot of confusion about how these ranges work, and it looks like you're right. The two scales use opposite logic, and this invalidates the scale in the first post of this issue. I'll edit the posts accordingly.

Thanks for your intervention, it was really helpful!

Since you're here, did you choose FIFO over RR for any reason?

lornat75 commented 7 years ago

I suggest we start by raising the priorities of ethReceiver and ethSender only. Changing priorities for other threads may have hidden side effects, so it needs to be done with care.

diegoferigo commented 7 years ago

I updated the first post; I hope the situation is now clearer to everyone.

@lornat75 After @apaikan's comments, it seems that the current priorities already look good and comply with the ranges I envisioned.

During the last few days, the flag ICUB_USE_REALTIME_LINUX has been set to ON on iCubGenova04 and everything worked flawlessly.