What is the expected behavior for missing loop rate?

tylerjw commented 2 years ago

Currently, when the loop rate is missed, the following iterations "catch up" because the design is to ask for each loop iteration to start at a set time.

If you combine that with trajectory execution and position control, the resulting command values are set with close to 0 change in position.

If your robot takes control messages and always takes the loop-rate amount of time in its internal controller to process a single command, you now have a problem.

You have commanded as close to the maximum possible acceleration at your control rate, because in one command the robot was moving and the next n commands, sent while ros control "catches up", have zero position change.

Is there anything we should change about the way the loop works to make it more robust to position control when run on a non-deterministic system like normal Ubuntu?
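
To make the "catch up" effect concrete, here is a minimal sketch that simulates a loop which always waits for a fixed absolute deadline. It is arithmetic only (no real sleeping) and is not the ros2_control implementation; the function names and the microsecond units are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simulates a loop that waits for an absolute deadline each cycle and then
// advances the deadline by one period, whether or not the cycle was late.
// work_us[i] is how long iteration i takes. Returns the start time of each
// iteration in microseconds.
std::vector<std::int64_t> simulate_start_times(
  std::int64_t period_us, const std::vector<std::int64_t> & work_us)
{
  std::vector<std::int64_t> starts;
  std::int64_t now = 0;       // simulated clock
  std::int64_t deadline = 0;  // absolute time the next iteration should start
  for (std::int64_t work : work_us) {
    now = std::max(now, deadline);  // "sleep until deadline": no-op if already late
    starts.push_back(now);
    now += work;            // the iteration body runs
    deadline += period_us;  // fixed schedule, regardless of lateness
  }
  return starts;
}

// Differences between consecutive start times.
std::vector<std::int64_t> deltas(const std::vector<std::int64_t> & t)
{
  std::vector<std::int64_t> d;
  for (std::size_t i = 1; i < t.size(); ++i) {
    d.push_back(t[i] - t[i - 1]);
  }
  return d;
}
```

With a 1000 µs period and one iteration that takes 3500 µs, the deltas between iteration starts come out as 1000, 3500, 100, 100, 300: after the overrun, the next iterations run almost back-to-back until the schedule is recovered.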

AndyZe commented 2 years ago

This is a good reason to get the Ruckig smoothing PR merged: https://github.com/ros-controls/ros2_controllers/pull/324

You can actually get infinite vels/accels right now, with the spline interpolation algorithm.

AndyZe commented 2 years ago

I don't completely understand the loop rate issue. Maybe it would be smart to reject incoming commands if time_since_previous_command is less than 0.8x the control period? Or something like that.

tylerjw commented 2 years ago

I don't completely understand the loop rate issue. Maybe it would be smart to reject incoming commands if time_since_previous_command is less than 0.8x the control period? Or something like that.

This is exactly what I did and it greatly improved the performance.

I'll try to draw a diagram or something to make it clearer. I think the current design is reasonable with regard to loop rates, but when combined with how the interpolation works and position control it creates this nasty behavior.
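
A minimal sketch of the "reject commands that arrive too soon" idea, assuming the 0.8x threshold suggested above. `CommandGate` is a hypothetical helper written for this illustration, not part of ros2_control or of any hardware interface.

```cpp
#include <chrono>

// Hypothetical helper: forwards a command only if enough time has passed
// since the previously forwarded one.
class CommandGate
{
public:
  using Duration = std::chrono::nanoseconds;

  explicit CommandGate(Duration period, double min_fraction = 0.8)
  : min_gap_(std::chrono::duration_cast<Duration>(period * min_fraction))
  {
  }

  // `now` is a monotonic timestamp expressed as time since an arbitrary start.
  bool should_send(Duration now)
  {
    if (has_last_ && (now - last_sent_) < min_gap_) {
      return false;  // too soon after the last forwarded command: drop it
    }
    last_sent_ = now;
    has_last_ = true;
    return true;
  }

private:
  Duration min_gap_;
  Duration last_sent_{0};
  bool has_last_{false};
};
```

A write function could call `should_send()` with its current monotonic time and skip the hardware write when it returns false. This only filters out the too-short deltas; it does nothing about iterations that arrive late.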

tylerjw commented 2 years ago

Here is an example of the result on my hardware. The first column is the time delta in calls to my write function, and the rest are the position delta for each joint.

Here is a chart of the position deltas on the first joint. By detecting really small time deltas and not sending those commands to the robot, I remove the spikes that go down (the ones that go up I don't have a strategy for abating yet).

gavanderhoorn commented 2 years ago

To me, this looks like what I would expect from a non-deterministic system.

Some periods are too short, others too long.

tylerjw commented 2 years ago

To me, this looks like what I would expect from a non-deterministic system.

Some periods are too short, others too long.

I agree, but I think there are things we could do to make this more robust to running on a non-deterministic system. As it is, failures are constant and poorly handled.

gavanderhoorn commented 2 years ago

What do you suggest?

Almost all control systems I am familiar with assume a fixed dT. If that's not achieved, behaviour becomes almost 'undefined'.

Dropping "too old commands" is not really an option when you're working with a position controlled system (as the system-under-control would stop), which makes it difficult to come up with a generic solution.

tylerjw commented 2 years ago

I don't think it is helpful to try to "catch up" in the case of position control, because hardware does assume a fixed dt and will take that as a hard command to stop motion.

Instead, I think we could calculate the start of the next iteration without trying to catch up, so that we never send commands with near-zero dt between them.
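
One way this could look, as a sketch only: when an iteration finishes after its next scheduled start, re-anchor the schedule to the current time instead of keeping deadlines that are already in the past. `next_start` is a hypothetical helper and this is one possible policy, not necessarily what ros2_control does or will do.

```cpp
#include <algorithm>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical helper: given when the previous iteration started, the nominal
// period, and the current time, return when the next iteration should start.
// On time: keep the fixed schedule. Late: start now and measure subsequent
// periods from here, so no near-zero-dt "catch up" iterations follow.
Clock::time_point next_start(
  Clock::time_point previous_start, Clock::duration period, Clock::time_point now)
{
  const Clock::time_point scheduled = previous_start + period;
  return std::max(scheduled, now);
}
```

In a loop this would be used roughly as `start = next_start(start, period, Clock::now()); std::this_thread::sleep_until(start);`. The trade-off is that the loop no longer stays phase-locked to its original schedule after an overrun.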

gavanderhoorn commented 2 years ago

So would you want to make that the responsibility of a hardware interface, controller manager or individual controllers?

I could see dropping references working, or something more intelligent, but that seems like it should be the responsibility of the controller, as that's probably more aware of what desired behaviour should be (it implements how things should be controlled, after all).

Note also: control systems not making their deadlines is a nightmare for control engineers (most forms of (stability) analysis go out-of-the-window), and can easily lead to divergent system behaviour.

tylerjw commented 2 years ago

Why would control engineers choose this then to write their control systems with if they don't start with the assumption that it isn't going to be deterministic. To my understanding ros2_control does not support any deterministic OS schedulers. It is open source and there is nothing stopping someone from making it work in a deterministic system, but by default it targets normal Linux.

Unless it wants to target vxworks or preempt or some other more deterministic system I think it should try to fail softly when it does not meet the loop rate.

tylerjw commented 2 years ago

When you say responsibility of the controller do you mean in this case joint trajectory controller? If the controllers responsibility is to see that it is asked to create a new command at a much shorter than normal dt and it fails the current design of ros2_control has no way of communicating that failure to the hardware controller.

tylerjw commented 2 years ago

If what you mean by controller is the hardware interface, that is what I'm currently doing. However I found this choice to send "catch up" position controls surprising, hence the issue.

gavanderhoorn commented 2 years ago

Edit: and of course: I'm not a maintainer here, so this is just me, another stupid user, and a personal opinion/insight.

Why would control engineers choose this then to write their control systems with if they don't start with the assumption that it isn't going to be deterministic.

I assume you mean to write: "if they can't start with the assumption it's going to be deterministic".

Control engineers I've talked to and who've used ros_control do assume it's capable of providing deterministic behaviour. That's why I started posting comments on your issue: I recognised what you reported.

To my understanding ros2_control does not support any deterministic OS schedulers.

Separation of concerns would suggest that would also not necessarily be a responsibility of ros2_control. There are plenty of other systems capable of scheduling tasks (your OS, for instance). What would it need to explicitly support (not talking about safety-critical systems here, where there could be a tighter coupling between tasks and scheduler)?

It is open source and there is nothing stopping someone from making it work in a deterministic system [..] Unless it wants to target vxworks or preempt or some other more deterministic system

For non-safety-critical systems this would certainly be sufficient in most cases.

By running your controller manager in a thread which is configured with something like FIFO or RR scheduling and an RT priority on a PREEMPT_RT kernel (or Xenomai) you would already get much better behaviour, scheduling-wise, and consequently, more consistent activation times of your control task, with less jitter on periods.
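
For reference, requesting FIFO scheduling for the calling thread on Linux can be sketched as below. The function names are hypothetical; `sched_setscheduler`, `sched_get_priority_min` and `sched_get_priority_max` are the standard Linux/POSIX calls. The request normally fails with EPERM unless the process has the required privileges (for example CAP_SYS_NICE or a suitable rtprio limit).

```cpp
#include <sched.h>

#include <algorithm>
#include <cerrno>

// Clamp a requested priority into the valid SCHED_FIFO range on this system.
int clamp_fifo_priority(int priority)
{
  const int lo = sched_get_priority_min(SCHED_FIFO);
  const int hi = sched_get_priority_max(SCHED_FIFO);
  return std::max(lo, std::min(priority, hi));
}

// Linux-specific sketch: request SCHED_FIFO for the calling thread.
// Returns 0 on success, otherwise the errno value reported by the kernel.
int request_fifo_scheduling(int priority)
{
  sched_param param{};
  param.sched_priority = clamp_fifo_priority(priority);
  // On Linux, pid 0 refers to the calling thread.
  if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
    return errno;
  }
  return 0;
}
```

Calling this at the start of the control thread, and checking the return value, is enough to tell whether the thread actually received an RT priority. It does not by itself make the kernel preemptible; that part still depends on the kernel in use.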

but by default it targets normal Linux.

I don't know whether this is true. I don't believe it was true for ros_control, but perhaps ros2_control has changed this.

When you say responsibility of the controller do you mean in this case joint trajectory controller?

If that would be the active controller, then: yes.

If the controllers responsibility is to see that it is asked to create a new command at a much shorter than normal dt and it fails the current design of ros2_control has no way of communicating that failure to the hardware controller.

unless we're talking about a RR scheduled hard real-time system, a "shorter dT" is not a failure. IIRC, periods are communicated to controllers, via the call to update(..).

Controllers can inspect that argument, and should take the length of the period into account. They probably often do, but it's only part of what's going on: scheduling of the rest of your task is also important.

If the reported period to a controller differs from the actual time between read(..) and write(..), you'll still end up with the spikes you see in your plot.
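
As an illustration of a controller taking the reported period into account, the sketch below limits each position step to what a velocity limit allows over the actual period. `step_towards` and its parameters are hypothetical stand-ins, not ros2_control API.

```cpp
#include <algorithm>

// Hypothetical stand-in for one controller update step.
// Moves `current` towards `target`, but never by more than
// max_velocity * period_s, so a short period yields a small step
// and a long period yields a proportionally larger one.
double step_towards(double current, double target, double max_velocity, double period_s)
{
  const double max_step = max_velocity * period_s;
  const double error = target - current;
  const double step = std::max(-max_step, std::min(error, max_step));
  return current + step;
}
```

With a 0.5 rad/s limit, a 1 ms period allows a step of 0.0005 rad and a 0.1 ms period allows 0.00005 rad. As noted above, this only helps if the reported period matches the real time between read(..) and write(..).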

If what you mean by controller is the hardware interface, that is what I'm currently doing. However I found this choice to send "catch up" position controls surprising, hence the issue.

no, a hardware interface's (or hardware component if you prefer) responsibility should be (in my opinion) transforming domain concepts from ros2_control to-and-from whatever the targeted hw-system uses, combined with offering communication with that system (and some coordination is probably needed, as there are typically things like control modes which need to be switched, and protocols can be quite complex).

Perhaps if the targeted system offers some way of mitigating missed control periods the hw interface could share some responsibility (as it would probably be in a good position to communicate that), but in principle, it would make sense to me for the active controller to compensate for variable dT if possible.

As I wrote: dropping or repeating actuation commands for position controlled actuators will likely result in stuttering. For velocity and force controlled actuators it's less of a problem. That seems like a control decision to me, hence my suggestion to make controllers responsible.

AndyZe commented 2 years ago

Thank you for the reality check @gavanderhoorn :+1:

AndyZe commented 2 years ago

Why would control engineers choose this then to write their control systems with if they don't start with the assumption that it isn't going to be deterministic.

Some robot drivers seem to handle the jitter in commands much better than others. UR and ABB do a great job. I think you're working with a less sophisticated, raw driver that doesn't handle the jitter well.

Which is to say, we usually don't use a realtime kernel with UR robots (for example) because we can get away with it. But the kernel running on the UR pendant itself is realtime, I believe.

gavanderhoorn commented 2 years ago

Some robot drivers seem to handle the jitter in commands much better than others. UR and ABB do a great job

those two robot controllers apply a massive amount of filtering. UR cannot really be configured otherwise, but EGM's can be dialed down significantly, at which point it becomes much more difficult to handle as well.

But they are actually two major exceptions.

Most other external motion interfaces -- RSI, J519, MXT, etc -- are far less forgiving (not at all actually).

Thank you for the reality check @gavanderhoorn +1

well, just to be clear: I did not imply the observations posted by @tylerjw are wrong, and I'm not dismissing his experience at all.

If he/the maintainers/we can come up with a more robust approach to missed or delayed control iterations, that would be great.

I just wanted to get the most obvious solution (ie: use a deterministic environment) out of the way.

tylerjw commented 2 years ago

When I said it does not support other schedulers I was simply referring to how the main loop is written and that it does not call any of the scheduler configuration API functions.

By running your controller manager in a thread which is configured with something like FIFO or RR scheduling and an RT priority on a PREEMPT_RT kernel (or Xenomai) you would already get much better behaviour,

This is what I mean. ros2_control does not do this in the implemented control loop... There is nothing stopping me or anyone else from doing this themselves though.

I do think, though, that the project could make some decisions to fail in a softer way when run on a non-deterministic system, if it doesn't target, out of the box, a system with a scheduler that has options for RR, FIFO, or deadline.

gavanderhoorn commented 2 years ago

This is what I mean. ros2_control does not do this in the implemented control loop... There is nothing stopping me or anyone else from doing this themselves though.

off-topic almost, but you do not need to 'wait' for ros2_control to implement this, nor any particular driver author.

If lifting the complete process "up" to RT priorities with particular scheduling suffices, you should be able to use chrt with suitable parameters.

tylerjw commented 2 years ago

To be clear, I've started working on tooling to build ROS for VxWorks because I want to solve this problem for good and make it accessible. The problem, though, is that I just haven't finished that project.

gavanderhoorn commented 2 years ago

ROS 1 or ROS 2?

Doesn't Wind River already maintain a ROS 2 build for VxWorks 7?

tylerjw commented 2 years ago

My understanding is you wouldn't want to lift the whole process, only the control thread because if you lifted all that ROS stuff you might be back in the same place.

tylerjw commented 2 years ago

ROS 2... I should have googled before I started that. 😢

gavanderhoorn commented 2 years ago

Perhaps taking a look at OROCOS could help at this point.

There was an OROCOS-compatible ros_control ControllerManager component available, and OROCOS separates coordination from computation (and the rest of the Cs) much better than we have ever done in ROS (1 and 2).

tylerjw commented 2 years ago

I will go look at what they've done there. Most of my concern here could be summed up in this statement:

ros2_control could do better to give users better behavior, even on normal Ubuntu.

Your point about solving my problem the more correct way of using a deterministic system is well taken. When I have more time I hope to find ways to do that for ourselves, and also to make it easier for normal users of ros2_control to also run their system that way.

Ubuntu 22 now has an RT kernel they are officially supporting. I should also investigate how easy that is to use, and try to write a ros2_control loop with deadline scheduling.

zultron commented 2 years ago

In my mind, the above discussion divides into two problems.

Relevant to the ros2_control code base, one problem is the absence of any RT threading features in the ROS2 controller_manager node. IMO those features would greatly improve the utility of this code base, since users would be a lot closer to acceptable RT performance right out of the box. Optimizing for the biggest effect on latency vs. implementation effort, several important features worth adding are SCHED_FIFO scheduling, elevated RT scheduling priority, easy granting of CAP_SYS_ADMIN & CAP_SYS_NICE privs for those to either the executable or the user, memory locking to avoid page faults, power management and /dev/cpu_dma_latency, and CPU isolation w/cgroups. Those would make a drastic difference on both PREEMPT_RT and vanilla distro kernels. The cyclictest program implements most of these, and could serve as an example. Alternatively, run the controller_manager in another RT programming environment, such as OROCOS, or (my approach) Machinekit HAL.

The other problem, where commanded position doesn't account for variation in the update period/dT, sounds like an issue in the ros2_controllers repo or @tylerjw 's controller. The controller_manager node passes both the current time and current period (i.e. duration since previous update) to the controllers in its update() method, and it's up to the controllers to do the right thing with that. The joint_trajectory_controller update() function calls its sample() method, which in turn calls its interpolate_between_points() method. The interpolators should use the current time (not the expected update time) as input. Note that other controllers like the forward_command_controller have no concept of a trajectory, and forward whatever command they're told, and it's up to higher-level layers to deal with what command to send for any given update time T.
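
To illustrate why sampling at the current time matters, here is a deliberately simplified sketch that uses linear interpolation between two waypoints. The joint_trajectory_controller uses its own interpolation; `Waypoint` and `sample` below are hypothetical names for this example only.

```cpp
// Hypothetical trajectory point: time in seconds, position in radians.
struct Waypoint
{
  double t;
  double position;
};

// Linear interpolation between two waypoints, clamped to the segment.
// `t` should be the actual current time, not the time the update was
// expected to run, so that the commanded position change matches the
// time that really elapsed.
double sample(const Waypoint & a, const Waypoint & b, double t)
{
  if (t <= a.t) {
    return a.position;
  }
  if (t >= b.t) {
    return b.position;
  }
  const double alpha = (t - a.t) / (b.t - a.t);
  return a.position + alpha * (b.position - a.position);
}
```

On a segment from (0 s, 0 rad) to (1 s, 1 rad), sampling at actual times 0.010 s and 0.0135 s gives a position delta of 0.0035 rad, which reflects the 3.5 ms that really passed rather than a nominal period.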

Also note that nobody has figured out how to (& I suspect never will) achieve perfectly deterministic real time latency in software threads on PC or ARM architectures under a Linux (or Xenomai or any other RT) threading environment(s). It IS fairly easy to find combinations of hardware, OS and OS settings, and RT programming techniques that limit worst-case latency to under 100µs. Controllers must be designed for this, as the joint_trajectory_controller is. The best RT configuration I've ever seen was under 8µs max latency, but in my practical experience, there's actually enough jitter in the controller manager update() return time (up to 60µs on an Atom CPU recently tested) that further hardware, OS & RT application thread optimization won't help beyond some point and we have to start looking into ros2_control & ros2_controllers.
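
For anyone who wants to check their own setup against numbers like these, a rough sketch of computing wake-up latency from recorded timestamps follows. It measures latency as actual wake-up time minus scheduled wake-up time, in the spirit of cyclictest; the struct and function names are assumptions made for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct LatencyStats
{
  std::int64_t max_us;  // worst-case latency
  double mean_us;       // average latency
};

// wakeups_us[i] is the measured wake-up time of cycle i, which was scheduled
// at start_us + i * period_us. All values are in microseconds.
LatencyStats latency_stats(
  const std::vector<std::int64_t> & wakeups_us, std::int64_t start_us, std::int64_t period_us)
{
  LatencyStats stats{0, 0.0};
  if (wakeups_us.empty()) {
    return stats;
  }
  std::int64_t sum = 0;
  for (std::size_t i = 0; i < wakeups_us.size(); ++i) {
    const std::int64_t scheduled = start_us + static_cast<std::int64_t>(i) * period_us;
    const std::int64_t latency = wakeups_us[i] - scheduled;
    stats.max_us = std::max(stats.max_us, latency);
    sum += latency;
  }
  stats.mean_us = static_cast<double>(sum) / static_cast<double>(wakeups_us.size());
  return stats;
}
```

The worst-case value is the one to compare against a hardware requirement; the mean says little about missed deadlines.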

destogl commented 2 years ago

Just for the record, as maintainer of the framework: I fully agree with you guys, and most of these things are actually on my roadmap, but without any timestamp on them, unfortunately.

@tylerjw and @zultron you described some great ideas here, and I appreciate them very much. What I would appreciate even more are some practical examples, documentation on how to set this up, or even a PR to the existing code base. If this means we have to add another version of controller_manager or the ros2_control_node for this, no problem, I am open to all possible solutions.

The best RT configuration I've ever seen was under 8µs max latency, but in my practical experience, there's actually enough jitter in the controller manager update() return time (up to 60µs on an Atom CPU recently tested) that further hardware, OS & RT application thread optimization won't help beyond some point and we have to start looking into ros2_control & ros2_controllers.

You are fully right. The implementation of some (parts of) controllers and of the framework itself really needs some love and a deep check of what is happening in there w.r.t. memory allocation and time performance.

destogl commented 2 years ago

And to answer the question: “There is no expected behavior. You should take care that you never miss the loop rate.” If you are missing it more often, try PREEMPT_RT or run ros2_control on a separate machine. Sorry that I don't have a better solution for now.

Your proposal seems sensible. Would you like to at least write up a concept of it in the roadmap repository?

gavanderhoorn commented 2 years ago

The best RT configuration I've ever seen was under 8µs max latency, but in my practical experience, there's actually enough jitter in the controller manager update() return time (up to 60µs on an Atom CPU recently tested) that further hardware, OS & RT application thread optimization won't help beyond some point and we have to start looking into ros2_control & ros2_controllers.

as another 'reality check' perhaps: in my experience, users with a requirement to reach the kind of determinism and low jitter @zultron describes here skip using a "complex OS" like Linux altogether (patched with PREEMPT or Xenomai or not).

Trying to cover those use-cases with vanilla ros2_control would perhaps not be the best goal for current development.

I would suggest to set a realistic goal, which covers the majority of use-cases ("diff drive on a 2 wheel + 1 caster" and "6 dof industrial robot with an external motion interface" for instance).

Determinism there is important, and low jitter as well, but I doubt control performance in those cases would be very much worse if jitter were a couple hundred microseconds.

tylerjw commented 2 years ago

To be clear, my requirement for jitter for my current piece of hardware is around 1ms.

I am going to work on getting rt_preempt working this week for me to see if I can make that work for my application.

I do think it would be ideal for ros_control to eventually target bare metal for some specific targets using an RTOS. Maybe using micro-ros for the external ros interface.

tylerjw commented 2 years ago

Here is my PR I'm testing with: https://github.com/ros-controls/ros2_control/pull/748

ros-controls / ros2_control

What is the expected behavior for missing loop rate? #736