Lagging, buffering and freezing behaviour in DS402 motor control with PDOs

lkm1321 commented 9 months ago

Hi,

I'm observing strange buffering + freezing behaviour in motor control with DS402.

We have 8 motors, 4 in velocity mode (mode 2), and 4 in cyclic synchronous position or profiled position mode (mode 1 or 8). All target positions/velocities have been mapped to RPDOs, and we can see that in cansniffer output. The bus operates at 500kbps with a PEAK PCI module with vendor-supplied kernel driver. We have previously worked with 4 motors in velocity modes quite successfully using the same motor drive vendor + PEAK PCI + ros_canopen.

We have a non-CANOpen joystick controlling these motors through ros_control. After sending ~30s of commands with joystick, the commands seem to pile up in that the motors execute previous joystick commands for a very long time after commands are no longer sent or a zero command is sent (~1 min). The joystick publishes at around 40 Hz. For us, the desired behaviour is to always follow the latest joystick command even if that means dropping some messages.

We believe this is an issue in either ros_canopen or the kernel because:

During the "piled-up" period, we do see changes in DS402 commands (the PDO values) from cansniffer output, so drive firmware is not the culprit and
Unloading the ros_control controller during the "piled-up" period does not stop the variations in DS402 commands, so four_wheel_steering_controller is not the culprit

We also tried placing debug statements inside the ros_control controller, and saw there are freezes in between update calls (I assume this runs on the same thread as the DS402 layer).

Can you suggest some places we can debug? Anywhere there are buffers that might be piling up, or mutex locks that may pile up with more nodes.

Thanks, Brian Lee

mathias-luedtke commented 9 months ago

It is hard to tell what's going on as every vendor handles things a little bit differently, but I will try to give some pointers.

cyclic synchronous position or profiled position mode (mode 1 or 8).

Do not use profiled position mode, it is not meant for updating the target constantly (-> use IP or CSP instead). It is meant to send only a single command and let the motor reach the target before send a new one. It works be filling a set-point buffer on the device, the length must be configured properly in the device or EDS. ros_canopen resets the buffer, but perhaps this is not working properly in your case.

the commands seem to pile up in that the motors

Do you see that for all motors and CSP and PP mode?

Anywhere there are buffers that might be piling up

There is not much buffering done in ros_canopen. All data is written at the end of update(). SocketCAN has a buffer, too. But be default it should be rather small (10 messages?).

mutex locks that may pile up with more nodes

Most of the locks are short-lived, just long enough to update the command values etc.

All target positions/velocities have been mapped to RPDOs, and we can see that in cansniffer output

Synchronous PDOs? Have you checked the intervals? All data from the devices should arrive in the first quarter of the sync interval. ros_canopen will process the data in the middle of the interval and the bus must be fast enough so that the devices can load the data before the next sync.

and saw there are freezes in between update calls

update() must not freeze. Please make sure that there are no SDOs for anything motor related, as this will block the update loop. (SDOs to the error register are run in a different thread). However, this will not pile up commands, but it might delay the processing.

Can you suggest some places we can debug?

Test with a single motor and a controller which lets you choose the command, e.g JTC or the topic-based controllers. Then analyse the CAN data and check if the commands change at all.

lkm1321 commented 9 months ago

Thanks for the prompt reply! I will stick to CSP, revise socketcan queue length, and test with JTC and topic controllers and report back. Now that you mention the socketcan queue, we may have made that too large.

Below are some clarifications.

Do you see that for all motors and CSP and PP mode?

Yes.

Synchronous PDOs? Have you checked the intervals? All data from the devices should arrive in the first quarter of the sync interval. ros_canopen will process the data in the middle of the interval and the bus must be fast enough so that the devices can load the data before the next sync.

The slave devices are configured as asynchronous PDOs with 100ms update rate. We tried setting sync/interval_ms to 0 and also 10 and 100 with no luck. Am I correct in guessing that, when using sync/interval_ms = 0, ros_canopen will process the data in the middle of the interval based on update_ms instead?

Related question: do different PDOs run on different threads? What constitutes own thread in canopen_motor_node?

update() must not freeze. Please make sure that there are no SDOs for anything motor related, as this will block the update loop. (SDOs to the error register are run in a different thread). However, this will not pile up commands, but it might delay the processing.

We actually tried removing the EMCYLayer after seeing SDOs to the error register with no luck. But it's good to know that the error register SDOs are on separate thread and should not block.

mathias-luedtke commented 9 months ago

The slave devices are configured as asynchronous PDOs with 100ms update rate

Why asynchronous? And why only at 10Hz (which limits your control to 5Hz..) Please try synchronous PDOs or go for a higher PDO rate (80Hz?)

sync/interval_ms = 0, ros_canopen will process the data in the middle of the interval based on update_ms instead?

No, then PDOs get processed and sent at the update interval.

do different PDOs run on different threads?

No. PDOs are received in the CAN threads and written to the buffer. Processing and sending is happening in a different buffer.

What constitutes own thread in canopen_motor_node?

This depends a little bit on the configuration.

The CAN driver is run in a separate thread. The SocketCAN driver uses boost::asio, which runs additional threads. Then there the foreground thread for ROS callbacks (timers, subscriptions, servers).

CANopen layers (including ros_control) are run in one separate thread, basically a read, update, write cycle. This is the main thread for all data processing.

The controllers often spawn additional threads, for example for action servers.

But it's good to know that the error register SDOs are on separate thread and should not block.

SDOs will always block. That's why all data read or sent for control, should better be mapped to PDOs. Otherwise the update function will block.

lkm1321 commented 8 months ago

Thanks Mathias, no buffering now with synchronous PDOs with a lower sync interval and a revised txqueuelen for the CAN netdev. I am closing the issue.

ros-industrial / ros_canopen

Lagging, buffering and freezing behaviour in DS402 motor control with PDOs #481