koenlek commented 6 years ago

Problem

We are sending data from a ros topic to a SocketCAN device via the socketcan_bridge. Sometimes, (presumably when the load on the bus and/or our PEAK PCI CAN card is too high) an error triggers which completely breaks all socketcan_bridge communication in the direction topic -> socketcan. The only solution is to restart the socketcan_bridge node. This affects both topic_to_socketcan_node and socketcan_bridge_node. In the case of socketcan_bridge_node, only the communication from topic -> socketcan breaks. The other direction remains functional.

We don't always get the same errors. This are the ones I've seen:

$ rosrun socketcan_bridge topic_to_socketcan_node
[ INFO][1527859346.894875374][/topic_to_socketcan_node]: Successfully connected to can0.
FAILED system:105
[ INFO][1527859354.097552304][/topic_to_socketcan_node]: State: OK, asio: No buffer space available
[ INFO][1527859354.097612712][/topic_to_socketcan_node]: State: OK, asio: No buffer space available
[ERROR][1527859354.097690590][/topic_to_socketcan_node]: Failed to send message: 203#0c00000000000000.
[ERROR][1527859354.097739552][/topic_to_socketcan_node]: Failed to send message: 203#0c00000000000000.

[ INFO][1527864701.420629833][/topic_to_socketcan_node]: Successfully connected to can0.
error: 2
[ERROR][1527864707.969902135][/topic_to_socketcan_node]: Error: lost arbitration;, asio: Success
[ERROR][1527864707.969948359][/topic_to_socketcan_node]: Error: lost arbitration;, asio: Success
...  Next: many "Failed to send message" errors ...

[ INFO][1527864726.063292969][/topic_to_socketcan_node]: Successfully connected to can0.
error: 4
[ERROR][1527864727.369968480][/topic_to_socketcan_node]: Error: controller problems;, asio: Success
[ERROR][1527864727.370025329][/topic_to_socketcan_node]: Error: controller problems;, asio: Success
...  Next: many "Failed to send message" errors ...

The errors will print endlessly (a new error is printed for every msg that we try to send via topic_to_socketcan). The communication remains broken (also after lowering the load on the canbus) until we restart the socketcan_bridge node.

How to reproduce

You need a phycical CAN card. We have a PEAK PCI CAN card (exact type unknown)
- it won't reproduce against vcan (that seems to be able to handle extreme loads without problems).
sudo ip link set can0 type can bitrate 500000 triple-sampling on. I also tried without triple-sampling.
rosrun socketcan_bridge topic_to_socketcan_node
Send msgs to /sent_messages. To enforce triggering it, we send them at rate = rospy.Rate(10000), which triggers the error practically immediately. When we send them at 10 hz, all is fine. At 100Hz we have problems in our real system, but not always immediately (seems to be related with how busy it is on the bus, though I'm not sure).

Environment:

Ubuntu 16.04
Linux 4.4.0-121-generic #145-Ubuntu SMP Fri Apr 13 13:47:23 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
ros_canopen 0.7.8
ros kinetic

I am curious if we could increase buffers to fix this issue, or if we could make it at least recover from the error (I guess it all starts with FAILED system:105).

I'd be willing to contribute a PR to fix it, if people can point me in the right direction.

mathias-luedtke commented 6 years ago

You might increase the socketcan buffer size (#102). For arbitration errors please have a look at #264 and the comments. Recovering is discussed in #249 and tracked in #244.

100 Hz should be no problem on a more or less ordered CAN bus. canopen_motor_node works fine with PEAK and 14 messages sent at 100Hz with 500K bitrate (txqueuelength 20). While you might be able to fix the auto-recover from these errors somehow, I would recommend you to review your protocol and adapt it for stable operation.

If you keep sending ROS messages at 10k Hz, the ROS callback system and the CANopen multi-threading won't perform properly anymore (at least on non RT-linux systems). Given that the buffer is big enough, the jitter will increase dramtically.

koenlek commented 6 years ago

Thanks for the tips. I've increased buffer size, which seems to lead to improvements. Tomorrow I'll do some further real-world tests. In practice we won't send our messages above 100Hz. The 10k Hz that I ran was only to stress test it (but I have the impression that can actually trigger other types of errors than the ones we had in practice at 100Hz).

You wrote:

While you might be able to fix the auto-recover from these errors somehow, I would recommend you to review your protocol and adapt it for stable operation.

I agree, but the problems are in the CAN bus of the original vehicle, which is already quite erroneous of its own unfortunately... We don't plan to go into rewiring that whole CAN bus, which would be the only solution I'd say.

I tried the autorecover in #249. It works good for recovery (though you'll loose a significant interval of communication) and thus seems useful for us as a fallback.

koenlek commented 6 years ago

I did some real-world tests today. Results:

Autorecover (#249) does not provide a solution, because we often get controller and arbitration errors (during the vehicle re-arm procedure, which causes a peak in traffic in the canbus). This causes an endless cycle of problems. We need to send a certain CAN msg to be able to re-arm (enable) the vehicle, otherwise a watchdog will unarm (disable) the vehicle. So after we set up sending that frame at 100Hz, we re-arm (enable) the vehicle, which causes a peak in traffic on the canbus, which causes controller and/or arbitration errors, which causes CAN communication to break, which triggers the watchdog and unarms (disables) the vehicle again. The Autorecover Pull Request (#249) takes multiple seconds to recover, so that won't fix any of these issues.
Turning Arbitration errors into warnings using #264 brought us one step in the right direction, but we still had controller errors that broke communication.
I did a quick check by removing both arbitration errors and controller errors from the error mask, and now everything works like a charm. The diff is very simple:

diff --git a/socketcan_interface/include/socketcan_interface/socketcan.h b/socketcan_interface/include/socketcan_interface/socketcan.h
index 96845db..68e270a 100644
--- a/socketcan_interface/include/socketcan_interface/socketcan.h
+++ b/socketcan_interface/include/socketcan_interface/socketcan.h
@@ -53,8 +53,6 @@ public:
             }
             can_err_mask_t err_mask =
                 ( CAN_ERR_TX_TIMEOUT   /* TX timeout (by netdevice driver) */
-                | CAN_ERR_LOSTARB      /* lost arbitration    / data[0]    */
-                | CAN_ERR_CRTL         /* controller problems / data[1]    */
                 | CAN_ERR_PROT         /* protocol violations / data[2..3] */
                 | CAN_ERR_TRX          /* transceiver status  / data[4]    */
                 | CAN_ERR_ACK           /* received no ACK on transmission */

Maybe it would be nice to make it configurable via a rosparam setting to enable (ROS_ERROR) or disable (ROS_WARN only) the escalation of these errors via the error mask?

mathias-luedtke commented 6 years ago

Please don't confuse this bitmask of reported errors with the handling of these errors. IMHO all errors should get reported.

koenlek commented 6 years ago

I fully agree. This was just a quick and dirty hack to confirm that it works for us when removing those two errors. I made it slightly cleaner now:

git diff HEAD~2 -- socketcan_interface/include/socketcan_interface/socketcan.h
diff --git a/socketcan_interface/include/socketcan_interface/socketcan.h b/socketcan_interface/include/socketcan_interface/socketcan.h
index 96845db..277bdda 100644
--- a/socketcan_interface/include/socketcan_interface/socketcan.h
+++ b/socketcan_interface/include/socketcan_interface/socketcan.h
@@ -199,11 +199,16 @@ protected:
             if(frame_.can_id & CAN_ERR_FLAG){ // error message
                 input_.id = frame_.can_id & CAN_EFF_MASK;
                 input_.is_error = 1;
-
-                LOG("error: " << input_.id);
                 setInternalError(input_.id);
-                setNotReady();

+                LOG("error: " << input_.id);
+                if (frame_.can_id & CAN_ERR_LOSTARB ||
+                    frame_.can_id & CAN_ERR_CRTL)
+                {
+                  LOG("warn: ignoring arbitration (2) or controller (4) error");
+                } else {
+                  setNotReady();
+                }
             }else{
                 input_.is_extended = (frame_.can_id & CAN_EFF_FLAG) ? 1 :0;
                 input_.id = frame_.can_id & (input_.is_extended ? CAN_EFF_MASK : CAN_SFF_MASK);

I can clearly see now that both happen every now and then, but overall communication now remains working.

Prj1508 commented 4 years ago

I am facing same problem as described above by @koenlek for controller issue. With suggested changes to socketcan.h, the error disappears and I can see the CAN message being transmitted on to the bus correctly. However, the intended functionality in the vehicle is not realized. My question is

Should I continue to debug my implementation and focus on functionality issue rather than controller problem issue?
How to ensure controller issue is not affecting my functionality implementation? Are the two not related to each other as I have the CAN message transmitted correctly?

@koenlek and @ipa-mdl, appreciate your inputs on this

mathias-luedtke commented 4 years ago

lost arbitration are not fatal anymore (#362)

daijiaoshou commented 1 year ago

After improving according to the method you provided, the fault has not been eliminated

ros-industrial / ros_canopen

`topic_to_socketcan` communication breaks under high load #285

Problem

How to reproduce