Closed koenlek closed 4 years ago
You might increase the socketcan buffer size (#102). For arbitration errors please have a look at #264 and the comments. Recovering is discussed in #249 and tracked in #244.
100 Hz should be no problem on a more or less ordered CAN bus. canopen_motor_node works fine with PEAK and 14 messages sent at 100Hz with 500K bitrate (txqueuelength 20). While you might be able to fix the auto-recover from these errors somehow, I would recommend you to review your protocol and adapt it for stable operation.
If you keep sending ROS messages at 10k Hz, the ROS callback system and the CANopen multi-threading won't perform properly anymore (at least on non RT-linux systems). Given that the buffer is big enough, the jitter will increase dramtically.
Thanks for the tips. I've increased buffer size, which seems to lead to improvements. Tomorrow I'll do some further real-world tests. In practice we won't send our messages above 100Hz. The 10k Hz that I ran was only to stress test it (but I have the impression that can actually trigger other types of errors than the ones we had in practice at 100Hz).
You wrote:
While you might be able to fix the auto-recover from these errors somehow, I would recommend you to review your protocol and adapt it for stable operation.
I agree, but the problems are in the CAN bus of the original vehicle, which is already quite erroneous of its own unfortunately... We don't plan to go into rewiring that whole CAN bus, which would be the only solution I'd say.
I tried the autorecover in #249. It works good for recovery (though you'll loose a significant interval of communication) and thus seems useful for us as a fallback.
I did some real-world tests today. Results:
diff --git a/socketcan_interface/include/socketcan_interface/socketcan.h b/socketcan_interface/include/socketcan_interface/socketcan.h
index 96845db..68e270a 100644
--- a/socketcan_interface/include/socketcan_interface/socketcan.h
+++ b/socketcan_interface/include/socketcan_interface/socketcan.h
@@ -53,8 +53,6 @@ public:
}
can_err_mask_t err_mask =
( CAN_ERR_TX_TIMEOUT /* TX timeout (by netdevice driver) */
- | CAN_ERR_LOSTARB /* lost arbitration / data[0] */
- | CAN_ERR_CRTL /* controller problems / data[1] */
| CAN_ERR_PROT /* protocol violations / data[2..3] */
| CAN_ERR_TRX /* transceiver status / data[4] */
| CAN_ERR_ACK /* received no ACK on transmission */
Maybe it would be nice to make it configurable via a rosparam setting to enable (ROS_ERROR) or disable (ROS_WARN only) the escalation of these errors via the error mask?
Please don't confuse this bitmask of reported errors with the handling of these errors. IMHO all errors should get reported.
I fully agree. This was just a quick and dirty hack to confirm that it works for us when removing those two errors. I made it slightly cleaner now:
git diff HEAD~2 -- socketcan_interface/include/socketcan_interface/socketcan.h
diff --git a/socketcan_interface/include/socketcan_interface/socketcan.h b/socketcan_interface/include/socketcan_interface/socketcan.h
index 96845db..277bdda 100644
--- a/socketcan_interface/include/socketcan_interface/socketcan.h
+++ b/socketcan_interface/include/socketcan_interface/socketcan.h
@@ -199,11 +199,16 @@ protected:
if(frame_.can_id & CAN_ERR_FLAG){ // error message
input_.id = frame_.can_id & CAN_EFF_MASK;
input_.is_error = 1;
-
- LOG("error: " << input_.id);
setInternalError(input_.id);
- setNotReady();
+ LOG("error: " << input_.id);
+ if (frame_.can_id & CAN_ERR_LOSTARB ||
+ frame_.can_id & CAN_ERR_CRTL)
+ {
+ LOG("warn: ignoring arbitration (2) or controller (4) error");
+ } else {
+ setNotReady();
+ }
}else{
input_.is_extended = (frame_.can_id & CAN_EFF_FLAG) ? 1 :0;
input_.id = frame_.can_id & (input_.is_extended ? CAN_EFF_MASK : CAN_SFF_MASK);
I can clearly see now that both happen every now and then, but overall communication now remains working.
I am facing same problem as described above by @koenlek for controller issue. With suggested changes to socketcan.h, the error disappears and I can see the CAN message being transmitted on to the bus correctly. However, the intended functionality in the vehicle is not realized. My question is
@koenlek and @ipa-mdl, appreciate your inputs on this
lost arbitration are not fatal anymore (#362)
After improving according to the method you provided, the fault has not been eliminated
Problem
We are sending data from a ros topic to a SocketCAN device via the
socketcan_bridge
. Sometimes, (presumably when the load on the bus and/or our PEAK PCI CAN card is too high) an error triggers which completely breaks allsocketcan_bridge
communication in the direction topic -> socketcan. The only solution is to restart thesocketcan_bridge
node. This affects bothtopic_to_socketcan_node
andsocketcan_bridge_node
. In the case ofsocketcan_bridge_node
, only the communication from topic -> socketcan breaks. The other direction remains functional.We don't always get the same errors. This are the ones I've seen:
The errors will print endlessly (a new error is printed for every msg that we try to send via
topic_to_socketcan
). The communication remains broken (also after lowering the load on the canbus) until we restart thesocketcan_bridge
node.How to reproduce
vcan
(that seems to be able to handle extreme loads without problems).sudo ip link set can0 type can bitrate 500000 triple-sampling on
. I also tried without triple-sampling.rosrun socketcan_bridge topic_to_socketcan_node
/sent_messages
. To enforce triggering it, we send them atrate = rospy.Rate(10000)
, which triggers the error practically immediately. When we send them at 10 hz, all is fine. At 100Hz we have problems in our real system, but not always immediately (seems to be related with how busy it is on the bus, though I'm not sure).Environment:
I am curious if we could increase buffers to fix this issue, or if we could make it at least recover from the error (I guess it all starts with
FAILED system:105
).I'd be willing to contribute a PR to fix it, if people can point me in the right direction.