Suppress TX transfers on CAN bus error state

PeterBowman commented 4 years ago

Sometimes, bad things happen on a CAN network. CAN bus controllers store a record of RX/TX errors and report these mishaps via API calls. The CanBusPeak device learned to expose them at https://github.com/roboticslab-uc3m/yarp-devices/issues/242:

https://github.com/roboticslab-uc3m/yarp-devices/blob/8a3505927b54b0b4f2854aa1ceb3296186677f2d/libraries/YarpPlugins/CanBusPeak/ICanBusErrorsImpl.cpp#L37

Such errors are acted upon by CanBusControlboard's write threads. Availability of the CAN network is taken for granted in case no errors are reported by the instantiated CAN bus device. That is, the following check targets the specific scenario of a power-off condition (most usually, the emergency button was pressed). This is to prevent the bus device from attempting futile CAN write operations, which may ultimately lead to the saturation of the physical TX buffer.

https://github.com/roboticslab-uc3m/yarp-devices/blob/5f2d4272bbe2e064c8c2a832a754d04b49299e82/libraries/YarpPlugins/CanBusControlboard/CanRxTxThreads.cpp#L152-L158

However, we learned that sporadic bus errors can arise at any time and contribute to the overall error counter. This counter is never reset until the CAN bus device is properly closed. In case a pre-defined threshold is reached, CanBusPeak raises the internal passive error flag and triggers the above check despite these errors not posing an actual power-off state. As a result, the accumulation of negligible CAN errors may disable write operations altogether.

A workaround was introduced at https://github.com/roboticslab-uc3m/yarp-devices/commit/7c12f50bda9421fc45d7cc31e60431f70526a7ee. However, we lost the ability to suppress unwanted CAN writes on power-off. This issue seeks to restore the previous behavior while enabling normal operation regardless of the number of casual CAN errors.

PeterBowman commented 4 years ago

Sometimes, bad things happen on a CAN network.

I didn't specify, but those bad things can also span to a simple bus off state, which we are unable to detect programatically. For instance, it would be great to stop listening to RX/TX events in case the emergency stop button is pressed. That means for us that no CAN communications are allowed nor expected. However, it seems the PCANFD_ERROR_BUSOFF is only meaningful (not verified, though) whenever the physical CAN card is disconnected from the PC board, or perhaps some upstream wiring issue occurs.

In other words: somehow determine whether CAN nodes are powered off and act upon conveniently (i.e. pause comms).

PeterBowman commented 3 years ago

I have analyzed the behavior of the PEAK driver on an emergency stop condition:

https://github.com/roboticslab-uc3m/yarp-devices/blob/5b8dd5e01371fcef8777420b4a49ba7bb1c704c3/libraries/YarpPlugins/CanBusPeak/ICanBusErrorsImpl.cpp#L30-L39

At first, the TX error counter raises instantaneously to 128 (error passive state threshold), it remains so unchanged. A few iterations later (50-60), the TX pending messages counter starts and grows until 500, which is the buffer limit. Then, non-blocking calls would block, i.e. the following condition is triggered:

https://github.com/roboticslab-uc3m/yarp-devices/blob/5b8dd5e01371fcef8777420b4a49ba7bb1c704c3/libraries/YarpPlugins/CanBusPeak/ICanBusImpl.cpp#L227-L231

Bus-off is never notified. A similar behavior has been observed in the new CanBusSocket device: https://github.com/roboticslab-uc3m/yarp-devices/issues/251#issuecomment-889144688.

PeterBowman commented 3 years ago

Monitoring passive error state (as opposed to bus-off condition) is doable and quite easy for both CanBusPeak and CanBusSocket devices, but for some reason I opened this issue since it wasn't enough due to the bad shape of our wiring. Perhaps we should revisit this decision, otherwise I'd remove error checking altogether since it's not being used anymore.

PeterBowman commented 2 years ago

This page explains the protocol-compliant error treament for CAN messages (see also other links at https://github.com/roboticslab-uc3m/yarp-devices/issues/251#issuecomment-887518098). It is desired to switch the controller into bus-off state as soon as the kill switch is pressed, but this is not possible: our hardware should generate a specific signal on button press, then this signal should be intercepted somehow in order to change the controller state via system calls (which I don't see any).

The main goal behind switching to bus-off is to destroy all CAN traffic, i.e. stop CAN writes altogether. I fear that incoming messages will fill the TX buffer, which as soon as the kill switch is released will resume the send operations and command the drives in ways that shouldn't happen.

PeterBowman commented 2 years ago

A few iterations later (50-60), the TX pending messages counter starts and grows until 500, which is the buffer limit.

Based on this, I can assume outgoing messages are queued in the TX buffer. It has been confirmed that Socket CAN sends queued messages as soon as the button is released, and I can also assume the same applies for Peak.

I see three ways to tackle/circumvent/ignore the original problem (mind this essay on error handling):

(tackle) Relay kill switch signals to the CPU and trigger a linked software event that prevents further writes. Seems hard at best, I'd deem it even undoable.
(circumvent) Lower the panic threshold. Since we can't easily nor feasibly detect CAN bus off condition (>= 256 RX/TX errors), make things explode at the PASSIVE error state (>= 128 RX/TX errors). However, it wouldn't prevent the TX buffer from sending stuff on button release, and it led to stability issues in the past due to faulty wiring that triggered error passive state too soon (see issue description). It could be possible to enable this behavior conditionally via YARP .ini option (similarly to message filtering).
(ignore) Assume we can't respond accordingly at all and drop error handling altogether. If anything fishy happens, at least CanBusSocket is wise enough to interpret and log the error.

I'm going with 3 since I'm not keen on preserving dead/unused code in CanBusControlboard.

PeterBowman commented 2 years ago

Following up on these:

As found out by @rsantos88, BUS-OFF is in fact signaled (at least in SocketCAN):

WhatsApp Image 2021-09-21 at 12 51 32

WhatsApp Image 2021-09-21 at 12 56 34

On another test run, CAN overrun errors were reported (ref).

I need to investigate this one more time. Idea: use restart-ms to exit BUS-OFF in an automatic fashion. See:

PeterBowman commented 2 years ago

Idea: use restart-ms to exit BUS-OFF in an automatic fashion.

There is something fishy with can1 (left leg), CAN message overruns are observed on a regular basis. I have configured all CAN interfaces to start with the restart-ms 100 option and it seems to help a lot in the faulty limb. It is quite common to see several warnings regarding error-passive state and recovery, which in certain cases (higher bus load, e.g. when yarpmotorgui is running) leads to saturation and bus-off. The automatic restart helps mitigate this undesired behavior, although sometimes it doesn't prevent certain joints from not being able to start anew.

PeterBowman commented 1 year ago

I didn't specify, but those bad things can also span to a simple bus off state, which we are unable to detect programatically. For instance, it would be great to stop listening to RX/TX events in case the emergency stop button is pressed. That means for us that no CAN communications are allowed nor expected.

Perhaps the TechnosoftIpos subdevices could feed their heartbeat status back to the calling CanBusControlboard's threads?

PeterBowman commented 1 year ago

Done at https://github.com/roboticslab-uc3m/yarp-devices/commit/e935e364751c86112b5dead877e3ee2a62df6999. This approach adds a reportAvailability method to ICanSenderDelegate to be used by wrapped nodes for signalling that they are unavailable (using heartbeat reports). This simplifies and reduces the amount of checks done by YarpCanSenderDelegate, which just needs to test a boolean atomic value. I have verified that candump now prints a very limited number of SYNC messages upon restart (those can't be fully avoided due to the delay imposed by the heartbeat checks, i.e. this period is always bigger than the SYNC one).

Just for the record, I tried another approach in https://github.com/roboticslab-uc3m/yarp-devices/commit/83d159574b08f0dce23f09c9c828208d6fc75682 that inverted the logic (the sender delegate tests the availability of each node) and would impose a lot more mutex action.

roboticslab-uc3m / yarp-devices

Suppress TX transfers on CAN bus error state #246