can_send_frame reporting CAN_ERC_NO_ROOM once TEC gets rolled

edragan commented 1 year ago

Just curious whether anybody else encountered this.

Test setup:

A two-device setup on the CAN bus
Device1 (d1) with MCP2517 sends a periodic message every 50ms
d1 has an instrumentation in place displaying the CAN status obtained via _can_getstatus() each pass through the loop
Device2 (d2) ACKs messages and the two devices can chat for long periods of time as long as there aren't any disturbances on the CAN bus
Short the two CAN wires momentarily

Observe:

_can_getstatus() shows a non-zero TEC (does not reach error passive if the short is short-lived). All well so far.
can_send_frame will start reporting CAN_ERC_NO_ROOM shortly and it will not recover

Running at 125kbps, but it probably happens at any baud.

kentindell commented 1 year ago

The drivers obviously think there are frames in the queue, but they aren't being sent. This can happen if the bus off event is lost (the controller discards all its frames but the driver won't record that) or a transmit event is lost. There is a silicon bug that can corrupt register reads (it's documented in a recent errata list from Microchip) which might trip this.

The drivers contain code to try to help diagnose this (written before the errata list was published). What does the get_diagnostics() call return? If there are above 0 then this shows that the drivers spotted event corruption (the call is documented in the API reference manual on page 7).

Just a note: is "momentarily" in human time or is there a test rig to disconnect it? Once there is a CAN error then the transmitter will go into a loop of retransmits and after 32 errored frames it will go bus off - and then automatically recover.

It will take about 11 milliseconds to automatically recover, so a polling of the status with a 50ms period quite likely won't see that bus off status.

I'm coding workarounds to the silicon bugs right now, so this is all useful to know. I can also add in to the next firmware a call for "was_bus_off()" that keeps a flag to see if it happened.

There are also some SPI test functions in the API (do a dir() on the CAN controller instance: they all begin "testspi") that can read and write to addresses in the chip. We can then find out directly what the state of the controller is when it's refusing to send.

kentindell commented 1 year ago

I just realized that you're using the C API directly and not via MicroPython. D'oh.

So, the diagnostic information is still there: in the target_specific structure of the the controller, with fields seq_bad, txqua_bad and txqsta_bad. If any of these is non-zero then a silicon bug tripped.

Also it's important to set the host CPU to level sensitive interrupts (if they're set to edge triggered then interrupts can be lost, and losing 'went to bus off' interrupt would lead to the queue filling up and not emptying.

The tx_pri_queue.num_free_slots field of the controller structure gives the free space. To get 'no room' that's become 0, but this field can be monitored during your experiments to see if it goes down suddenly, or gradually after each shorting event. That might give a clue to how this is happening.

My overall conclusion is that the bus off event is being lost somehow and the queue is consequently not being cleared. I'm updating the drivers now, so I'm adding a diagnostics field with a bus off counter that remembers how many times it's happened.

kentindell commented 1 year ago

I've had a bit more thought on what could be happening here. When there is a bus-off condition, the controller is supposed to clear the transmit queue. The software also clears its transmit queue, and should report free space (via can_get_send_space) of 32. So after shorting the bus, this is what you should see.

If this is reporting there is space but attempting to send a frame returns CAN_ERC_NO_ROOM then this is because the controller reported its hardware transmit queue is full, which should not happen (bus off should have cleared that queue). This can be checked via the controller diagnostics: controller->target_specific.txqsta_bad will have incremented after each attempt to queue that failed due to this inconsistency.

edragan commented 1 year ago

seq_bad, txqua_bad and txqsta_bad always read 0 in my tests.

It appears that the problem happens even prior to getting to bus off. I set the tx rate to 100ms and the status polling rate to 10ms. With a quick short (done manually) we reach error passive state and then TEC starts to decrement 1 count every 10ms (makes sense given the tx rate). While this is happening, num_free_slots are decrementing 1 every 10ms pointing to the fact that the ISR no longer fires.

The interrupt is edge triggered. The EXTI line I'm wired to does not support level triggering.

I was hoping someone else would run a similar test to see whether they observe the same behavior.

kentindell commented 1 year ago

I think that's the problem then: the interrupt pin on the MCP25xxFD is active low and asserted when one or more interrupt conditions are true: it's designed for a level-sensitive GPIO input. The ISR services one interrupt and assumes that it will be re-raised if other conditions are true. Almost certainly you have more than one interrupt condition true at the same time (errors and transmit events) and the error will be serviced then the ISR quits, leaving INT low, and no interrupt will ever again be re-raised because there is no edge.

You can check this by putting a logic analyzer on to the INT pin from the controller. It's active low, and if it goes low and stays low with this problem then you know you've missed an interrupt and because it's edge triggered, it won't see an edge again.

If you can't make that interrupt level sensitive then there is some work to do: basically, the ISR needs to see that the INT line has gone high before quitting so that there will be an edge again (the race between checking and quitting should be handled by the CPU's interrupt logic: there is normally a latch so that edges in an ISR are picked up and the interrupt re-raised after the ISR quits).

So there are several things to do to address this:

Handle all the outstanding interrupts in the top level handler (mcp2517fd_irq_handler). That means turning the if-then-else chain into a sequence of ifs. There are four interrupts enabled, and this will service up to four interrupt events.
At the end of the 'if' chain, read the INT pin (via a GPIO register) and test to see if it has gone high. If it has gone high then that willl ensure that any new event will cause a falling edge and the CPU interrupt logic will latch this.
If INT is still low at the end of the 'if' chain then a new event occurred during the ISR. If so, repeat the 'if' chain (i.e. there should be a while loop that only terminates when INT is high).
There could be spurious interrupts: something might be handled in the ISR but the edge for it is latched, and the ISR re-raised but now with nothing to do. The while() loop should catch the spurious case because INT will be high at the top of the handler.

If this is what's going on then I'll re-write the handler. It's a bit of a hack but it will allow edge sensitive interrupts to work. Can you check with the logic analyzer to see if INT stays low?

edragan commented 1 year ago

Compensating for the lack of level-triggered interrupt support via changes inside the ISR works.

Thank you for your insights.

kentindell commented 1 year ago

Glad that works. I will roll this fix into the next release.

See: https://github.com/kentindell/canis-can-sdk/issues/3

kentindell commented 1 year ago

@edragan The edge sensitive workaround has now been added to the core IRQ handler with today's new release of the drivers.

kentindell / canis-can-sdk

can_send_frame reporting CAN_ERC_NO_ROOM once TEC gets rolled #2