Race condition in rf12_canSend()

jcw commented 11 years ago

see discussion at http://openenergymonitor.org/emon/node/1051?page=3

CapnBry commented 11 years ago

I don't know if you're taking comments on this issue, but I've also seen lockups in the rf12_xmit() function, but not necessarily in rf12_canSend(). I use a variant of your jee-code for my itplus transmission system in my project. I don't use canSend(), rather I just sendStart() whenever I damn well please. I occasionally would lock in the rf12_xfer(RF_XMITTER_ON), I assume because an interrupt had fired and eaten the flag. This only happens on transmitters, not receivers.

This code I've been testing for over a week transmitting at 1 second intervals and I've not run into the problem again, however due to the extremely low chance of a collision there's no telling if this is actually a fix.

jcw commented 11 years ago

Thanks, I like your changes - the mistake I made in the code is that rf12_xfer itself is also susceptible to interrupts. Will incorporate something similar, just to err on the safe side.

jcw commented 11 years ago

On second thought... none of these three cases appear to actually have a race condition. But calling rf12_sendStart without a rf12_canSend to prepare things is problematic (sendStart makes sure there will be no further interrupts).

So my suggestion is: please try with the fix to rf12_canSend applied and then do call it. I should perhaps add an extra wrapper to make the sending step a simple one-liner.

CapnBry commented 11 years ago

The issue with using the rf12_byte() (which never asserts !CS) is something I brought up a while ago but all rf12_xfer operations are vulnerable to lockups. You may notice that my rf12_canSend() never even checks the RSSI bit.

The lockup in my case is occurring in rf12_byte() called by rf12_xfer(RF_IDLE_MODE). rf12_xfer is always in danger of being preempted regardless of the rxstate or radio mode. An interrupt can occur for the low battery condition, an underflow or overflow, a byte received, etc and therefore must be guarded by disabling interrupts for the duration of the 16 bit transfer when not called from the ISR.

It is mentioned on that thread that the while (!(SPSR & _BV(SPIF)); loop can be interrupted by the ISR at any time. The ISR transfers a byte to the rf12, which leaves SPIF cleared (by the read of SPDR), and returns to the other code, which sits spinning forever waiting for SPIF. SPIF won't ever be set because the transfer in the interrupt reset it.

There's also the chance of the transfer being interrupted between the two bytes of the 16 bit transfer if an interrupt occurs between them, even one for another SPI device.

So all xfers from the "main thread" need to be guarded, not just the RSSI check.

jcw commented 11 years ago

Hm, yes - low battery, that too was added later. You're right. Not certain about the others yet, because reception and underflow are only possible when the radio is in specific states. I also agree that the 16-bit xfer is another source of race conditions.

Ok, I'll keep this issue open. It clearly needs more work.

JohnOH commented 9 years ago

Umh, did this come to a resolution?

CapnBry commented 9 years ago

I wrapped the mode switch in an atomic block and the code has been in production without fail, even in numerous transmitter / receiver environments without any lockups for months and months. https://github.com/CapnBry/HeaterMeter/blob/master/arduino/libraries/rf12_itplus/rf12_itplus.cpp#L357

I did that and in rf12_sendStart() as well even though theoretically there is a very very low chance of there being an interrupt if the receiver is off. My code is a little different though because it does the IT+ protocol rather than the Jeelab protocol.

marcolettieri commented 7 years ago

Hello, we're probably experiencing similar problems. Like Open Energy Monitor discussion some of our devices, with no apparent reason, cease to communicate each other. simply electrically restart solves the problem immediatly but is a poor solution. We would like to avoid this restart obviusly. I don't know if it could be interesting: we have noted, using an oscilloscope, that on the question of 0x0000 by the processor (read status) the antenna responds 575 (0000001000111111) that it seems means FIFO EMPTY. It's quietly difficult to debug this issue, it happens apparently random. To exclude that our issue is related a race condition in rf12_canSend() we have simply force atomic code like @CapnBry suggested? is that the solution for this issue?

CapnBry commented 7 years ago

Hi Marco and Jean Claude, I'm still using the code with ATOMIC_BLOCK()s on rf12_recvStart, rf12_canSend and rf12_sendStart and still have had no lockups on any of my transmitting or receiving nodes in years of operation. The power has gone out a dozen times here over the past few years, but this isn't really an uptime thing as much as it is a eventually-it-will-happen thing. I have 9 modules all communicating with a transmit frequency between 4 seconds and 30 seconds depending on the application.

Oddly enough, I just bought an emonTx and put it in service last week. However, I have the RFM69 turned off because I have nothing to receive it @433MHz. I would recommend that you try adding the 3 atomic blocks to your source code Marco and seeing if it fixes your lockup issues (if the code is still relevant now that the source has changed a lot for the RF69). My source code for reference, ignore the drssi bit.

jcw commented 7 years ago

The atomic wrapper looks like an excellent solution.

jeelabs / jeelib

Race condition in rf12_canSend() #33