jgromes / RadioLib

Universal wireless communication library for embedded devices
https://jgromes.github.io/RadioLib/
MIT License
1.56k stars 388 forks source link

Race condition in interrupt examples #657

Closed matthijskooijman closed 1 year ago

matthijskooijman commented 1 year ago

Describe the bug The interrupt examples use an enableInterrupt variable to suppress interrupt handling while the result of a previous interrupt is handled and a new request (TX/RX/CAD/etc) is being configured. Typically, the flow is something like:

if (IrqFlag) { // Check flag set by ISR
  enableInterrupt = false;
  irqFlag = false;  
  ... // Handle event
  state = radio.startChannelScan();
  ... // print state
  enableInterrupt = true;
}

Here, a race condition occurs when the IRQ is triggered after the call to e.g. startChannelScan() and before enableInterrupt is set again. Usually this does not happen, but since timings like these can vary in different circumstances, there is a window for problems. And if this problem occurs, the IRQ will not be handled and the code will lock up, indefinitely waiting for the IRQ that already happenedd.

On the STM32WL with CAD I've ran into this problem in practice. I've seen that the IRQ happens after around 20ms or so (so that's a lot quicker than a TX and, in most cases an RX interrupt) and (for some reason I have not investigated), the "print state" step is usually fast but consistently takes around 50ms on the 18th (or so) try.

To confirm this is wat happened, I applied the following patch:

--- a/examples/STM32WLx/STM32WLx_Channel_Activity_Detection_Interrupt/STM32WLx_Channel_Activity_Detection_Interrupt.ino
+++ b/examples/STM32WLx/STM32WLx_Channel_Activity_Detection_Interrupt/STM32WLx_Channel_Activity_Detection_Interrupt.ino
@@ -47,6 +47,9 @@ void setup() {
     Serial.print(F("failed, code "));
     Serial.println(state);
   }
+  pinMode(2, OUTPUT);
+  pinMode(3, OUTPUT);
+  pinMode(4, OUTPUT);
 }

 // flag to indicate that a packet was detected or CAD timed out
@@ -60,6 +63,7 @@ volatile bool enableInterrupt = true;
 // IMPORTANT: this function MUST be 'void' type
 //            and MUST NOT have any arguments!
 void setFlag(void) {
+  digitalWrite(3, !digitalRead(3));
   // check if the interrupt is enabled
   if(!enableInterrupt) {
     return;
@@ -67,6 +71,8 @@ void setFlag(void) {

   // something happened, set the flag
   scanFlag = true;
+
+  digitalWrite(4, !digitalRead(4));
 }

 void loop() {
@@ -99,6 +105,7 @@ void loop() {

     // start scanning the channel again
     Serial.print(F("[STM32WLx] Starting scan for LoRa preamble ... "));
+    digitalWrite(2, HIGH);
     state = radio.startChannelScan();
     if (state == RADIOLIB_ERR_NONE) {
       Serial.println(F("success!"));
@@ -106,6 +113,7 @@ void loop() {
       Serial.print(F("failed, code "));
       Serial.println(state);
     }
+    digitalWrite(2, LOW);

     // enable interrupt service routine
     enableInterrupt = true;

Which produced the following trace on my logic analyzer:

image

In other words: slowness in serial printing (which might happen for various reasons, especially if USB Serial is involved) causes the code to lock up.

To Reproduce This can be easily reproduced by taking e.g. the SX126x_Channel_Activity_Detection_Interrupt example and adding a delay to fake slow serial printing:

--- a/examples/SX126x/SX126x_Channel_Activity_Detection_Interrupt/SX126x_Channel_Activity_Detection_Interrupt.ino
+++ b/examples/SX126x/SX126x_Channel_Activity_Detection_Interrupt/SX126x_Channel_Activity_Detection_Interrupt.ino
@@ -121,6 +121,7 @@ void loop() {
       Serial.print(F("failed, code "));
       Serial.println(state);
     }
+    delay(20);

     // enable interrupt service routine
     enableInterrupt = true;

Alternatively, slow down serial printing by repeating "success" a few times (this depends on the board and serial driver, I've tested this on AVR HardwareSerial (Arduino Uno), it might not work on native USB serial ports or serial ports with larger buffers).

--- a/examples/SX126x/SX126x_Channel_Activity_Detection_Interrupt/SX126x_Channel_Activity_Detection_Interrupt.ino
+++ b/examples/SX126x/SX126x_Channel_Activity_Detection_Interrupt/SX126x_Channel_Activity_Detection_Interrupt.ino
@@ -116,7 +116,7 @@ void loop() {
     Serial.print(F("[SX1262] Starting scan for LoRa preamble ... "));
     state = radio.startChannelScan();
     if (state == RADIOLIB_ERR_NONE) {
-      Serial.println(F("success!"));
+      Serial.println(F("successsuccesssuccesssuccesssuccess!"));
     } else {
       Serial.print(F("failed, code "));
       Serial.println(state);

With either of these patches, the sketch can do CAD once and then locks up:

[2023-01-06 18:38:11] [SX1262] Initializing ... success!
[2023-01-06 18:38:11] [SX1262] Starting scan for LoRa preamble ... success!
[2023-01-06 18:38:11] [SX1262] Channel is free!
[2023-01-06 18:38:11] [SX1262] Starting scan for LoRa preamble ... successsuccesssuccesssuccesssuccess!

To fix To fix, the enableInterrupt = true line should be moved up. At least before the serial printing, but that still leaves a small race condition (e.g. when some other IRQ happens at the right moment that takes 20ms). To really fix this, enableInterrupt must be before the call to startChannelScan(), but I'm not sure if this would introduce other problems.

In general, I wonder if enableInterrupt is really needed at all, though. If the hardware guarantees that only one IRQ is triggered for every operation requested, then I think there would be no need to suppress interrupts? But I guess it was introduced for a reason....

jgromes commented 1 year ago

In general, I wonder if enableInterrupt is really needed at all, though.

You're corect, it is not needed. I think the original idea was in Receive_Interrupt, to make sure reception of a second packet doesn't start before the previous one was processed. But that doesn't really make sense, since to receive a packet, the device would first have to set back to Rx mode e.g. by calling startReceive(), which obviously cannot happen until the first packet is processed. I guess I then cargo-culted this flag variable into every other interrupt example.

I tested all the examples for SX127x without the enable flag, and they are working correctly. I was also able to force the race condition by including a delay, and fix it by removeing the enable flag.

Thanks for reporting this, I will remove the enable flag from all the examples.

jgromes commented 1 year ago

All removed now, thanks for reporting.