mysensors / MySensors

MySensors library and examples
https://www.mysensors.org
1.31k stars 892 forks source link

Deadlock in NRF5_ESB_sendMessage #1410

Closed fonitzu closed 3 years ago

fonitzu commented 4 years ago

Hi,

I see from time to time a deadlock in NRF5_ESB_sendMessage(): https://github.com/mysensors/MySensors/blob/8103c6cf6bcb3e271df08b17d0fb2fee3b444c86/hal/transport/NRF5_ESB/driver/Radio_ESB.cpp#L470

For whatever reason, my nRF52840 doesn't get the interrupt, events_end_tx remains false and the while loop never terminates.

I'm using a nRF52840 Dongle, PlatformIO and a J-Link emulator. I can debug the code, but I don't know Nordic chips, so any help is welcome.

fonitzu commented 4 years ago

Today tripped again. It looks like the radio is powered on but all registers (I have verified them) have their reset values. It does not happen in all builds, I had one running flawless for hours. deadlock

petewill commented 4 years ago

@fvpetrov I don't know if this is the same issue but you may want to try the fix noted here: https://github.com/mysensors/MySensors/issues/1371

I was getting frequent locks/freezing with my NRF52832 and after I changed the interrupt order my device has been running without any issues for months.

fonitzu commented 4 years ago

Thank you Pete, I will give it a try.

Somehow, this issue depends on what I build:

fonitzu commented 4 years ago

I have added tracking to the interrupts, see:

https://github.com/fvpetrov/MySensors/blob/issue/1410-nrf5-Deadlock/hal/transport/NRF5_ESB/driver/Radio_ESB.cpp

The flow in a deadlock situation is attached. Somehow the code ends in _stopACK which is undocumented but shuts off the radio and then switches to Rx mode, letting transmission in a zombie mode. Unfortunately, it looks like the timer does not fire, otherwise transmission would be restarted.

For now, the safest solution is to implement a wait timeout and return false in NRF5_ESB_sendMessage().

deadlock_track.txt

phodgers commented 3 years ago

I have a test setup with about 8 NRF5 devices (mix of NRF15822 and NRF52832). I have each device sending out a message every 1 second. I see lock up with both tyres of devices (more often on the NRF52832), usually within 15-30 min for first device to lock up. I have seen other threads on the forum with unexplained freezing of NRF5 devices and this all seems to sound like this same issue. It's pretty major as the devices need a power cycle to reset.

I tried the 'fix' of #1371 but it didn't seem to make any difference.

@fonitzu could you give an example of the wait timeout and return false workaround you mention above? Is this added in the logic for the NRF5_RADIO_TIMER_IRQ_HANDLER() function at the end of the Radio_ESB.cpp file?

fonitzu commented 3 years ago

@phodgers The workaround was implemented by myself, take a look here: https://github.com/fonitzu/MySensors/blob/issue/1410-nrf5-Deadlock/hal/transport/NRF5_ESB/driver/Radio_ESB.cpp#L488-L507

The idea is to see if the microcontroller is busy for more than 100 milliseconds and break the loop in this case. It's not the cleanest workaround but will help (or at least, I hope so :) )

Some more work is required to cleanly fix up this issue and I was afraid to change the code (there are some good parts of the code to rewrite) since I only have 1x nRF52840. If you (or anyone else) would like to help with testing code on your sensors, I can rewrite the code.

d00616 commented 3 years ago

@fonitzu gread work. I try to support you with testing. Over the years had no idea if the Rx or Tx side is hanging.

I have an actor with 13 outputs sending states back after a change. When I trigger the all 13 outputs without delays the RF-Network stops working mostly on both, sender and receiver. If I add some delay between triggering the output, the connections is much more stable.

I don't know what the problem is, but maybe it's enough to change https://github.com/fonitzu/MySensors/blob/issue/1410-nrf5-Deadlock/hal/transport/NRF5_ESB/driver/Radio_ESB.cpp#L481 to "NRF_RADIO->RXADDRESSES = (1 << NRF5_ESB_TX_ADDR);" to listen only on the TX-Adress. Maybe the radio jumps into an unknown state if it receives something other than ACK. I have added the functionality as an enhancement. The NRF24 don't listen at the own address after TX.

fonitzu commented 3 years ago

@d00616 I have figured in April how the Tx routine dead-ends, but I wasn't able to figure out the intention of the original programmer at that point.

There is a second issue with a timer being started but the interrupt handler routine doesn't process the timer interrupt, see https://github.com/fonitzu/MySensors/blob/2a6b29b07a41ea4b1e6710a5239809bf826f04f2/hal/transport/NRF5_ESB/driver/Radio_ESB.cpp#L771 The ISR takes the if branch if EVENTS_COMPARE[3] is set, but EVENTS_COMPARE[1] is cleared instead, firing the interrupt again and again.

I will "reload" all my knowledge about this project in this weekend and will figure out a solution.

d00616 commented 3 years ago

@fonitzu I'm the original programmer. My itention was to let the CPU sleep as long as possible and use the capabilities provided by the NRF5X. As far I can remember I had changed the used timers to avoid conflicts with other applications or the SoftDevice. This mistake can be the consequence.

phodgers commented 3 years ago

@phodgers The workaround was implemented by myself, take a look here: https://github.com/fonitzu/MySensors/blob/issue/1410-nrf5-Deadlock/hal/transport/NRF5_ESB/driver/Radio_ESB.cpp#L488-L507

The idea is to see if the microcontroller is busy for more than 100 milliseconds and break the loop in this case. It's not the cleanest workaround but will help (or at least, I hope so :) )

Some more work is required to cleanly fix up this issue and I was afraid to change the code (there are some good parts of the code to rewrite) since I only have 1x nRF52840. If you (or anyone else) would like to help with testing code on your sensors, I can rewrite the code.

I implemented the suggested fix, but unfortuantely it did not stop me locking up.

phodgers commented 3 years ago

@fonitzu gread work. I try to support you with testing. Over the years had no idea if the Rx or Tx side is hanging.

I have an actor with 13 outputs sending states back after a change. When I trigger the all 13 outputs without delays the RF-Network stops working mostly on both, sender and receiver. If I add some delay between triggering the output, the connections is much more stable.

I don't know what the problem is, but maybe it's enough to change https://github.com/fonitzu/MySensors/blob/issue/1410-nrf5-Deadlock/hal/transport/NRF5_ESB/driver/Radio_ESB.cpp#L481 to "NRF_RADIO->RXADDRESSES = (1 << NRF5_ESB_TX_ADDR);" to listen only on the TX-Adress. Maybe the radio jumps into an unknown state if it receives something other than ACK. I have added the functionality as an enhancement. The NRF24 don't listen at the own address after TX.

I tried to change the line to "NRF_RADIO->RXADDRESSES = (1 << NRF5_ESB_TX_ADDR);" but then my device would not join my controller. I then changed to "NRF_RADIO->RXADDRESSES = (1 << NRF5_ESB_NODE_ADDR) | (1 << NRF5_ESB_TX_ADDR);" and it joined my controller, but after a while also froze again. So for me this approach did not work either.

phodgers commented 3 years ago

I am using the MDBT42Q modules https://shop.espruino.com/mdbt42q-breakout unless there is some issue with that HW. Although I have also had problems with lockup with smart watch based on nRF51822, though not currently testing on that HW.

phodgers commented 3 years ago

My controller is RPi with EByte nRF24L01+, so it is mixed in terms of nRF24 and nRF5x devices. It all seems OK until the nRF5 devices lock up at random intervals.

phodgers commented 3 years ago

I have some follow up on the issues, and I think the reason as to why these devices are locking up, for me at least, but possibly others also. TLDR oscillator error seems to be the issue to cause messages to trample over each other.

I noticed that when I put just 2 of my MDBT42Q modules (NRF52832) on air and observed their messages, set at regular 1000ms interval, that actually the transmissions were drifting into each other! In approx. 30s the transmissions drifted to be on top of one another and caused a clash and loss of message. This seemed strange since the modules are identical and there is nothing else on the network. I would expect that each device would transmit at the same time during each interval i.e. with same relative offset from one-another.

This seems like some kind of oscillator issue and I came across the nRF52832 Errata doc at https://infocenter.nordicsemi.com/topic/errata_nRF52832_Rev2/ERR/nRF52832/Rev2/latest/err_832.html and discover there are many known bugs/issues with the hardware, including several oscillator issues. In particular anomaly [77] seems pretty major (error of -25% to + 40%) and would explain what I've been observing. This anomaly is evident in earlier series of the nRF528x and nRF5281x series (which explains why my other device, a smart watch with nRF51822 also experienced the issue).


3.18 [77] CLOCK: RC oscillator is not calibrated when first started This anomaly applies to IC Rev. Revision 2, build codes CIAA-Ex0, QFAA-Ex0, QFAB-Ex0. It was inherited from the previous IC revision Revision 1.

Symptoms The LFCLK RC oscillator frequency can have a frequency error of up to -25 to +40% after reset. A +/- 2% error is stated in the Product Specification.

Conditions Always.

Consequences The LFCLK RC oscillator frequency is inaccurate.

Workaround Calibrate the LFCLK RC oscillator before its first use after a reset.


For users with only 1 or 2 devices on the MySensors network, and low message rates, you would likely never come across this as an issues, but if you have a larger number of devices and heavy traffic, it seems only a matter of time before multiple messages overlap and seems to start causing the lock up of the devices.

I looked at my settings in Arduino for programming and the oscillator for "Low Frequency Clock:" was set as Crystal Oscillator. When I tried the RC Oscillator, I observed the same behaviour with the message drift (as expected). But when I switched to the final option of synthesized for the oscillator, the messages now transmit regularly and no longer drift into each other.

image

I've been running for many hrs now without any re-occurrence of the lock ups (usually observed within 15-20 min), with 5 of the devices pumping out the messages every 1s and everything looks rock solid and no drift. I'll try and get a long test running to see how it goes.

@d00616 or others maybe know enough to add in this calibration fix?

I suppose this won't explain why messages clashing should lock up the nRF5 devices, but should help make devices more stable on MySensors network if they can be controlled to issue messages in a more reliable time slot, or maybe the oscillator timings do have some further effect on the reliability of messaging?

Also it seems that nRF52840 device doesn't have these kind of issues, but I don't believe this device is supported in MySensors yet?

phodgers commented 3 years ago

I see in MyHwNRF5.cpp the setting up the High Frequency Clock, so after that is done then the Low Frequency clock would be calibrated? https://github.com/mysensors/MySensors/blob/d88506fbc62dac72d6b8b08c1fa3e179143c4542/hal/architecture/NRF5/MyHwNRF5.cpp#L91-L96 According to the last part of https://infocenter.nordicsemi.com/index.jsp?topic=%2Fcom.nordic.infocenter.sdk5.v15.0.0%2Fhardware_driver_clock.html you can set up a periodic calibration event with a few lines of code, but I'm not sure how to integrate that into the existing MySensors library.

d00616 commented 3 years ago

@phodgers These are good news. Thank you for finding this issue. This can be the main reason for the locking issue but maybe there are other reasons. I think its a good idea to implement your timeout.

I think the right place to fix the LFCLK it is the arduino-nrf5 project. When the LFCLK is enabled the calibration process must be initialized.

It looks like the calibration can be handled via interrupt handler: https://infocenter.nordicsemi.com/topic/ps_nrf52840/clock.html?cp=4_0_0_4_3_1_2#calibration_timer

d00616 commented 3 years ago

Now, I can reproduce the error. And, with platform.io I can debug it.

The CPU jumps endless in https://github.com/mysensors/MySensors/blob/development/hal/transport/NRF5_ESB/driver/Radio_ESB.cpp#L696

Register states are: NRF5_RADIO_TIMER->EVENTS_COMPARE[1] == 1 NRF5_RADIO_TIMER->EVENTS_COMPARE[3] == 0 NRF5_RADIO_TIMER->CC[1] == 0x160 // stop waiting for ack NRF5_RADIO_TIMER->CC[3] == 0x550 // retransmit time

The Compare_clear3 and compare_stop3 shortcuts are enabled. That means the timer starts again, if compare[3] is reached.

I have seen the NRF5_RADIO_TIMER->EVENTS_COMPARE[3] condition, but not in the failure case.

There are PPI, shortcut and timer logic. I think the Compare[3] event has a major role. Maybe the interrupt isn't triggered in some cases or never (https://github.com/mysensors/MySensors/blob/development/hal/transport/NRF5_ESB/driver/Radio_ESB.cpp#L195) but enabling CC[3] stops sending all messages.

I think everything around the timer needs an review.

d00616 commented 3 years ago

I have a fix. A "NRF5_RADIO_TIMER->TASKS_CLEAR = 1;" in the interrupt routine should be enough.

The fix is available https://github.com/mysensors/MySensors/pull/1445

@phodgers Please give some feedback on the PR page. Thank you.