rbaron / b-parasite

🌱💧 An open source DIY soil moisture sensor

Implement workaround for errata KRKNWK-12017 #126

Closed MJDSys closed 1 year ago

MJDSys commented 1 year ago

Nordic has published an erratum for the nRF Connect SDK, affecting versions 1.8.0 and later, where a Zigbee End Device can get stuck if the parent device does not acknowledge the "Device Announcement" packet.

They have a suggested workaround to implement in the SDK, which has been adapted for the custom signal handler used here.
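As a rough illustration of the detect-and-reset idea, here is a small self-contained model in C. It is a sketch under stated assumptions, not the code in this branch and not Nordic's published snippet: the signal names and the `rejoin_guard` struct are hypothetical stand-ins rather than ZBOSS identifiers, and it assumes the stuck state shows up as a parent link failure while no join has been confirmed.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for stack signals; these are NOT ZBOSS names. */
enum guard_signal {
    SIG_STACK_STARTED,       /* stack finished its startup sequence */
    SIG_JOINED,              /* device joined or rejoined the network */
    SIG_LEFT,                /* device left the network */
    SIG_PARENT_LINK_FAILURE  /* parent stopped acknowledging our frames */
};

struct rejoin_guard {
    bool stack_started;   /* true once SIG_STACK_STARTED was seen */
    bool join_confirmed;  /* true while a join is known to have completed */
};

/*
 * Feed every signal through this function.
 * Returns true when the caller should reset the stack or the device.
 */
bool rejoin_guard_on_signal(struct rejoin_guard *g, enum guard_signal s)
{
    switch (s) {
    case SIG_STACK_STARTED:
        g->stack_started = true;
        return false;
    case SIG_JOINED:
        g->join_confirmed = true;
        return false;
    case SIG_LEFT:
        g->join_confirmed = false;
        return false;
    case SIG_PARENT_LINK_FAILURE:
        /* Assumption: a link failure with no confirmed join means the
         * rejoin procedure is stuck and will not recover on its own. */
        return g->stack_started && !g->join_confirmed;
    }
    return false;
}
```

In a real signal handler the `true` return would translate into whatever reset call the SDK recommends; that part is deliberately left out here.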

This is an effort to solve issues where my parasites would occasionally drop off my network and require a reboot. After 24 hours, I've not yet had a device disappear, but it has previously taken weeks before a device would fail. Unfortunately, it's hard to debug the board as the chips are in a low power state when this occurs.

CC: @oleo65 I saw you were having a similar problem in https://github.com/rbaron/b-parasite/issues/113#issuecomment-1484903062, could you also try this branch?

@rbaron I'm not sure this actually fixes my problem, so I understand if you'd prefer to wait a couple weeks before merging. If you have any feedback I'm happy to incorporate that now.

rbaron commented 1 year ago

That's fantastic, thanks for digging this up and following up with the fix.

For added context, here's Nordic's known issues page with the suggested workaround for KRKNWK-12017. 2.3.0 is still affected.

It would explain some of the behavior we're seeing here, so I'm hopeful. I will flash a couple of b-parasites with this branch. I propose we let it roast for a few days and discuss the results here.

oleo65 commented 1 year ago

Thanks @MJDSys for digging this up. I will flash this firmware to some of my parasites today and will observe. 🎉

I am struggling with batteries draining very fast at the moment, and maybe this will fix that as well. I suspect a mix of some software wakelock due to a firmware bug and possibly a bad batch of coin cells. Quite hard to pin down...

MJDSys commented 1 year ago

> That's fantastic, thanks for digging this up and following up with the fix.
>
> For added context, here's Nordic's known issues page with the suggested workaround for KRKNWK-12017. 2.3.0 is still affected.
>
> It would explain some of the behavior we're seeing here, so I'm hopeful. I will flash a couple of b-parasites with this branch. I propose we let it roast for a few days and discuss the results here.

Sounds good, I'll keep an eye on mine too.

> Thanks @MJDSys for digging this up. I will flash this firmware to some of my parasites today and will observe. 🎉
>
> I am struggling with batteries draining very fast at the moment, and maybe this will fix that as well. I suspect a mix of some software wakelock due to a firmware bug and possibly a bad batch of coin cells. Quite hard to pin down...

I also experience battery life issues with my b-parasites. I ended up buying the Nordic power monitor, and I think I have a couple of ideas. I've ordered a new batch of the v2 b-parasites (the plants must grow!) that I'll be using to gather some measurements and post about it in a new issue, but I don't think this PR will solve it.

rbaron commented 1 year ago

Following up on https://github.com/rbaron/b-parasite/pull/126#issuecomment-1541434345, the firmware has been running for a couple of weeks, and it's still connected and working nominally as far as I can tell.

It's hard to say whether we exercised that fix as is, but there's an interesting blip on the collected data points:

[Image: Screenshot 2023-05-28 at 09 21 45]

The temporary drop may be unrelated, but either way it picked back up again with no intervention.

@MJDSys , @oleo65, have you had a similar positive experience with this firmware?

oleo65 commented 1 year ago

I am experiencing mixed results with the firmware. It is fairly stable, but one sensor dropped the connection multiple times, usually after 2 or 3 days, and needed to be power cycled to reconnect.

This might also be related to the power drain issue but I don't have a real idea how to approach this.

[Image: IMG_20230528_094550.jpg]

rbaron commented 1 year ago

@oleo65 what setup do you have? I'm running HA + SkyConnect + ZHA. Are all these 3 boards on your chart running this PR's firmware?

oleo65 commented 1 year ago

My setup is HA + ConBee 2 + ZHA. So the same setup except for the Zigbee stick.

The three sensors are all running the discussed firmware variant. I have around 10 sensors deployed in total, with different firmware revisions.

Some are running a variant I am testing which will manually reinit the join procedure if the connection is dropped and not re-established within a defined period of time. I wanted to gain more insights before discussing it here, but so far it seems promising. The background is that Zephyr only tries to reconnect to the network for a fixed, hardcoded amount of time (roughly 15 minutes). If no connection could be established in that time, then you either need to restart the join procedure in software or power cycle the sensor.
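The reconnect logic described above can be sketched as a small watchdog. This is only a minimal model, not the firmware variant itself: the `rejoin_watchdog` struct, both functions, and the 30 minute `REJOIN_TIMEOUT_MS` value are hypothetical, and the actual call that restarts the join procedure is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical timeout; the thread does not state the value used. */
#define REJOIN_TIMEOUT_MS (30u * 60u * 1000u)

struct rejoin_watchdog {
    bool connected;                  /* last known network state */
    uint32_t disconnected_since_ms;  /* meaningful only while !connected */
};

/* Call whenever the network state changes. */
void rejoin_watchdog_set_connected(struct rejoin_watchdog *w,
                                   bool connected, uint32_t now_ms)
{
    if (w->connected && !connected) {
        w->disconnected_since_ms = now_ms;  /* start the outage clock */
    }
    w->connected = connected;
}

/*
 * Call periodically with a millisecond uptime counter.
 * Returns true when the caller should restart the join procedure.
 */
bool rejoin_watchdog_poll(struct rejoin_watchdog *w, uint32_t now_ms)
{
    if (w->connected) {
        return false;
    }
    /* Unsigned subtraction stays correct across uint32_t wraparound. */
    if ((uint32_t)(now_ms - w->disconnected_since_ms) < REJOIN_TIMEOUT_MS) {
        return false;
    }
    w->disconnected_since_ms = now_ms;  /* re-arm: at most one retry per timeout */
    return true;
}
```

Re-arming after each trigger means a sensor that stays out of range retries once per timeout period rather than on every poll, which seems like the safer default for battery life.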

MJDSys commented 1 year ago

My setup is HA + ConBee2 + Z2M with 4 currently deployed sensors, so a little different again.

So far I've found this to be relatively stable. I had one sensor drop off the network and refuse to reconnect without a power cycle, but so far the other sensors generally stay available (and that one was at least blinking its LED, so it was more noticeable). Before this change I had sensors constantly disconnecting and staying that way. I initially blamed it on the power draw, with the batteries being emptied.

@oleo65 Your comment about Zephyr matches my experience with that one sensor. Is your firmware variant working on top of this change, or in parallel?

oleo65 commented 1 year ago

I deployed the "Zephyr auto reconnect" firmware about 7 weeks ago on different sensors. The next step would then be to combine both approaches into one firmware and try that. 😊

In addition, I disabled most of the LED blinking because I suspected it to be an additional source of power drain if the sensor is in a faulty state but not discovered for, say, one or two days. This happened to me multiple times. Some sensors are deployed below foliage and not easily visible. I was also thinking about a HA automation to create a push notification if a sensor seems to be offline, but I have not created it yet.

I will clean up the auto reconnect code in the next few days and push it to a different branch for discussion. I appreciate all the discussion here and hope we can improve the reliability. My future plans are to automate my irrigation system more by using the soil moisture values as an additional input, but I need them to be more reliable for this. 😊

rbaron commented 1 year ago

Awesome, thanks a lot @oleo65 and @MJDSys.

While this PR may not fix all instabilities, I have been using it successfully for 3 weeks, and it is also what Nordic recommends. I'm going to merge it and kindly ask @oleo65 to rebase #130 so we can easily test both improvements together.

Thanks!