rbaron / b-parasite

🌱💧 An open source DIY soil moisture sensor
1.85k stars 143 forks source link

[ZigBee Sample] Introduce debug counters #133

Closed rbaron closed 1 year ago

rbaron commented 1 year ago

In the effort to increse reliability of the Zigbee sample, https://github.com/rbaron/b-parasite/pull/126 and https://github.com/rbaron/b-parasite/pull/130 introduced handling for possibly rare, hard-to-debug events that may be causing instability. While these changes seem to help in practice, we don't know exactly how often, if ever, they are triggered.

This PR introduces debug_counters in hope to add some visibility in these hard-to-debug, rare conditions.

Calling prst_debug_counters_increment("some_counter") will increment a value stored in flash. These values are then dumped via logging when the device boots. The goal is to instrumentalize some code paths and find out how often they are triggered.

Moving forward we can also consider exposing these values via a ZigBee cluster for convenience.

rbaron commented 1 year ago

@MJDSys, @oleo65, this PR will hopefully help us investigate what's going on. I defined some code path debug counters that are dumped during boot:

Dumping debug counters:
- boot: 3
- double_reset: 1
- main_finish: 2
- steering_success: 1

I also added counters for krknwk_12017_reset (from #126) and steering_watchdog_restart (from #130). It would be cool to see any of these being hit after a few weeks.

So far I've had a b-parasite flashed with #126 running for over a month (HA + SkyConnect + ZHA), and it's still going strong. I will keep this PR updated with my findings. Please also ping this thread if you see something weird 😬

rbaron commented 1 year ago

Unfortunately the issue from https://github.com/rbaron/b-parasite/issues/113#issuecomment-1484903062 is still happening :(

Two weeks ago I deployed two b-parasites with the sample from this PR, named 1.2.0-counters2 and 2.0.0-counters1. Oddly they both stopped working within 24h from each other. Debugging notes:

1.2.0-counters2

First deployed on 2023-06-06.

On 2023-06-19.

Stopped responding a couple of days ago (2023-06-15):

1 2 0-counters2-2023-06-18-stale

Battery dead.

Counters after connecting to ext power:

[00:00:00.068,481] <inf> app: - boot: 210
[00:00:00.082,916] <inf> app: - double_reset: 81
[00:00:00.097,259] <inf> app: - main_finish: 209
[00:00:00.116,027] <inf> app: - steering_failure: 76
[00:00:00.134,979] <inf> app: - steering_success: 9
[00:00:00.153,778] <inf> app: - steering_watchdog_restart: 18

There's a concerning number of boots and double resets. But my theory is that once the battery gets super low it gets into a reboot loop, so the problem may be before this. There's also an unrelated problem with this, which is we're going to possibly trigger the double reset wipe whenever the battery gets near empty, which we need to address.

2.0.0-counters1

First deployed on 2023-06-06.

On 2023-06-19.

Stopped responding a couple of days ago (2023-06-15):

2 0 0-counters1-2023-06-18-stale

Battery dead.

Counters after connecting to ext power:

[00:00:00.026,855] <inf> app: - boot: 8
[00:00:00.032,226] <inf> app: - main_finish: 7
[00:00:00.038,146] <inf> app: - steering_failure: 18
[00:00:00.043,487] <inf> app: - steering_success: 11
[00:00:00.049,316] <inf> app: - steering_watchdog_restart: 1

The plot thickens. Not a lot of reboots now, and not a single double reset. Still, drained battery. A single steering_watchdog_restart from #130.

Other notes

Ideas / theories

rbaron commented 1 year ago

I implemented the following debugging changes in the zb-debug++ branch:

I've reflashed and redeployed both sensors. Let's see.

oleo65 commented 1 year ago

Thanks @rbaron for your detailed insights. I can confirm some of your described behaviour.

Especially with low battery, I observed, that the parasite tries to connect to the network and as soon as the steering process starts it reboots, assumingly due to low power. I suspect that there is a short high power draw during steering which cannot be met by the low battery and thus the device reboots.

I also additionally struggled with the double reset to be triggered either involuntarily by me or by the device itself. Especially in the low power state from above, this might lead to a circle of reboots, reprovisioning and even higher power draw. So disabling the double reset might be a good idea and maybe for the v2 devices one could implement something like a 5 or 7 seconds continuous press on the reset button to factory reset the device. This is highly unlikely to be triggered by the software itself I suspect. 😉

I am not sure if the zb_reset() approach is right. During implementation I tried to understand what is going on in ZBOSS with the various methods, but then sticked to the method described in the docs. To me it seems that ZBOSS is somewhat "obscure" what it is really doing and sets some flags here and there I did not fully understand. 🤷

rbaron commented 1 year ago

100% agree on the factory reset approach with the button. There's even a register_factory_reset_button() that does exactly that (although I think it's coupled with the dev kits).

For the zb_reset(), I'm also not 100% sure. Just seeing if we gain some extra insight, but it's such a tricky thing to debug if it takes two weeks to reproduce. I'm really confused since both b-parasites from this branch stopped within 24h (but not at the same time!), and an older one (from #126) is still going strong...

oleo65 commented 1 year ago

Just some more feedback from my end. I had a sensor which was disconnecting the network. I suspect a mix of low battery, high temperature variations (sensor is outside) and weak reception on the edge of the network.

In all cases it reconnected to the network after a varying amount of time. In most cases it took still too long for my expectations but no manual intervention was required, which definately is a big plus. It is kind of buried below in a jungle of leaves... 😉

So I am eager to assume that the implemented steering restart mechanisms are helpful. The used firmware is with both #126 and #130 implemented.

image
rbaron commented 1 year ago

Merging this to push another PR on top of it.