Closed rbaron closed 1 year ago
@MJDSys, @oleo65, this PR will hopefully help us investigate what's going on. I defined some code path debug counters that are dumped during boot:
Dumping debug counters:
- boot: 3
- double_reset: 1
- main_finish: 2
- steering_success: 1
I also added counters for krknwk_12017_reset
(from #126) and steering_watchdog_restart
(from #130). It would be cool to see any of these being hit after a few weeks.
So far I've had a b-parasite flashed with #126 running for over a month (HA + SkyConnect + ZHA), and it's still going strong. I will keep this PR updated with my findings. Please also ping this thread if you see something weird 😬
Unfortunately the issue from https://github.com/rbaron/b-parasite/issues/113#issuecomment-1484903062 is still happening :(
Two weeks ago I deployed two b-parasites with the sample from this PR, named 1.2.0-counters2
and 2.0.0-counters1
. Oddly they both stopped working within 24h from each other. Debugging notes:
First deployed on 2023-06-06.
Stopped responding a couple of days ago (2023-06-15):
Battery dead.
Counters after connecting to ext power:
[00:00:00.068,481] <inf> app: - boot: 210
[00:00:00.082,916] <inf> app: - double_reset: 81
[00:00:00.097,259] <inf> app: - main_finish: 209
[00:00:00.116,027] <inf> app: - steering_failure: 76
[00:00:00.134,979] <inf> app: - steering_success: 9
[00:00:00.153,778] <inf> app: - steering_watchdog_restart: 18
There's a concerning number of boots and double resets. But my theory is that once the battery gets super low it gets into a reboot loop, so the problem may be before this. There's also an unrelated problem with this, which is we're going to possibly trigger the double reset wipe whenever the battery gets near empty, which we need to address.
First deployed on 2023-06-06.
Stopped responding a couple of days ago (2023-06-15):
Battery dead.
Counters after connecting to ext power:
[00:00:00.026,855] <inf> app: - boot: 8
[00:00:00.032,226] <inf> app: - main_finish: 7
[00:00:00.038,146] <inf> app: - steering_failure: 18
[00:00:00.043,487] <inf> app: - steering_success: 11
[00:00:00.049,316] <inf> app: - steering_watchdog_restart: 1
The plot thickens. Not a lot of reboots now, and not a single double reset. Still, drained battery. A single steering_watchdog_restart
from #130.
krknwk_12017_reset
trigger from #126 (long shot: maybe we reset the device before counter is flushed to fs?)prstlib
for example that would put the device on a loop, quickly draining the battery? Unlikely, since we use the same functions as in the ble sample, which works reliablyzb_reset()
on the watchdog handler (instead of user_input_indicate
, which only restarts the procedure if not commissioned)? I implemented the following debugging changes in the zb-debug++ branch:
fs_sync
before fs_close
just to be sure we write to flash before resetting in (in case krknwk_12017_reset
is triggered and we don't see it now)I've reflashed and redeployed both sensors. Let's see.
Thanks @rbaron for your detailed insights. I can confirm some of your described behaviour.
Especially with low battery, I observed, that the parasite tries to connect to the network and as soon as the steering process starts it reboots, assumingly due to low power. I suspect that there is a short high power draw during steering which cannot be met by the low battery and thus the device reboots.
I also additionally struggled with the double reset to be triggered either involuntarily by me or by the device itself. Especially in the low power state from above, this might lead to a circle of reboots, reprovisioning and even higher power draw. So disabling the double reset might be a good idea and maybe for the v2 devices one could implement something like a 5 or 7 seconds continuous press on the reset button to factory reset the device. This is highly unlikely to be triggered by the software itself I suspect. 😉
I am not sure if the zb_reset()
approach is right. During implementation I tried to understand what is going on in ZBOSS with the various methods, but then sticked to the method described in the docs. To me it seems that ZBOSS is somewhat "obscure" what it is really doing and sets some flags here and there I did not fully understand. 🤷
100% agree on the factory reset approach with the button. There's even a register_factory_reset_button()
that does exactly that (although I think it's coupled with the dev kits).
For the zb_reset()
, I'm also not 100% sure. Just seeing if we gain some extra insight, but it's such a tricky thing to debug if it takes two weeks to reproduce. I'm really confused since both b-parasites from this branch stopped within 24h (but not at the same time!), and an older one (from #126) is still going strong...
Just some more feedback from my end. I had a sensor which was disconnecting the network. I suspect a mix of low battery, high temperature variations (sensor is outside) and weak reception on the edge of the network.
In all cases it reconnected to the network after a varying amount of time. In most cases it took still too long for my expectations but no manual intervention was required, which definately is a big plus. It is kind of buried below in a jungle of leaves... 😉
So I am eager to assume that the implemented steering restart mechanisms are helpful. The used firmware is with both #126 and #130 implemented.
Merging this to push another PR on top of it.
In the effort to increse reliability of the Zigbee sample, https://github.com/rbaron/b-parasite/pull/126 and https://github.com/rbaron/b-parasite/pull/130 introduced handling for possibly rare, hard-to-debug events that may be causing instability. While these changes seem to help in practice, we don't know exactly how often, if ever, they are triggered.
This PR introduces
debug_counters
in hope to add some visibility in these hard-to-debug, rare conditions.Calling
prst_debug_counters_increment("some_counter")
will increment a value stored in flash. These values are then dumped via logging when the device boots. The goal is to instrumentalize some code paths and find out how often they are triggered.Moving forward we can also consider exposing these values via a ZigBee cluster for convenience.