pimoroni / enviro

MIT License
101 stars 79 forks source link

Suspect that Enviro Grow is sometimes not shutting down and hence not waking with 0.0.8 #87

Open retropangy opened 1 year ago

retropangy commented 1 year ago

Several times since installing 0.0.8 I've found that uploading has stopped and on checking the activity light on the Grow is permanently on. On doing a reset the board resumes activity. Checking the logs nothing is apparent except that the board apparently went into shutdown with an RTC wake set for 15 minutes later but never woke up - going by the activity light I suspect it never actually shut down properly and so wasn't in a position to wake up. Log of one such situation below - note the wake reason was me resetting the board when I noticed the permanent activity light - but I've seen it perhaps 4 or so times.

2022-09-11 12:09:04 [info     /  95kB] > going to sleep
2022-09-11 12:09:04 [debug    /  93kB]   - clearing and disabling timer and alarm
2022-09-11 12:09:05 [info     / 113kB]   - setting alarm to wake at 12:15pm
2022-09-11 12:09:05 [info     / 110kB]   - shutting down
2022-09-11 14:41:45 [debug    / 114kB] > performing startup
2022-09-11 14:41:45 [info     / 126kB]   - wake reason: external_trigger
2022-09-11 14:41:45 [debug    / 124kB]   - turn on activity led
retropangy commented 1 year ago

Looking at the log it looks like I may have done a manual poke just before the problem occurred, hence it's not actually set for 15 minutes later - I think that’s coincidence, because I’ve had the issue occur in the middle of the night when there was definitely no manual triggering.

sjefferson99 commented 1 year ago

I have also experienced exactly this a few times on the weather on 0.0.8. I haven't had time to dig into it, so can't say more for now than this affects more than one type of board.

retropangy commented 1 year ago

Another failure last night - same symptoms, except this time an upload to Adafruit had also failed so the red flashing LED stayed on until I noticed this morning as well. No manual trigger until I pressed reset and poked the button this morning.

2022-09-12 18:15:19 [error    /  98kB] ! failed to upload '2022-09-12T18:15:07Z.json' to adafruit_io
2022-09-12 18:15:19 [error    /  96kB] ! reading upload failed
2022-09-12 18:15:19 [info     /  94kB] > going to sleep
2022-09-12 18:15:19 [debug    /  91kB]   - clearing and disabling timer and alarm
2022-09-12 18:15:19 [info     /  89kB]   - setting alarm to wake at 18:30pm
2022-09-12 18:15:19 [info     /  87kB]   - shutting down
2022-09-13 06:21:43 [debug    / 114kB] > performing startup
2022-09-13 06:21:43 [info     / 126kB]   - wake reason: button
2022-09-13 06:21:43 [debug    / 125kB]   - turn on activity led

Don't know what the issue is but it's making the device too unreliable to rely on right now.

retropangy commented 1 year ago

Just had the same issue again; LED lit permanently, no waking up - not bothering posting logs as no real info and the same as before. I've just switched batteries just in case there's a low voltage issue.

sjefferson99 commented 1 year ago

OK, my issue is different, although presents very similarly. I will see if I can look into it further (as you say the logs as they stand are not very detailed, so may need to add some more) and can see if it needs a separate issue if it's a different cause for a similar crash.

Over two days after a good number of hours it has hung twice, but you can see my issue (on the weather board) is never waking after shutdown. Although the same as yours, the board does hang with the white LED solidly on and the poke button does nothing, these logs are after pressing reset. I've then almost immediately connected to USB to extract logs, it's unknown if that reset would continue working, but I see the wifi connect pulsing and then stop as normal, but I connect the USB before my 5 minute poll initiates, the wakeup in the log is due to USB connect.

Unlike a crash I saw in another issue #79 , this is not the RTC stopping as the time is correct in the logs at reset.

2022-09-13 01:05:02 [debug / 118kB] > performing startup 2022-09-13 01:05:02 [info / 124kB] - wake reason: unknown 2022-09-13 01:05:02 [debug / 122kB] - turn on activity led 2022-09-13 01:05:03 [info / 113kB] > going to sleep 2022-09-13 01:05:03 [debug / 111kB] - clearing and disabling timer and alarm 2022-09-13 01:05:03 [info / 108kB] - setting alarm to wake at 01:10am 2022-09-13 01:05:03 [info / 106kB] - shutting down 2022-09-13 17:16:53 [debug / 118kB] > performing startup 2022-09-13 17:16:53 [info / 124kB] - wake reason: unknown 2022-09-13 17:16:53 [debug / 122kB] - turn on activity led 2022-09-13 17:16:54 [info / 113kB] > going to sleep 2022-09-13 17:16:54 [debug / 111kB] - clearing and disabling timer and alarm 2022-09-13 17:16:54 [info / 108kB] - setting alarm to wake at 17:20pm 2022-09-13 17:16:54 [info / 106kB] - shutting down 2022-09-13 17:17:13 [debug / 118kB] > performing startup 2022-09-13 17:17:13 [info / 124kB] - wake reason: unknown 2022-09-13 17:17:13 [debug / 122kB] - turn on activity led 2022-09-13 17:17:14 [info / 112kB] > going to sleep 2022-09-13 17:17:14 [debug / 110kB] - clearing and disabling timer and alarm 2022-09-13 17:17:14 [info / 108kB] - setting alarm to wake at 17:20pm 2022-09-13 17:17:14 [info / 106kB] - shutting down 2022-09-13 17:17:14 [debug / 104kB] - on usb power (so can't shutdown) halt and reset instead

retropangy commented 1 year ago

I suspect it’s the same issue - I’m only guessing that the board isn’t shutting down correctly and that’s what’s causing it not to wake based on the LED staying lit. It may be an issue with the wake itself. Either way, it’s making the board too unreliable to use for automatic watering.

sjefferson99 commented 1 year ago

Sorry I misread your first log post; This is the same issue.

I added a log line just after pull sys_en low to sleep and it does not execute, so this suggests the board stops processing, but as the activity LED is powered by the Pico and not the RTC and this is on every time I find it, it does suggest that "hold_vsys_en_pin.init(Pin.IN)" doesn't always shut the board down fully, as you've suggested. Poking doesn't wake the board so it's likely RTC can't wake it either as they both should from the description page (I might be able to confirm that by applying an appropriate voltage on that pin when hung).

I don't know anything about the way it shuts itself down to deep sleep beyond that line of code does it, hopefully it's documented and some reason it might not be reliable jumps out... I'm collecting logs and times of start -> hang from influx to see if any kind of pattern emerges as I change one variable at a time like sample frequency, upload frequency etc.

retropangy commented 1 year ago

I've had a look at the 0.0.8 init.py and compared it to the earlier version that didn't seem to have this issue. The code is quite a bit different but from what I can tell one difference in the shutdown code is that the earlier code clears both the timer and the alarm RTC flags and the newer code says in the log that it's doing both but only clears the alarm RTC flag. I have no idea if this would make any difference at all.

Original:

  # make sure the rtc flags are cleared before going back to sleep
  logging.debug("  - clearing rtc alarm flags")
  board.rtc.clear_timer_flag()
  board.rtc.clear_alarm_flag()

0.0.8:

  # make sure the rtc flags are cleared before going back to sleep
  logging.debug("  - clearing and disabling timer and alarm")
  rtc.clear_alarm_flag()
MrDrem commented 1 year ago

I think that this may be due to a switch in the code from being timer based to being clock based.

The previous timer set the clock to wake in 5 minutes when set to a reading every 5 minutes (so first reading at 14:02, next would be set for 5 minutes time at 14:07, poke at 14:10 and the next reading would be set to 14:15, etc) The new timer sets the clock to wake at a set time (so 14:00, 14:05, 14:10, etc), and adds pokes as additional readings.

retropangy commented 1 year ago

Additional data point - I was concerned that it might be a battery issue, but I had the same symptoms overnight with a set of three fresh batteries. Last upload at 3am and then nothing until I reset five minutes ago, activity light was locked on.

sjefferson99 commented 1 year ago

I moved to a bench supply at 3.7v into the battery connector in case batteries became an issue with all this extra logging and sending. I also had a crash after 20 hours with a 5 minute poll uploading every 12 entries. I've switched it back to 1 minute poll and upload every 10 entries to see if I get consistent failures quicker such as every 2-3 hours as I have had on occasion before.

I took the opportunity to measure the voltages on each of the pins in the hung state, I was also going to try pulling the various interrupts to 0 to see if I could wake the board, but given it looks like it's not shutdown from the voltages below I doubt that would have worked. Unfortunately my voltmeter tripped the reset when I measured the run pin, so I haven't been able to test the IRQ lines low yet.

If anyone has a pinout that would be great, but I've mapped the pin functions from the constants file below.

vsys: 3.7v 3v3out: 3.3v GP2 (HOLD_VSYS_EN_PIN): 3.3v 3v3_EN: 2.2v GP5 (I2C_SCL_PIN): 3.7v

All other pins 0v with the exception of physical pins 31-35, I didn't get to test after hitting run with the voltmeter and triggering a reset.

What's odd is the activity light was on solidly so I should have had a voltage on GP6, which throws a bit of doubt into me getting all the voltages, but I double checked the above that were non-zero and I'll do it again more thoroughly and without hitting run on the next crash. Might get a scope on the I2C clock pin next time and see if it's stopped high or averaging 3.7.

Feels like the board has stopped processing but not fully shutdown and that 2.2v on the enable pin is very suspect. Without a pinout of the enviro and a much better understanding of the Pico components I wouldn't like to guess what's causing that.

sjefferson99 commented 1 year ago

Lasted 2 hours on the 1 minute logging setting, which is consistent with a previous couple of findings. It's a small sample set, but anecdotally it looks like the more often you take readings the quicker you hit the issue.

Below are corrected voltages making better contact with the pins. Looking at the power supply circuit diagram, it seems reasonable that 3v3_en would be 2.7 given the voltage divider the pull up resistor makes, so I think this is a pretty normal set of voltages for a board that is on but either halted or stuck in a loop somewhere.

GP2, 6 and 8 = 3.3v GP 3, 4 and 5 = 2.7v vsys, 3v3 = 3.3v 3v3_en = 2.7v

I tried shorting GP8 (rtc_alarm) to ground and got nothing but slightly more current consumed. Same for shorting gp2 (hold_vsys_en) to ground. Shorting 3v3_en to ground shut down the board and the RTC woke it up shortly after (more like 10 seconds than a minute, not sure if that tells us anything about what the code was doing when it hung and resumed). It's been running fine since while I write this.

I am not sure as to why the shutdown is switching GP2 to input mode (with no pullup defined). I assume it's to be a current drain and pull the pin to 0 and keep it there while the board is off and the rtc alarm wakes the board and resets and the first line of code executed appears to be set to out/true so set that pin high. I guess the various interrupts trigger their pi low and the 3v3_en high to power the board and that first line of code latches it.

I assume setting the pin to IN doesn't always quite work and the core stops without the drain being set and waking the board up more frequently increases the probability. Or it's something else entirely,

I don't have it in me to try and work out if you can attach a debug interface and see where it halts or confirm the best design for triggering a deep sleep from the Pico itself. So I hope someone else comes up with something to make it so my board runs for more than a day. Do tag me in here if you want me to test anything if you can't reproduce this.

sjefferson99 commented 1 year ago

As release v.0.0.9 (https://github.com/pimoroni/enviro/releases/tag/v0.0.9) claims to have reverted to a more stable firmware base, I've pulled down the full uf2 from that release and flashed it with the aggressive polling config above to see if this issue has been addressed, despite not being specifically mentioned in the release notes. So far it's nearly 3 hours with no crash, so it's looking encouraging, I'll keep this running for a day or so and see how we go.

I used the https://github.com/pimoroni/enviro/releases/download/v0.0.9/pimoroni-picow_enviro-v1.19.10-micropython-v0.0.9.uf2 download as the link in the release notes text gives 404 (typo @ZodiusInfuser ?). Flashed, ran through the onboarding wizard and then hooked up Thonny to modify the polling config to once per minute and upload every 5. Uploading to a local InfluxDB and seeing data as expected.

sjefferson99 commented 1 year ago

Using the above as a call for Mr Murphy to expediate testing was successful. Board crashed as before 20 minutes later, total run time 2h55m.

ZodiusInfuser commented 1 year ago

Well then... thanks for testing anyway 😭 Any chance you can post the log for this time? v0.0.9 has a few additional log messages that could be helpful in diagnosing this problem.

sjefferson99 commented 1 year ago

Log file attached. The board had hung with a solid white LED on. I disconnected the power to the battery connector and then reattached the USB to pull using Thonny, the time stamps for all activities appear accurate. log.txt

ZodiusInfuser commented 1 year ago

Hmm. I'm seeing the moment when the issue occurred, but no useful logging to help with diagnosis 😢

sjefferson99 commented 1 year ago

It looks very similar to the logs before. It thinks it's waking up from RTC, but it's not, like it never properly shut down. I don't fully understand the way the shutdown is executed with the pin set to IN and if there is any external circuitry that affects how this works as no diagrams are available.

I can only hope the developers can come up with a fix or further tests, as I can't take this further on my own and it makes the board unusable as it stops every ~24 hours with a more sensible polling configuration. I'm not sure why this isn't affecting everyone unless it's some kind of component tolerance causing varied behaviour on the shutdown.

ZodiusInfuser commented 1 year ago

I'm not sure if it was always there, but you can find the schematic linked to on the product page: https://cdn.shopify.com/s/files/1/0174/1800/files/enviro_grow_schematic.pdf?v=1664452062

The basic idea is that a trigger event (such as the RTC) enables power to the board (VSYS_EN)for long enough that the program can set the HOLD_VSYS_EN pin to high to keep the board alive. This used to be done in the python code, but that took too long to trigger, so it is now done earlier on by this magic that I'm yet to understand: https://github.com/pimoroni/pimoroni-pico/tree/main/micropython/modules/wakeup Then to go to sleep the pin is set to an input. My understanding is that this is better than just setting the pin to low, but I'd have to check with a colleague to get a more detailed explanation.

I can only hope the developers can come up with a fix or further tests, as I can't take this further on my own and it makes the board unusable as it stops every ~24 hours with a more sensible polling configuration. I'm not sure why this isn't affecting everyone unless it's some kind of component tolerance causing varied behaviour on the shutdown.

This is why I've been pulled in to sort the issues out with Enviro. I don't think it is just you that is experiencing it, as there are some reports of Enviro Urban boards locking up in similar ways too. I thank you for your patience as we try and work through this.

ZodiusInfuser commented 1 year ago

I set up an Enviro Grow logging locally and have seen the solid white light issue two times so far. Ended up resetting the board in my attempt to probe the issue, whoops.

I'm not entirely sure it's an issue with the board shutting down, but rather with board waking up. I have no more insight at this time though.

sjefferson99 commented 1 year ago

@ZodiusInfuser I installed the latest firmware (main branch as of this post date) as I saw a number of changes have been made, but running at my poll every minute upload every 2 config to bring about the crash quicker it failed after 1 hour and 50 minutes.

I am confused how other people are getting any use out of the board, I can't make mine last more than a couple of days on any firmware, is this a hardware fault or is no one able to use theirs for the intended purpose?

sjefferson99 commented 1 year ago

I have done a successful test using the USB power for around a week with no faults, so the unit actually works fine if you can supply permanent power via USB instead of the battery connector. I will be doing this in situ with a small lead acid/solar set up. If any progress is made on the shutdown bug, I will be happy to test, until then I will power permanently. I may look into a continuous reading and reporting average approach to capture more detail as power will no longer be a concern.

sjefferson99 commented 1 year ago

@ZodiusInfuser I applied the networking code suggested in #173 and I believe the wifi disconnect block in the upload readings section may have resolved my issue of the board hanging. I have run it all night without issue on similar configuration and brought it inside to precisely replicate the config that failed after a couple of hours before and it has continued without fault for 5 hours. I am going to move the unit to a battery connection with USB charger to test this further and hopefully resolve some of the temperature issues as well.