metro-sign / dc-metro

This is the home for the software that can turn a 64x32 RGB LED matrix into a WMATA Metro sign.

Board freezes periodically #3

Open SmarTripDood opened 3 years ago

SmarTripDood commented 3 years ago

First, thank you for this cool project. It works great except I find it freezes every few hours (both the sign and the LED on the Matrix Portal on the back) and I have to reset it. Has this problem occurred for others and is there a fix?

NBlair52 commented 3 years ago

I just recently made 2 boards (thanks so much!) and noticed I also have this problem. Simply unplugging and plugging back in seems to fix the problem, but is not ideal.

david-beers commented 3 years ago

I don't have a board yet, but have been following as I want to get one eventually but $. @erikrrodriguez maybe has some ideas. They have a fork of this repo with a couple changes to allow multiple stations and a walking distance modifier to ignore trains you won't make in time. Their latest commit says they were attempting to fix the board reset issues.

erikrrodriguez commented 3 years ago

Unfortunately I have not successfully solved this, and haven't been able to pin down why it happens. I have to reset my board every 2 or 3 days. But if I ever figure it out I'll be sure to push the code to my repo!

SmarTripDood commented 3 years ago

Is it possible there is something with WMATA's data feed that causes this? I don't think that's it but wanted to check.

erikrrodriguez commented 3 years ago

Perhaps, but I don't think so. When I've tested on my PC using Python's requests library, the program has run for days with no issues. So I believe it's to do with the board's internal requests library, perhaps getting overloaded.

I've also tried to query MetroHero's API instead of WMATA's, but I can't even get a response using the board's requests library (PC testing again works fine).

condeepadunov commented 2 years ago

My solution to this problem: I plug it into a smart plug and have it turn OFF at 59 minutes into the hour and back ON on the hour. Not necessarily on an hourly basis, but periodically throughout the day. That solves the issue for me.

ghost commented 2 years ago

I am running into this issue also. I don't know how to troubleshoot it. The display gets stuck and stops updating at some point. I've tried adding print statements to help troubleshoot. With serial console open, it stops sending out messages too. If anyone knows better ways to troubleshoot, please let me know.

condeepadunov commented 2 years ago

See the solution above with the smart plug. It's imperfect but it works. Have the thing switch off and on every hour (which is far more often than it freezes) and it'll keep auto restarting and the problem goes away.

ghost commented 2 years ago

Thanks, that is a nice workaround. I am hoping to find a programming fix of the root cause though, assuming it's possible and the firmware or other hardware problem isn't the issue.

dylanjtastet commented 2 years ago

My hunch is a memory leak. The portal has very little memory and the adafruit libraries have become notoriously heavy. I tried to load their version of the datetime library and it immediately crashed the board in a similar fashion.

ghost commented 2 years ago

@dylanjtastet I thought so too. But I am monitoring memory with gc module and it doesn't show a loss in memory. Would the leak be detectable any other way?
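
For anyone who wants to repeat the memory check, here is a minimal sketch of the kind of logging described above. The helper names `free_bytes` and `log_memory` are made up for illustration. It assumes `gc.mem_free()` is available, which is the case on CircuitPython and MicroPython but not on desktop Python, so the helper returns `None` there.

```python
import gc


def free_bytes():
    """Return free heap bytes after a collection, or None if unsupported."""
    gc.collect()
    # gc.mem_free() exists on CircuitPython/MicroPython, not on desktop CPython.
    mem_free = getattr(gc, "mem_free", None)
    return mem_free() if mem_free is not None else None


def log_memory(samples, label=""):
    """Record the current reading in `samples` and print it to the console."""
    reading = free_bytes()
    samples.append(reading)
    print("free bytes", label, reading)
    return reading
```

Calling `log_memory(samples, "before fetch")` and `log_memory(samples, "after fetch")` around each refresh gives a series to compare over time. A flat series suggests no leak in total free bytes, but it probably would not reveal heap fragmentation, since this number only reports the total.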

SmarTripDood commented 2 years ago

I'm using a smart plug, but I really think the only permanent solution is to set this up on different hardware. There are a number of similar boards out there that do the same thing, using Raspberry Pi.

ScottKekoaShay commented 1 year ago

I tried various cords and plugs just for kicks, and mine does the same thing--just craps out usually within an hour, but sometimes it lasts longer. I had it connected to my computer to see the console, and this is what I get:

Retrieving data...Received response from WMATA api...
Reply received.
Successfully updated.
Refreshing train information...
Retrieving data...Traceback (most recent call last):
  File "code.py", line 25, in <module>
  File "train_board.py", line 41, in refresh
  File "code.py", line 22, in <lambda>
  File "code.py", line 15, in refresh_trains
  File "metro_api.py", line 17, in fetch_train_predictions
  File "metro_api.py", line 23, in _fetch_train_predictions
  File "adafruit_portalbase/network.py", line 518, in fetch
  File "adafruit_requests.py", line 823, in get
  File "adafruit_requests.py", line 679, in request
  File "adafruit_esp32spi/adafruit_esp32spi_socket.py", line 138, in recv
  File "adafruit_esp32spi/adafruit_esp32spi_socket.py", line 210, in available
  File "adafruit_esp32spi/adafruit_esp32spi.py", line 776, in socket_available
  File "adafruit_esp32spi/adafruit_esp32spi.py", line 332, in _send_command_get_response
  File "adafruit_esp32spi/adafruit_esp32spi.py", line 299, in _wait_response_cmd
  File "adafruit_esp32spi/adafruit_esp32spi.py", line 278, in _wait_spi_char
TimeoutError: Timed out waiting for SPI char

Code done running.

So I think it is an issue in the library. Unfortunately, that's beyond my knowledge, but perhaps someone who has a deeper understanding can figure out a solution (esp. one that doesn't involve actually updating the library). I am thinking it should be possible to catch the error and then have the thing restart itself, if nothing else? But I wasn't able to do that. It doesn't seem to be able to recover from the error gracefully. If anyone can figure it out, let me know!
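
A rough sketch of the catch-and-restart idea, not a tested fix. The wrapper name `refresh_or_reset` is hypothetical. On the board, `refresh` would be something like the `refresh_trains` function from the traceback, and `reset` could be CircuitPython's `microcontroller.reset`. Catching `RuntimeError` as well as `TimeoutError` is an assumption, since different library versions may raise different errors. This only helps when the error is actually raised; it cannot recover a board that hangs without raising.

```python
def refresh_or_reset(refresh, reset, errors=(TimeoutError, RuntimeError)):
    """Run one refresh; on a listed error, call reset() and return False."""
    try:
        refresh()
        return True
    except errors as exc:
        # Log the failure before restarting so it shows up on the serial console.
        print("Refresh failed:", repr(exc))
        reset()
        return False
```

On hardware this might be used as `refresh_or_reset(refresh_trains, microcontroller.reset)` inside the main loop, after `import microcontroller`.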

dylanjtastet commented 1 year ago

I'm going to try using @erikrrodriguez's fork. Digging through the network library, it looks like there's a bit of redundancy in using the adafruit_requests library for HTTP, as adafruit_esp32spi_wifimanager.ESPSPI_WiFiManager already provides this API. It also uses the requests library, which is a global singleton, so the network library is creating redundant sockets and re-initializing the requests library.

This is where I stopped digging, but my hunch now is that all of this is crashing the wifi coprocessor. Erik's code will also reset the coprocessor if the board is failing requests which should keep things turning if all else fails.
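
To illustrate the "reset the coprocessor when requests keep failing" idea, here is a sketch under stated assumptions. It is not the code from Erik's fork. The class name `FailureTracker` and the threshold of 3 are invented for the example, and `reset_coprocessor` stands in for whatever call resets the ESP32 on your setup (for instance the `reset()` method of an `adafruit_esp32spi` `ESP_SPIcontrol` object, if you have one at hand).

```python
class FailureTracker:
    """Count consecutive failed requests and trigger a reset at a threshold."""

    def __init__(self, reset_coprocessor, threshold=3):
        self.reset_coprocessor = reset_coprocessor
        self.threshold = threshold
        self.failures = 0

    def record(self, ok):
        """Record one request result; return True if a reset was triggered."""
        if ok:
            # Any success clears the streak.
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.threshold:
            self.reset_coprocessor()
            self.failures = 0
            return True
        return False
```

The main loop would call `tracker.record(True)` after a successful fetch and `tracker.record(False)` after a failed one, so a single dropped request does not cause a reset but a run of them does.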

erikrrodriguez commented 1 year ago

Thanks @dylanjtastet I was going to ping and also recommend @ScottKekoaShay try my fork.

I will admit that my board sometimes also freezes, and I haven't been able to discover why. It seems like it hangs after the request is made and ignores the timeout while waiting for a response. So I think it is ultimately still an issue in the adafruit_requests library.

But my board has been running for the past 4 days without me needing to manually reset it 🎉

ScottKekoaShay commented 1 year ago

Thanks, I will give it a try as well!

ghost commented 1 year ago

I forgot to update this when I resolved my issue. For me, on the Matrix Portal M4, the root issue turned out to be the CircuitPython 7.3 firmware, which had a buggy DMA feature that caused SPI failures of all kinds. Dan Halbert, one of the core developers of CircuitPython, helped track it down and fixed it. The fix should be included in later versions, so if your board has old firmware, try flashing a new version.
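
A quick way to see which firmware a board is running is to check `sys.implementation` from the serial console. The sketch below wraps that in a hypothetical helper, `firmware_is_at_least`. The default minimum of `(8, 0, 0)` is an assumption based on people in this thread reporting success on version 8; the exact release that contains the fix is not stated here.

```python
import sys


def firmware_is_at_least(minimum=(8, 0, 0), version=None):
    """Compare a (major, minor, micro) version against a minimum."""
    if version is None:
        # On CircuitPython this is the firmware version; on a PC it is the
        # Python interpreter version, so only trust it on the board.
        version = sys.implementation.version
    return tuple(version[:3]) >= tuple(minimum)


print(sys.implementation.name, tuple(sys.implementation.version[:3]))
```

Running `firmware_is_at_least()` on the board returns `False` for a 7.3 install, which would be a hint to reflash.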

erikrrodriguez commented 1 year ago

Makes sense, I'm currently using Circuit Python 8 as of the other week. Thanks for getting in touch with Dan, I'm glad he could push a fix!

Edit: Something else I did was update the ESP firmware separate from Circuit Python using this guide: https://learn.adafruit.com/upgrading-esp32-firmware/upgrade-all-in-one-esp32-airlift-firmware

dylanjtastet commented 1 year ago

Yep makes sense, I was getting the same issue after switching to @erikrrodriguez's fork. Will try updating firmware now.

SmarTripDood commented 1 year ago

Upgrading firmware to 8 and the latest files from @erikrrodriguez's fork solved it for me -- many thanks!

ScottKekoaShay commented 1 year ago

Updating the ESP firmware did the trick for me--mine's been running several days without crapping out. Thanks for the tips @erikrrodriguez !