LoPy4 with PyMesh continually faulting with core dump

ajmilford commented 3 years ago

After adding a PyCom board to a Project on pybytes with PyMesh

Board updates to Firmware: 1.20.2.rc11 PyBytes: 1.5.2

But now continually reboots with core dumps:

Rebooting... ets Jun 8 2016 00:22:57

rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT) configsip: 0, SPIWP:0xee clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00 mode:DIO, clock div:1 load:0x3fff8020,len:8 load:0x3fff8028,len:2128 load:0x4009fa00,len:19760 entry 0x400a05bc WMAC: 3C71BF877A08 Firmware: 1.20.2.rc11 Pybytes: 1.5.2 Initialized watchdog for WiFi and LTE connection with timeout 1260000 ms WiFi connection established Connected to MQTT mqtt.pybytes.pycom.io Pybytes connected successfully (using the built-in pybytes library) Guru Meditation Error: Core 1 panic'ed (LoadStoreError). Exception was unhandled. Core 1 register dump: PC : 0x40094ab0 PS : 0x00060c30 A0 : 0x8021e690 A1 : 0x3ffdaab0 A2 : 0x3ffdab9a A3 : 0x40000000 A4 : 0x3fffff3e A5 : 0x3ffdab50 A6 : 0x3ffdaab0 A7 : 0x00000000 A8 : 0x00000000 A9 : 0x3ffdab08 A10 : 0x000000ff A11 : 0x3ffdaa80 A12 : 0x3ffdab50 A13 : 0x3ffdaa88 A14 : 0x00000002 A15 : 0x3ffdacb0 SAR : 0x00000002 EXCCAUSE: 0x00000003 EXCVADDR: 0x40000000 LBEG : 0x40094b34 LEND : 0x40094b62 LCOUNT : 0x00000000

ELF file SHA256: 0000000000000000000000000000000000000000000000000000000000000000

Backtrace: 0x40094ab0:0x3ffdaab0 0x4021e68d:0x3ffdaad0 0x4021e9ce:0x3ffdaaf0 0x4021ec09:0x3ffdab10 0x4021ecac:0x3ffdab30 0x40203f7a:0x3ffdab50 0x4020bdfe:0x3ffdac80 0x40200185:0x2a080000

Guru Meditation Error: Core 1 panic'ed (IntegerDivideByZero). Exception was unhandled. Core 1 register dump: PC : 0x40090ea3 PS : 0x00060033 A0 : 0x80090ab1 A1 : 0x3ffda970 A2 : 0x3ffda9f0 A3 : 0x00000000 A4 : 0x00002800 A5 : 0x0000002c A6 : 0x0000002c A7 : 0x00051fe0 A8 : 0x00000000 A9 : 0x3ffda960 A10 : 0x95c4a6a8 A11 : 0x00000000 A12 : 0x00000018 A13 : 0x3ffda988 A14 : 0x3ffc173f A15 : 0x00000001 SAR : 0x0000000a EXCCAUSE: 0x00000006 EXCVADDR: 0x40000000 LBEG : 0x40095380 LEND : 0x4009538b LCOUNT : 0x00000000

ELF file SHA256: 0000000000000000000000000000000000000000000000000000000000000000

Backtrace: 0x40090ea3:0x3ffda970 0x40090aae:0x3ffda9b0 0x40090cfe:0x3ffda9d0 0x4008341e:0x3ffda9f0 0x40094aad:0x3ffdaab0 0x4021e68d:0x3ffdaad0 0x4021e9ce:0x3ffdaaf0 0x4021ec09:0x3ffdab10 0x4021ecac:0x3ffdab30 0x40203f7a:0x3ffdab50 0x4020bdfe:0x3ffdac80 0x40200185:0x2a080000

Re-entered core dump! Exception happened during core dump! Rebooting...

If I re-flash the device with firmware 1.20.2.rc4 using the firmware update tool it recovers - but has no PyMesh.

As soon as I add the device to my PyMesh project on pybytes.pycom.io, it downloads the update and then starts into the cycle of CoreDumps again.

Is there a way to completely wipe the board back to factory settings as I assume the 'LoadStoreError' is fetching something from a store which is causing the crash? Or is there some other way to recover this board so I can use on Pymesh?

gijsio commented 3 years ago

Hi, You could use the firmware updater tool and clear the NVS and CONFIG partitions in the advanced section, as well as erase the flash.

I think the issue might be related to the program you have running on the device (you can clear this using os.fsformat("/flash")), though you could elaborate on the pybytes debugging output by using import pycom; pycom.nvs_set('pybytes_debug', 99)

Let me know!

ajmilford commented 3 years ago

Thanks for the comment, but did not fix it.

After re-flashing and erasing everything it was back at 1.20.2.rc4 I then re-deployed pymesh and it updated itself to 1.20.2.rc11 All still OK.

I then turned off pybytes and reset using

pycom.pybytes_on_boot(False)
machine.reset()

Still all good.

I then tried executing the following commands in the terminal:

>>> from network import LoRa
>>> lora = LoRa(mode=LoRa.LORA, region=LoRa.EU868, frequency = 863000000, bandwidth=LoRa.BW_125KHZ, sf=7)
>>> pymesh = lora.Mesh()

and it was the last command which then caused a core dump:

Guru Meditation Error: Core 0 panic'ed (LoadProhibited). Exception was unhandled. Core 0 register dump: PC : 0x400965c8 PS : 0x00060033 A0 : 0x400980a7 A1 : 0x3ffbb780 A2 : 0x3ffc4428 A3 : 0x3ffc4174 A4 : 0x00050023 A5 : 0x3ffb4d38 A6 : 0x3ffc4604 A7 : 0x3ffbb940 A8 : 0x3ffc42ac A9 : 0x3ffdfb30 A10 : 0x00000001 A11 : 0x00000000 A12 : 0x00000000 A13 : 0x00000006 A14 : 0x3ffc419c A15 : 0x3ffc42b4 SAR : 0x00000014 EXCCAUSE: 0x0000001c EXCVADDR: 0x00000048 LBEG : 0x00000000 LEND : 0x00000000 LCOUNT : 0x00000000

ELF file SHA256: 0000000000000000000000000000000000000000000000000000000000000000

Backtrace: 0x400965c8:0x3ffbb780 0x400980a4:0x3ffbb7a0 0x4009805a:0x3ffbb7b0 0x4009891f:0x4023765e

Guru Meditation Error: Core 0 panic'ed (IntegerDivideByZero). Exception was unhandled. Core 0 register dump: PC : 0x40090ea3 PS : 0x00060033 A0 : 0x80090ab1 A1 : 0x3ffbb640 A2 : 0x3ffbb6c0 A3 : 0x00000000 A4 : 0x00002800 A5 : 0x00000030 A6 : 0x0000002f A7 : 0x00051fe0 A8 : 0x00000000 A9 : 0x3ffbb630 A10 : 0xb44ccfdb A11 : 0x00000000 A12 : 0x00000018 A13 : 0x3ffbb658 A14 : 0x3ffc173f A15 : 0x00000001 SAR : 0x0000000a EXCCAUSE: 0x00000006 EXCVADDR: 0x00000048 LBEG : 0x40095380 LEND : 0x4009538b LCOUNT : 0x00000000

ELF file SHA256: 0000000000000000000000000000000000000000000000000000000000000000

Backtrace: 0x40090ea3:0x3ffbb640 0x40090aae:0x3ffbb680 0x40090cfe:0x3ffbb6a0 0x4008341e:0x3ffbb6c0 0x400965c5:0x3ffbb780 0x400980a4:0x3ffbb7a0 0x4009805a:0x3ffbb7b0 0x4009891f:0x4023765e

Re-entered core dump! Exception happened during core dump! Rebooting...

On other Lopy4 devices this works OK.

gijsio commented 3 years ago

Hi, You should not import / use the LoRa module separately. Please have a look at this example: https://docs.pycom.io/pymesh/simple-example/. The Pymesh object handles the initialization of the LoRa radio. Im guessing that is where your issue is.

ajmilford commented 3 years ago

But the same code works fine on other LoPy4 units.

It is a valid method of using PyMesh according to https://docs.pycom.io/firmwareapi/pycom/network/lora/pymesh/ and https://forum.pycom.io/topic/5038/lora-object-has-no-attribute-mesh/2

If I do use the method suggested on https://docs.pycom.io/pymesh/simple-example/ then I get the same core dump when I execute

pymesh = Pymesh(pymesh_config, new_message_cb)

gijsio commented 3 years ago

Sorry for the confusion, I was indeed mistaken. I had some trouble starting the Pymesh, but figured you'd have to disable pybytes in order to run the pymesh without issues. If it works on the your other Lopy4 devices, I suggest you clear the CONFIG and NVS partitions through the firmware updater tool of the non-working Pymesh device, and then afterwards reload the pymesh firmware, that should do the trick

ajmilford commented 3 years ago

But it didn't, as per my comment

"After re-flashing and erasing everything it was back at 1.20.2.rc4"

this intended to indicate I did erase everything - including CONFIG and NVS and file system But as per that comment: after then restoring pymesh via pybytes it still ends up in the same broken core dump.

gijsio commented 3 years ago

I have thought about your issue and the possible differences as to why one device does not work, and the others do. Also, I was not able to reproduce the issue on my Lopy4's. Could you perhaps provide the following information several times in the code, on a device that works, and one that does not?

import micropython
import gc
import pycom
def memory():
    machine.info()
    print("GC free=", gc.mem_free(), "alloc=", gc.mem_alloc(), "total=", gc.mem_alloc() + gc.mem_free())
    # micropython.mem_info()
    print("mp stack", micropython.stack_use())
    print("heap internal", pycom.get_free_heap()[0])
    print("heap external", pycom.get_free_heap()[1])

ajmilford commented 3 years ago

I've now had a second LoPy4 get into this state!

It has been working OK, I did a minor edit to one of my files, uploaded and now every reboot it crashes in the same place. The code I changed was unrelated to the Pymesh stuff, it was a simple bit of debug.

ajmilford commented 3 years ago

To confirm, this second board is now un-useable.

I removed all my code.

Then just entered:

>>> from network import LoRa
>>> lora = LoRa(mode=LoRa.LORA, region=LoRa.EU868, frequency = 863000000, bandwidth=LoRa.BW_125KHZ, sf=7)
>>> pymesh = lora.Mesh()

and it crashed. Does this every time now. I now have two LoPy4 boards which will not run pymesh.

ajmilford commented 3 years ago

To test the memory code in an earlier suggestion.

I entered the following program:

import pycom
import time
import _thread

pycom.heartbeat(False)

def runner(arg1, arg2):
    while True:
        pycom.rgbled(0xFF0000)
        time.sleep(1)
        pycom.rgbled(0x00FF00)
        time.sleep(1)
        pycom.rgbled(0x0000FF)
        time.sleep(1)

_thread.start_new_thread(runner, (1,2))

And it runs, flashing the led as expected. From the command line I then ran the memory test.

1st result:

---------------------------------------------
System memory info (in bytes)
---------------------------------------------
MPTask stack water mark: 8428
ServersTask stack water mark: 3652
LoRaTask stack water mark: 3596
SigfoxTask stack water mark: 2936
TimerTask stack water mark: 2212
IdleTask stack water mark: 608
System free heap: 366684
---------------------------------------------
GC free= 2559216 alloc= 2128 total= 2561344
mp stack 1072
heap internal 117536
heap external 261904

After a while:

---------------------------------------------
System memory info (in bytes)
---------------------------------------------
MPTask stack water mark: 8428
ServersTask stack water mark: 3652
LoRaTask stack water mark: 3564
SigfoxTask stack water mark: 2936
TimerTask stack water mark: 2212
IdleTask stack water mark: 608
System free heap: 366684
---------------------------------------------
GC free= 2558640 alloc= 2704 total= 2561344
mp stack 1072
heap internal 117536
heap external 261904

why would the GC alloc keep increasing?

Clearly I cannot run this when getting the crash because as soon as I run lora.Mesh() it crashes and reboots!

gijsio commented 3 years ago

Concerning your question about the GC alloc, there could be things happening in the background, Im not too sure, but eventually the GC will collect all garbage and free it again when necessary. There should not be an issue there. The more interesting part is generally the heap internal which is prone to running out, as it allocates all tasks.

I was (finally) able to debug your coredumps, but they do not show anything 'interesting' per se. Could the issue be related to the power supply perhaps? You mention a second device behaving in the same way suddenly when changing the code in a different section. In the past we've seen coredumps when there's power limitations that do not trigger the 'brownout' detect.

ajmilford commented 3 years ago

The LoPy4 is on an expansion board powered via the USB.

Narashiman commented 2 years ago

Hi, Even i am facing the same issue. I am using Frozen Pymesh code only with few modifications for my application. It works fine. But completely all of a sudden one or the other device starts crashing and then it would be in a loop of boot , crash and restart. In logs i see it connects with WiFi and sends MQTT data, then it starts my application and then it crashes. U cannot do anything with device. Cant control from atom or stop the execution. The workaround i am doing right now, is format the lopy4 with new pybytes firmware with check for clear nvs, flash and other options. After that i deploy pymesh firmware and my frozen code. With that the device is recovered and wont have any issues,

pycom / pycom-micropython-sigfox

LoPy4 with PyMesh continually faulting with core dump #547