pycom / pycom-micropython-sigfox

A fork of MicroPython with the ESP32 port customized to run on Pycom's IoT multi-network modules.
MIT License

GPY on reboot - (496) esp_image: Checksum failed. #528

Open knapikm opened 3 years ago

knapikm commented 3 years ago

Hello,

I currently use GPY boards (with Pytrack), and I found out that some of them end up with corrupted memory.

I have a custom build of the 1.20.2.r3 firmware (LittleFS), with my own code added to the frozen directory. This firmware was flashed onto the devices directly (no OTA; OTA was never even attempted).

A non-working board still responds to the Pycom Firmware Updater, so I can flash it with clean firmware, but I would like to figure out what happened and what is causing the problem.

The only output I can get from non-working boards is this:

ets Jun  8 2016 00:22:57

rst:0x1 (POWERON_RESET),boot:0x17 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:1
load:0x3fff8020,len:8
load:0x3fff8028,len:2128
load:0x4009fa00,len:19824
entry 0x400a05d4
E (496) esp_image: Checksum failed. Calculated 0x3c read 0x6e

or, on another board:

...
entry 0x400a05d4
E (494) esp_image: Checksum failed. Calculated 0xb4 read 0x6e
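
For context, the check behind this message can be reproduced offline on a firmware image or a flash dump. The sketch below follows the ESP-IDF application image format as I understand it: a 24-byte header on the ESP32 (assumed here), then segments that each carry an 8-byte header, then a checksum byte that is the XOR of all segment data bytes seeded with 0xEF and stored as the last byte of the next 16-byte block. The function names and the demo image are hypothetical and only illustrate the calculation. In both logs the stored byte is 0x6e while the calculated value differs, which would be consistent with the segment data having changed rather than the checksum byte itself.

```python
import struct

ESP_IMAGE_MAGIC = 0xE9     # first byte of an ESP image
ESP_CHECKSUM_SEED = 0xEF   # initial value of the XOR checksum
IMAGE_HEADER_LEN = 24      # assumption: 8-byte common + 16-byte extended header
SEGMENT_HEADER_LEN = 8     # load address (u32) + data length (u32)


def image_checksum(image: bytes):
    """Return (calculated, stored) checksum bytes for an image held in memory."""
    if len(image) < IMAGE_HEADER_LEN or image[0] != ESP_IMAGE_MAGIC:
        raise ValueError("not an ESP image (bad magic byte)")
    segment_count = image[1]
    offset = IMAGE_HEADER_LEN
    calculated = ESP_CHECKSUM_SEED
    for _ in range(segment_count):
        _load_addr, length = struct.unpack_from("<II", image, offset)
        offset += SEGMENT_HEADER_LEN
        data = image[offset:offset + length]
        if len(data) != length:
            raise ValueError("segment runs past the end of the image")
        for byte in data:
            calculated ^= byte
        offset += length
    # The checksum byte is the last byte of the 16-byte block after the segments.
    checksum_offset = offset | 0x0F
    if checksum_offset >= len(image):
        raise ValueError("image is truncated before the checksum byte")
    return calculated, image[checksum_offset]


def build_demo_image(payload: bytes) -> bytes:
    """Build a tiny one-segment image (hypothetical, for demonstration only)."""
    header = bytes([ESP_IMAGE_MAGIC, 1]) + bytes(IMAGE_HEADER_LEN - 2)
    segment = struct.pack("<II", 0x3FFF0000, len(payload)) + payload
    body = header + segment
    checksum = ESP_CHECKSUM_SEED
    for byte in payload:
        checksum ^= byte
    padding = bytes(15 - len(body) % 16)   # zero padding up to the checksum slot
    return body + padding + bytes([checksum])
```

Flipping a single bit in the segment data of the demo image changes the calculated value but not the stored one, which is the same pattern as in the two logs.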

This starts occurring after a random number of cycles (resets, deep_sleeps, machine.deepsleeps, or external wake-ups).

From my custom logs I can't find anything that could cause the problem; nothing unusual happened. The devices are used for tracking, so they are woken up periodically in cycles, and a device can be restarted several times during a cycle. There are no I/O operations on flash memory, only on the SD card. The code uses the WDT to recover from a deep_sleep failure or from getting stuck.

I have no idea where to start looking for a solution. Is it caused by my custom code? Is it a Pycom FW / HW problem?

robert-hh commented 3 years ago

It looks like the flash gets corrupted? Do you write to files in your application and go to deepsleep soon after that?

knapikm commented 3 years ago

No, there is no I/O operation on flash memory, only on the SD card.

robert-hh commented 3 years ago

Do you use NVRAM store operations? NVRAM is actually located in flash.

knapikm commented 3 years ago

No, I don't use NVRAM.

robert-hh commented 3 years ago

Is the device powered by battery or by USB? And do I understand correctly that, once the error occurs, it stays permanent until the firmware is flashed again?

knapikm commented 3 years ago

Battery, but it can be charged via USB. Yes, it stays permanent.

robert-hh commented 3 years ago

It is hard to tell what's going on. You could try one thing:

The error is reported by the boot loader, but is not caused there.

knapikm commented 3 years ago

Yes, I will try it, but that probably won't be the problem: I found in the logs that the devices on which this happened had around 50% battery (Li-Ion battery), and I calculate the battery level from the min and max voltage that the device was able to measure (or run at, in the case of the min voltage).

I don't think this is caused by a brownout or something like that. I have never had a problem with devices after unplugging the power (USB or battery) without hesitation while the device was running. I know it is not the same, but I don't see a reason why this should cause flash corruption.

robert-hh commented 3 years ago

Average battery voltage is one aspect; low impedance for short current draws is the other. Can you connect a separate UART/USB adapter, Tx (p1) -> RX (adapter) and GND, and log the output, so you can see what happened before the flash got corrupted? Or did you do that already? You have to use an external adapter to avoid powering through USB.
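
A minimal sketch of such a logger on the host side, assuming the adapter shows up as a serial port and the board prints at 115200 baud (an assumption to verify on your setup). `log_lines` is a hypothetical helper that timestamps each received line, so the last messages before a failure can be matched against the device's own logs. It only needs an object with `readline()`, so it can be tried without hardware.

```python
import time


def log_lines(stream, sink, clock=time.time, stop_on_empty=False):
    """Copy lines from a byte stream to a text sink, prefixing a timestamp.

    stream: any object with readline() returning bytes (e.g. a serial port)
    sink:   any text file-like object with write() and flush()
    Returns the number of lines written (only reachable if stop_on_empty).
    """
    count = 0
    while True:
        raw = stream.readline()
        if not raw:
            # A serial port with a timeout returns b"" while the line is idle;
            # a finite stream returns b"" at end of data.
            if stop_on_empty:
                return count
            continue
        text = raw.decode("utf-8", errors="replace").rstrip("\r\n")
        sink.write("%.3f %s\n" % (clock(), text))
        sink.flush()   # keep the file current in case the host loses power
        count += 1


# Usage with pyserial (third-party); port name and file name are placeholders:
#
#   import serial
#   with serial.Serial("/dev/ttyUSB0", 115200, timeout=1) as port, \
#           open("gpy_boot.log", "a") as log:
#       log_lines(port, log)
```

Flushing after every line costs little at this data rate and means the log on disk ends at the last line the board actually sent.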

knapikm commented 3 years ago

The battery the devices are running on is a 4P Li-Ion with a max current draw > 1A, so it should be physically impossible for the battery voltage to drop below the brownout threshold. The fact that a device with 50% battery had this occur leads me to conclude that this is not the issue. I want to try all of the other, more feasible options before I go chasing rabbits.

I'd love to capture the logs of when this happens live, but unfortunately, it's not very reproducible. Out of ~60 devices, only 3 had this happen in the first week; all of the others are still running just fine. I haven't been able to reproduce it in the lab environment.

robert-hh commented 3 years ago

I have never seen a flash corruption of the firmware image like the one you describe. Corruption of the file system part happens, usually when using the FAT file system. And since the change is permanent, it is a change to the flash and not a transient read error during boot. So some kind of write operation must have been called. There are code sections in the firmware which write to flash, e.g. the code which does OTA updates, but that has to be called. That may happen when the code goes crazy, but I would not expect to see that by chance on three devices. Are the three devices which failed in any way different from the others, like in use for a longer time, a different version, etc.? Does anyone from Pycom have a clue? @peter-pycom ?
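
One way to narrow this down would be to read the flash back from a failed board before reflashing it (for example with esptool's `read_flash` command) and compare the dump with the same region of a known-good image. The sketch below is a hypothetical helper for that comparison, not a Pycom tool. It lists the byte ranges that differ and classifies each one using a general property of NOR flash: an erase sets bytes to 0xFF, and programming can only clear bits.

```python
def diff_ranges(reference: bytes, dump: bytes):
    """Return half-open (start, end) ranges where the two buffers differ."""
    ranges = []
    start = None
    limit = min(len(reference), len(dump))
    for i in range(limit):
        if reference[i] != dump[i]:
            if start is None:
                start = i
        elif start is not None:
            ranges.append((start, i))
            start = None
    if start is not None:
        ranges.append((start, limit))
    return ranges


def classify(reference: bytes, dump: bytes, start: int, end: int) -> str:
    """Describe how dump[start:end] differs from reference[start:end]."""
    ref = reference[start:end]
    got = dump[start:end]
    if all(b == 0xFF for b in got):
        return "erased"          # region was erased and not rewritten
    if all((r & g) == g for r, g in zip(ref, got)):
        return "bits cleared"    # consistent with programming without an erase
    return "rewritten"           # some bits went 0 -> 1: erase plus new data


def report(reference: bytes, dump: bytes):
    """Return (start, end, kind) for every differing range."""
    return [(s, e, classify(reference, dump, s, e))
            for s, e in diff_ranges(reference, dump)]
```

Ranges aligned to the flash sector size and reported as "erased" would point to an erase call, while scattered "bits cleared" bytes would point to a stray program operation or a marginal write.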

knapikm commented 3 years ago

These 3 devices are the same as the others: not in use for a longer time, and not a different version.

peter-pycom commented 3 years ago

Are you able to share your code? And/or describe what it does before reset/deepsleep? Are you using some of the pycoproc functions of the pytrack? Can you describe functionality wrt power draw? Like BLE, wifi, LTE, SD transactions, internal/external sensors/actuators...