homieiot / homie-esp8266

💡 ESP8266 framework for Homie, a lightweight MQTT convention for the IoT
http://homieiot.github.io/homie-esp8266
MIT License
1.36k stars 306 forks source link

MQTT OTA csum error in reboot after "OTA sucess" #204

Closed Gulaschcowboy closed 6 years ago

Gulaschcowboy commented 7 years ago

Hi,

with latest commits of homie-esp8266 I get the following result when updating my ESP8266-07s using homie-ota:

✴ OTA available (version 1.0.11) OTA started Triggering OTA_STARTED event... Firmware is base64-encoded

Receiving OTA firmware (353600/353600)... ✔ OTA success Triggering OTA_SUCCESSFUL event... Triggering MQTT_PACKET_ACKNOWLEDGED event (packetId 51)... Device is idle ↻ Rebooting... Triggering MQTT_PACKET_ACKNOWLEDGED event (packetId 375)... ��� MQTT disconnected Triggering MQTT_DISCONNECTED event... ↕ Attempting to connect to MQTT... ✖ Wi-Fi disconnected Triggering WIFI_DISCONNECTED event... ↕ Attempting to connect to Wi-Fi... ets Jan 8 2013,rst cause:1, boot mode:(3,6) load 0x4010f000, len 1384, room 16 tail 8 chksum 0x2d csum 0x2d v3de0c112 @cp:0 ld ets Jan 8 2013,rst cause:4, boot mode:(3,6) wdt reset load 0x4010f000, len 1384, room 16 tail 8 chksum 0x1d csum 0x1d csum err Flashing same binary using serial works. Any ideas? Thanks, Alex
clough42 commented 7 years ago

I am experiencing something similar. I can get the OTA to upload to the device, and it apparently succeeds, but the device fails to reboot. If I rest the device, I get only the off-baud garbage string and the device doesn't come up.

If I flash the same image via serial, it works.

I am building with PlatformIO, if that makes a difference. I have the latest code as of today from the develop branch and the latest versions of the dependencies, from their respective repositories.

Here's my output, for reference. [...snip...] is where I took out long sequences of similar messages:

`** Booting into normal mode **
{} Stored configuration
  • Hardware device ID: d1008cef
  • Device ID: firelight1
  • Boot mode: normal
  • Name: FireLight
  • Wi-Fi:
    ◦ SSID: clough
    ◦ Password not shown
  • MQTT:
    ◦ Host: mqtt
    ◦ Port: 1883
    ◦ Base topic: devices/
    ◦ Auth? no
  • OTA:
    ◦ Enabled? yes
↕ Attempting to connect to Wi-Fi...

✔ Wi-Fi connected
Triggering WIFI_CONNECTED event...
↕ Attempting to connect to MQTT...
Sending initial information...
✔ MQTT ready
Triggering MQTT_CONNECTED event...
Calling setup function...
Sending statistics...
  • Wi-Fi signal quality: 100%
  • Uptime: 1s
Triggering MQTT_PACKET_ACKNOWLEDGED event (packetId 1)...
Triggering MQTT_PACKET_ACKNOWLEDGED event (packetId 2)...

[...snip...]

Triggering MQTT_PACKET_ACKNOWLEDGED event (packetId 21)...
Triggering MQTT_PACKET_ACKNOWLEDGED event (packetId 22)...
Triggering MQTT_PACKET_ACKNOWLEDGED event (packetId 23)...
✴ OTA available (version 1.0.10)
Triggering MQTT_PACKET_ACKNOWLEDGED event (packetId 24)...

OTA started
Triggering OTA_STARTED event...
Firmware is base64-encoded
Receiving OTA firmware (1055/348720)...
Receiving OTA firmware (2150/348720)...
Receiving OTA firmware (3245/348720)...

[...snip...]

Receiving OTA firmware (344885/348720)...
Receiving OTA firmware (345980/348720)...
Receiving OTA firmware (347075/348720)...
Receiving OTA firmware (348170/348720)...
Receiving OTA firmware (348720/348720)...
✔ OTA success
Triggering OTA_SUCCESSFUL event...
Triggering MQTT_PACKET_ACKNOWLEDGED event (packetId 25)...
Triggering MQTT_PACKET_ACKNOWLEDGED event (packetId 26)...
Triggering MQTT_PACKET_ACKNOWLEDGED event (packetId 27)...

[...snip...]

Triggering MQTT_PACKET_ACKNOWLEDGED event (packetId 341)...
Triggering MQTT_PACKET_ACKNOWLEDGED event (packetId 342)...
Device is idle
↻ Rebooting...
✖ MQTT disconnected
Triggering MQTT_DISCONNECTED event...
↕ Attempting to connect to MQTT...
✖ Wi-Fi disconnected
Triggering WIFI_DISCONNECTED event...
↕ Attempting to connect to Wi-Fi...

 ets Jan  8 2013,rst cause:2, boot mode:(3,7)

load 0x4010f000, len 1384, room 16
tail 8
chksum 0x2d
csum 0x2d
v3ffe84e8
@cp:0
ld
e:
 ets Jan  8 2013,rst cause:3, boot mode:(3,7)

ets_main.c`
clough42 commented 7 years ago

Oh, I'm using the latest homie-ota python code, and I'm using mosquitto as my mqtt broker, if any of those things matter.

dasMopo commented 7 years ago

Just to confirm the issue - I experienced the same, but with some minor differences / additions:

marvinroger commented 7 years ago

@mrpace2 mind taking a look? #208 seems related

ghost commented 7 years ago

Since we responded with 200 we did commit the update to flash and the arduino/esp8266 Updater code told us that the update was successfully written to flash. With #208, he got a 400. That is, we found that something was wrong with the OTA payload or the MD5.

The one commonality between #204 and #208 is that in both cases we do reboot. According to https://github.com/esp8266/Arduino/issues/1017, even the latest arduino/esp8266 still has issues with restart after OTA. Could this be our problem here? @clough42, @Gulaschcowboy can you please check if it helps to reset or power cycle your device after serial flashing but before trying OTA update?

ghost commented 7 years ago

@marvinroger I wonder about the Homie output in between ↻ Rebooting... and ets Jan 8 2013,rst cause:.... Right after we print ↻ Rebooting... we flush the serial port and we call ESP.restart(). Shouldn't we see the boot message before anything else from Homie? Could it be possible that it restarted again during startup?

Also, varying reset causes in the logs above (1=normal boot, 4=watchdog, 2=reset pin, 3=software reset). See https://github.com/esp8266/Arduino/blob/master/doc/boards.md#boot-messages-and-modes

marvinroger commented 7 years ago

@mrpace2 the serial output after the ESP.restart() is not alarming (see https://github.com/esp8266/Arduino/blob/master/cores/esp8266/Esp.cpp#L140), the reboot is not immediate.

They both use homie-ota, which, correct me if I am wrong, does not send an MD5. So the flash appears as successful, but the Updater does not actually check if the firmware is valid. @clough42 said

If I reset the device, I get only the off-baud garbage string and the device doesn't come up

So the flashed firmware is definitely wrong. Which, along with #208, makes me think the b64 decoding is broken.

marvinroger commented 7 years ago

@clough42 @Gulaschcowboy can you provide your firmwares binaries?

marvinroger commented 7 years ago

Can you test again with the latest git rev?

Gulaschcowboy commented 7 years ago

Hi Marvin,

tested right now with latest homie-ota and latest homie rev. Result is similar:

✴ OTA available (version 1.0.2) OTA started Triggering OTA_STARTED event... Firmware is base64-encoded Receiving OTA firmware (985/350673)...

Receiving OTA firmware (349825/350673)... Receiving OTA firmware (350672/350672)... ✔ OTA success Triggering OTA_SUCCESSFUL event... Triggering MQTT_PACKET_ACKNOWLEDGED event (packetId 50)... Device is idle ↻ Rebooting... ✖ MQTT disconnected Triggering MQTT_DISCONNECTED event... ↕ Attempting to connect to MQTT... ✖ Wi-Fi disconnected Triggering WIFI_DISCONNECTED event... ↕ Attempting to connect to Wi-Fi... ets Jan 8 2013,rst cause:2, boot mode:(3,6) load 0x4010f000, len 1384, room 16 tail 8 chksum 0x2d csum 0x2d v3de0c112 @cp:0 ld ets Jan 8 2013,rst cause:4, boot mode:(3,6) wdt reset load 0x4010f000, len 1384, room 16 tail 8 chksum 0xd6 csum 0xd6 csum err ets_main.c
ghost commented 7 years ago
✔ OTA success
...
v3de0c112
@cp:0
ld

indicates that

It crashed afterwards:

ets Jan 8 2013,rst cause:4, boot mode:(3,6)

is a watchdog reset. It seems to have happened before Homie's first log output. Maybe the firmware was published and saved correctly, but the firmware binary itself was broken? Does the same BIN file work when you upload it over the serial port?

If you share your BIN and BASE64 files, I could try them out on one of my setups. I don't have an ESP-07, though.

Another thing you could try are using MD5 checksums. Basically, you checksum the BIN file and Homie verifies the checksum right before committing the received OTA update to flash. This would help ruling out any issues with

Gulaschcowboy commented 7 years ago

Hi @mrpace2

Binary attached. Please note, it's actually an uncompressed .bin just renamed to .zip because of GH limitations and my limitation of being "half offline" in a hotel. So please just rename. homie-temp.ino.generic.zip

ghost commented 7 years ago

Thanks for sending the files. I will try them out in a bit. I can test on Nodemcu.

Homie automatically detects is a firmware blob is binary or base64, based on the first firmware byte which is always 0xE9. It did detect base64 in your case.

Recent homie-ota does support base64. It was added a few weeks ago. A few minutes ago I created a new homie-ota PR adding support for MD5 checksums for the new OTA sequence (see #220).

clough42 commented 7 years ago

Here's the last binary I attempted to use:

firelight.zip

This is a zip file, containing the .bin file. It's compiled for a WeMos D1 mini, which has a 4M flash, with the default PlatformIO configuration. I'm actually using an ESP12F module, but the specs are the same.

I am attempting to flash with Homie-OTA.

The source code is here: https://github.com/clough42/homie-firelight/blob/master/src/main.cpp

ghost commented 7 years ago

@clough42, thanks, will test shortly.

@Gulaschcowboy, can you also provide mqtt and serial log files and the base64 file you were uploading?

ghost commented 7 years ago

@clough42 Running the latest Homie code, I uploaded the bin file you provided (a) using my little script (attached) and (b) using the latest homie-ota. I can upload your bin file without problems. The board restarts and launches your firmware. See the attached logs.

Please try again:

  1. recompile your app to use the latest homie-esp8266
  2. download the BIN file once over serial
  3. hard-reset or power cycle your board
  4. run OTA either using the latest homie-ota or something along the lines of the bash script that I attached.

Please report results. If it still fails, please provide logs.

firelight-tests.zip

ghost commented 7 years ago

@Gulaschcowboy Same thing here. It works fine on my setup. See the attached logs, this time generated using your firmware blob. When you're back home, please try again (same as above). Have a safe trip!

homie-temp-tests.zip

Gulaschcowboy commented 7 years ago

Hi @mrpace2

  1. check
  2. check
  3. check
  4. via homie-ota

Result:

Triggering MQTT_PACKET_ACKNOWLEDGED event (packetId 32)... ✴ OTA available (checksum 745b6573bfdaafbee0805739b779f63d) ��� OTA started Triggering OTA_STARTED event... Firmware is base64-encoded Receiving OTA firmware (985/350880)... Receiving OTA firmware (2005/350880)...

receiving OTA firmware (350880/350880)... Exception (3): epc1=0x4000bf64 epc2=0x00000000 epc3=0x00000000 excvaddr=0x4024a39b depc=0x00000000 ctx: sys sp: 3ffffad0 end: 3fffffb0 offset: 01a0 > > > stack>>> > > > 3ffffc70: 3ffef460 3fff05d4 3ffffcac 40219ad6 > > > 3ffffc80: 4024a39b 3ffef460 3ffffcb0 4010068c > > > 3ffffc90: 4024a3ea 000002f7 00000007 4020c179 > > > 3ffffca0: 00072380 3ffef460 00000197 3fff3674 > > > 3ffffcb0: 0000000f 00000000 00000000 402199a8 > > > 3ffffcc0: 00072380 3fff05d4 3ffef460 4020c6d4 > > > 3ffffcd0: 00000000 00000000 00000000 00000100 > > > 3ffffce0: 3fff34dc 0000000f 0000000d 3ffeddc8 > > > 3ffffcf0: 3ffedd50 00000000 4021eec3 00000000 > > > 3ffffd00: ffffffff 00000000 3ffe9d51 00000000 > > > 3ffffd10: 4021ef16 3ffed398 3fff1fa4 00000000 > > > 3ffffd20: 40000000 00072352 0000002e 3fff2762 > > > 3ffffd30: 00000023 00072380 00000020 00000000 > > > 3ffffd40: 3ffea52e 3fff31e4 3ffffda0 00000000 > > > 3ffffd50: 3ffea52e 3fff31dc 00000000 4021b9d4 > > > 3ffffd60: 00072380 00000000 00000000 4020babf > > > 3ffffd70: 40000000 3fff31cc 00000000 4021bcd2 > > > 3ffffd80: 3fff2364 0276ee4c 00002200 40210014 > > > 3ffffd90: 00072380 40103080 3fffc200 00000002 > > > 3ffffda0: 3fff3644 00000003 4020ba9c 4021b9ac > > > 3ffffdb0: 40000000 00000030 0000001a ffffffff > > > 3ffffdc0: 3fff31dc 3fff275c 00000000 3fff3934 > > > 3ffffdd0: 3fff3934 3fff2370 3fff2364 00000159 > > > 3ffffde0: 3ffffe50 0000002e 3fff3234 4021bb6d > > > 3ffffdf0: 0000002e 00072352 00072380 00000000 > > > 3ffffe00: 00000022 4010424d 00040000 40210fd5 > > > 3ffffe10: 0000002e 00072352 00072380 00000000 > > > 3ffffe20: 3ffea52e 00000550 3fff00fc 4020fe09 > > > 3ffffe30: 3fff0570 00000279 00000279 00000080 > > > 3ffffe40: 40104250 00040000 7fffffff 00000000 > > > 3ffffe50: 00000000 4010424d 00040000 40107378 > > > 3ffffe60: 3fff0255 4010322a 3fff3934 4023ecd8 > > > 3ffffe70: 3fff2364 3fff1cc4 00002200 3fff0150 > > > 3ffffe80: 00000001 3fff3284 3fff00fc 4020fe3c > > > 3ffffe90: 3ffffea0 3fff3284 3fff00fc 402115f0 > > > 3ffffea0: 3fff387c 00000030 0000001a ffffffff > > > 3ffffeb0: 3fff1ea4 3fff1eb0 00000006 3fff1cb8 > > > 3ffffec0: 3fff387c 3fff1cc0 3fff1cbc 40211643 > > > 3ffffed0: 3fff387c 3fff1cc0 3fff1cbc 402404ec > > > 3ffffee0: 00000550 3e2ba8c0 00000010 00000550 > > > 3ffffef0: 00000010 00000000 0000010d 00000004 > > > 3fffff00: 40230000 00000000 0000007d 3fff1ea8 > > > 3fffff10: 3ffea506 3fff1eb0 3fff3284 4023e9f1 > > > 3fffff20: 3fff1b9c 3fff2364 3fff2364 4023cea1 > > > 3fffff30: 3fff1b9c 3fff1b98 00c50db3 3fff2364 > > > 3fffff40: 3ffea4f8 00000000 3fff3284 4023dded > > > 3fffff50: 2a358f3e 00000000 4010522e 3ffeddc8 > > > 3fffff60: 401059aa 00c526e9 3ffeee00 60000600 > > > 3fffff70: 40232c18 3ffe9a34 3ffeee00 0276f726 > > > 3fffff80: 40232c3e 3fffdab0 00000000 3fff23b4 > > > 3fffff90: 3fffdc80 00000000 3fff3284 40232863 > > > 3fffffa0: 40000f49 3fffdab0 3fffdab0 40000f49 > > > << Second attempt causes same result. Question: What should be the setting for homie-ota.ini: OTA_FIRMWARE_BASE64 = true or false? I tried both (of course including restart of homie-ota and re-upload of firmware.bin) Same result. What else should I provide to help?
Gulaschcowboy commented 7 years ago

Hi @mrpace2,

tried both ways that you described in your attached zip file. Both ways (homie-ota and your script) end in:

Receiving OTA firmware (350845/350880)... Receiving OTA firmware (350880/350880)...

Exception (3): epc1=0x4000bf64 epc2=0x00000000 epc3=0x00000000 excvaddr=0x4024a39b depc=0x00000000

ctx: sys sp: 3ffffad0 end: 3fffffb0 offset: 01a0

stack>>> 3ffffc70: 3ffef460 3fff05d4 3ffffcac 40219ad6
3ffffc80: 4024a39b 3ffef460 3ffffcb0 4010068c
3ffffc90: 4024a3ea 000002c1 00000007 4020c179
3ffffca0: 00072380 3ffef460 0000019d 3fff34c4
3ffffcb0: 0000000f 00000000 00000000 402199a8
3ffffcc0: 00072380 3fff05d4 3ffef460 4020c6d4
3ffffcd0: 00000000 00000000 00000000 00000000
3ffffce0: 3fff36c4 0000000f 0000000d 00000000
3ffffcf0: 3ffeddc8 4010322a 4021eec3 00000000
3ffffd00: ffffffff 00000000 3ffe9d51 00000000
3ffffd10: 4021ef16 3ffed320 3fff1fa4 00000000
3ffffd20: 40000000 00072352 0000002e 3fff2762
3ffffd30: 00000023 00072380 00000020 00000000
3ffffd40: 3ffeab72 3fff31e4 3ffffda0 00000000
3ffffd50: 3ffeab72 3fff31dc 00000000 4021b9d4
3ffffd60: 00072380 4010322a 00000000 4020babf
3ffffd70: 40000000 3fff31cc 00000000 4021bcd2
3ffffd80: 3fffc278 40103080 3fffc200 40210014
3ffffd90: 00072380 40103080 3fffc200 00000002
3ffffda0: 3fff330c 00000030 4020ba9c 4021b9ac
3ffffdb0: 40000000 3ffeadb6 0000030a 003c3dda
3ffffdc0: 3fff31dc 3fff275c 00000000 00000001
3ffffdd0: ffffffff 3fffc6fc 000000d6 00000159
3ffffde0: 3ffffe50 0000002e 3fff35dc 4021bb6d
3ffffdf0: 0000002e 00072352 00072380 00000000
3ffffe00: 4022495e 3fff1fa4 3ffeb1e5 40210fd5
3ffffe10: 0000002e 00072352 00072380 00000000
3ffffe20: 3ffeab72 00000550 3fff00fc 4020fe09
3ffffe30: 3fff0570 000002f1 000002f1 00000080
3ffffe40: 00000000 00000006 3fff372c 4023ec98
3ffffe50: 00000000 00000000 00000000 40107378
3ffffe60: 3fff0255 00000006 3fff372c 4023ecd8
3ffffe70: 3fff2364 3fff1cc4 3fff53c4 3fff0150
3ffffe80: 00000001 3fff3644 3fff00fc 4020fe3c
3ffffe90: 00000001 3fff3644 3fff00fc 402115f0
3ffffea0: 3fff3234 3fff3238 3fff3234 4023c806
3ffffeb0: 3fff1ea4 3fff1eb0 00000006 3fff1cb8
3ffffec0: 3fff3234 3fff1cc0 3fff1cbc 40211643
3ffffed0: 3fff3234 3fff1cc0 3fff1cbc 402404ec
3ffffee0: 00000550 3e2ba8c0 00000010 00000550
3ffffef0: 00000010 00000000 3ffeb194 40107368
3fffff00: 00000000 00000000 3ffeb1ad 3fff1ea8
3fffff10: 3ffeab4a 3fff1eb0 3fff3644 4023e9f1
3fffff20: 002c4288 3fff2364 3fff2364 3ffee7b0
3fffff30: 00000000 3fff3644 0000001c 3fff2364
3fffff40: 3ffeab3c 00000000 3fff3644 4023dded
3fffff50: 2a358f3e 00000001 4010522e 3ffeddc8
3fffff60: 3ffed320 40101bd5 00000000 00000005
3fffff70: 00000002 40101967 00000002 00000000
3fffff80: 401019a2 00000002 00000000 3fff23ec
3fffff90: 3fffdc80 00000000 3fff3644 40232863
3fffffa0: 40000f49 3fffdab0 3fffdab0 40000f49
<<<stack<<<

ets Jan 8 2013,rst cause:2, boot mode:(3,6)

load 0x4010f000, len 1384, room 16 tail 8 chksum 0x2d csum 0x2d v3de0c112 ~ld

After that ESP restarts normal (at least this is better than before, as it doesn't produce a csum error anmore)

Attached you find a log taken with.. mosquitto_sub -h localhost -t homie/test1/# -u myuser -P mypass -v > test.log

...using your shell script.

test.zip

ghost commented 7 years ago

base64 (=default, if not set):

OTA_FIRMWARE_BASE64 = True 

binary:

OTA_FIRMWARE_BASE64 = False

I'm afraid I can't help much with your exception. You could try locating the exact code position where it crashes by looking up addresses contained in the exception dump from a map file or a listing generated using xtensa-objdump. I have never done that myself for ESP, but I know that other people have, so it is possible.

Or add extra debug logs around BootNormal.cpp:L413-L424 to see how far it gets before it dies.

Just guessing... Are you sure your power supply is clean? Maybe there's another board you could try?

Or, start over from a simple Homie sketch. Something like:

#include <Homie.h>

void setup() {
    Serial.begin(115200);
    Homie.setup();
}

void loop() {
    Homie.loop();
}
Gulaschcowboy commented 7 years ago

Hi @mrpace2

I guess you misinterpreted the logs. After your last commits it doesn't crash anymore after firmware is copied to "hot" flash area, instead it crashes before. So at least the ESP is not rendered unusable anymore. It crashes, resets and starts old firmware. That was different before todays rev. So my non programmer-impression is: you/ @marvinroger changed something today, that causes a slightly different behaviour/issue than before.

About OTA in general: OTA worked for me before the base64-related changes in homie-esp8266 about ~3 weeks ago.

About power supply/ESP type: I have tested with 2 node MCUs and a ESP07 that I have with me right now. (yes, I always travel with ESPs :-) ) Node MCUs are powered by built-in USB, 07 is powered by an external power supply and a linear voltage regulator with a beefy capacitor built in. So it is very unlikely to be a power supply or device issue.

Also 2 Sonoff relays at home have same problem since above mentioned b64 changes. (Couldn't test them with today's rev). Sketch is almost bare minimal, actually it is the example provided by Marvin. (sonoff with button)

All ESPs run stable in normal operation.

About xtensa-objdump: Don't know if I'm capable of doing it or being able to interpret results...

About adding code in BootNormal.cpp: That's also beyond my skills I guess

:-(

Thanks a lot Alex

ghost commented 7 years ago

@Gulaschcowboy I did notice that the behavior changed for you and I did notice in your log that your app rebooted with the old md5. The location where I suggested to add debug output is before the reboot, for a reason ;-)

How can I duplicate your problem here? As I said, it worked fine when I tried your previous firmware a few hours ago. Do you have a repo I could fork? Would you be open to sharing your latest binary again so I can try that one out?

What toolchain are you using?

Thanks for your help.

ckrey commented 7 years ago

I think I got over the problem now:

a) updated Arduino to 1.6.12 b) switched to OTA_FIRMWARE_BASE64 False (binary)

Loaded ESP via Arduino and USB once Can now repeatedly upgrade or downgrade versions

c) switched back to OTA_FIRMWARE_BASE64 True (BASE64)

✴ OTA available (checksum d5f35e99e9df82360a91879400820e98)
↕ OTA started
Triggering OTA_STARTED event...
Firmware is base64-encoded
Receiving OTA firmware (1056/350736)...
<snip>
Receiving OTA firmware (350736/350736)...

Exception (3):
epc1=0x4000bf64 epc2=0x00000000 epc3=0x00000000 excvaddr=0x4024a353 depc=0x00000000

ctx: sys 
sp: 3ffffad0 end: 3fffffb0 offset: 01a0

>>>stack>>>
3ffffc70:  3ffef5e0 3fff070c 3ffffcac 40219aee  
3ffffc80:  4024a353 3ffef5e0 3ffffcb0 4010068c  

see jpmens/homie-ota#44

ghost commented 7 years ago

@ckrey Thanks for your help. So, still something wrong with base64... ☹️ Maybe the issue only happens with Arduino-built firmware? (I'm using PlatformIo). @Gulaschcowboy what is your build environment?

I'm just reading up on ESP8266 exceptions. Exception 3 is a LoadStoreErrorCause which seems to be what data aborts are with ARM or page faults with x86 (roughly). epc1 and excvaddr both contain addresses. Presumably, one is the location of the opcode where it happened and the other one is the address that was accessed. With all the exceptions reported here, epc1 seems to be 0x4000bf64 and excvaddr varies around 0x4024a3XX. excvaddr could be the opcode location.

Can you please generate a map file matching your firmware and look up addresses 0x4000bf64 and 0x4024a353 in the map file? In order to generate a map file, please add link option

 -Wl,-Map,firmware.map

to your LDFLAGS. I don't know how this is done in Arduino. In PlatformIo, you would add

build_flags = -Wl,-Map,firmware.map

to your platformio.ini file. The next linker run will then generate a file firmware.map which contains mappings of symbolic names to addresses. You may also post your map file here.

Gulaschcowboy commented 7 years ago

@mrpace2 Yes, Arduino IDE 1.6.12, The sketch doesn't seem to matter. As said, I even have the problem with the Marvins Sonoff button sketch. Build flags: I will have to try tonight.

marvinroger commented 7 years ago

To decode the stack trace easily:

Please paste the decoded stack trace, not the one with the addresses as we cannot do anything without the binaries. 😉

ghost commented 7 years ago

To give it a try, I installed Arduino 1.6.12, arduino/esp8266 and Homie/develop + dependencies on my machine. Same thing. It just doesn't want to crash. It works fine with base64 or binary and with homie-ota or manual publishing (my little ota-update.sh script).

I attached annotated serial and mqtt logs showing 4 updates:

All updates worked as expected. None of them raised an exception.

Thanks to @ckrey for fixing the read-boolean bug in homie-ota. I am using the latest version that includes your fix.

I also attached the bare-minimum sketch that I was testing with (I changed the Homie_setFirmware version number in between tests), as well as my Arduino build log, in case somebody wants to compare library version numbers. For the libraries that did not come with Arduino, I downloaded the latest ZIP files from github.

arduino-tests.zip

marvinroger commented 7 years ago

Following my latest post about EspException: you can use Arduino 1.6.12 with the following EspException zip : https://github.com/me-no-dev/EspExceptionDecoder/files/594809/EspExceptionDecoder-1.0.4.zip

Gulaschcowboy commented 7 years ago

Hi @marvinroger @mrpace2 ,

attached my decoded stacktrace. Hope it helps. Thanks a lot, Alex stacktrace.txt

marvinroger commented 7 years ago

@Gulaschcowboy thanks, did you paste the stack trace into the decoder from Exception (3): to <<<stack<<<?

Gulaschcowboy commented 7 years ago

no, just from >>>>stack to <<<<stack. attached full paste:

stacktrace.txt

marvinroger commented 7 years ago

@Gulaschcowboy can you try, just like @ckrey, to set OTA_FIRMWARE_BASE64 to False in homie-ota? Just to confirm your issue also happens with base64 only?

Gulaschcowboy commented 7 years ago

@marvinroger Tried that before, tried again now. result: ✴ OTA available (checksum 471f6229335cbb136aed58187dfc5c89) Receiving OTA firmware but not requested, skipping...

Exception (3): epc1=0x4000bf64 epc2=0x00000000 epc3=0x00000000 excvaddr=0x40249e7f depc=0x00000000

BTW, the stack traces before were "created" using @mrpace2 's script

ghost commented 7 years ago

@Gulaschcowboy thanks for providing the stack trace.

@marvinroger I am looking into it. It crashes while exiting _publishOtaStatus(400, PSTR("NOT_REQUESTED")); called from BootNormal.cpp:299

Gulaschcowboy commented 7 years ago

@marvinroger @mrpace2 I don't know, if it help or confuses more:

Triggering MQTT_PACKET_ACKNOWLEDGED event (packetId 25)... ✴ OTA available (checksum 5e619432e444f54f608378e074298435) Receiving OTA firmware but not requested, skipping...

Exception (3): epc1=0x4000bf64 epc2=0x00000000 epc3=0x00000000 excvaddr=0x4024b247 depc=0x00000000

So my non-developer guess/impression: We have 2 issues:

Edit: Yes confirmed, all with PIO created firmware.bin-> I increased version number, recompiled, uploaded, did OTA -> success OTA again while no update needed -> crash

ghost commented 7 years ago

@Gulaschcowboy When you generated the stack trace, $implementation/ota/firmware was rejected because the new firmware's checksum matched the running firmware. You probably received a 304 for the $implementation/ota/checksum message before. Homie was about to send you a 400 NOT_REQUESTED when it crashed inside String:concat(char) which was called from BootNormal::_publishOtaStatus.

Sorry, I'm afraid I don't fully understand what you were trying to say in your last message. But, anyway, one way or another, the function that crashes your system will end up being called, either to reject OTA firmware or to report success or failure at the end of an OTA update. And it doesn't make a difference here if the firmware blob is binary or base64-encoded or whether you used homie-ota or something like my script.

I committed an attempt to workaround the issue to my own homie-esp8266 fork on Github. Would you mind trying out this workaround on your system? For this test, please replace your current homie-esp8266 library with https://github.com/mrpace2/homie-esp8266/archive/patch-trying-to-fix-issue-204.zip in your Arduino, or use https://github.com/mrpace2/homie-esp8266.git@fcfdc66 with your PlatformIo library configuration.

Please report the result. If it still fails, please send another stack trace. A matching log (serial or mqtt) would be greatly appreciated.

Thanks again for your help.

Gulaschcowboy commented 7 years ago

@mrpace2 What I tried to say is: I see different behaviors, when I use different steps in both IDEs. Somehow we have two different issues:

Tests I did:

ENV is always like this:

Test1 with binary compiled with arduino:

Test 2 with binary compiled with PIO:

Edit: yes, base64 or not doesn't make a difference. Also your script doesn't make a difference, as it always uses b64. Where I disagree is your sentence:

But, anyway, one way or another, the function that crashes your system will end up being called, either to reject OTA firmware or to report success or failure at the end of an OTA update.

Because the function to report success after update seems to work in PIO but not in Arduino. Otherwise it wouldn't be explainable, that the OTA succeeds in PIO-based firmwares.

Did you get my point now? Will do your suggested tests now.

Thank you very much!

Gulaschcowboy commented 7 years ago

@mrpace2 I tried with your library now, results:

Compiled with Arduino, b64 disabled, your patched homie lib:

  1. OTA with version1.0.0 over serially flashed 1.0.0 -> crash
  2. OTA with version 1.0.1 over serially flashed 1.0.0 -> it works (why, don't ask me, see my post before)
  3. OTA with version 1.0.1 over OTA-flashed 1.0.1 -> crash

See logs, traces an decoded traces below. stack-arduino-v100tov100.txt stack-arduino-v100tov101.txt stack-arduino-v101tov101.txt

Edit: Now with base64 set to true:

  1. OTA with version 1.0.1 over OTA-flashed 1.0.0 -> crashes now too stack-arduino-v100tov101-BASE64.txt

Setting back to false again:

  1. OTA with version 1.0.1 over OTA-flashed 1.0.0 -> works
ghost commented 7 years ago

Thanks.

Regarding (1) and (3), is not the 304 response to $implementation/ota/checksum that crashes. It's the 400 response to the next message ($implementation/ota/firmware) that crashes. FYI, homie-ota "blindly" sends firmware no matter what homie-esp8266 responds to the checksum.

What I am trying to say is that all these crashes are most likely caused by the same bug that triggers somewhere inside String::concat called from _publishOtaStatus because _publishOtaStatus is called no matter what the result of the OTA update is. I find it strange that _publishOtaStatus sometimes works and sometimes doesn't. My first attempt to fix the issue obviously didn't help. Will have to come up with another solution.

marvinroger commented 7 years ago

I am not able to reproduce this behavior, so it will be hard to debug. @Gulaschcowboy can you please comment these lines and try again with a scenario that normally fails?

ghost commented 7 years ago

@marvinroger Do you have experience reading those stack traces? The stack traces seems to interleave two execution contexts? For example, in stack-arduino-v101tov101.txt roughly every other stack frame seems to come from some unrelated activity:

[?]     0x40249e87: sleep_reset_analog_rtcreg_8266 at ?? line ?
[Homie] 0x40219cef: String::concat(__FlashStringHelper const*) at /home/atze/.arduino15/packages/esp8266/hardware/esp8266/2.3.0/cores/esp8266/WString.cpp line 519
[?]     0x40249e87: sleep_reset_analog_rtcreg_8266 at ?? line ?
[?]     0x40249e87: sleep_reset_analog_rtcreg_8266 at ?? line ?
[Homie] 0x4020bbb8: HomieInternals::BootNormal::_publishOtaStatus(int, char const*) at /home/atze/Dokumente/arduino/libraries/homie-esp8266/src/Homie/Boot/BootNormal.cpp line 52
[?]     0x40218f3c: Print::println() at /home/atze/.arduino15/packages/esp8266/hardware/esp8266/2.3.0/cores/esp8266/Print.cpp line 64
[Homie] 0x4020c3a0: HomieInternals::BootNormal::_onMqttMessage(char*, char*, AsyncMqttClientMessageProperties, unsigned int, unsigned int, unsigned int) at /home/atze/Dokumente/arduino/libraries/homie-esp8266/src/Homie/Boot/BootNormal.cpp line 299
[?]     0x4021c21c: Print::write(unsigned char const*, unsigned int) at /home/atze/.arduino15/packages/esp8266/hardware/esp8266/2.3.0/cores/esp8266/Print.cpp line 64
[?]     0x40225f48: ieee80211_output_pbuf at ?? line ?

Since the other context is also calling String, maybe something in String is non-reentrant? I'm thinking about replacing String inside _publishOtaStatus with a pre-allocated character buffer, something like

_otaStatusMessage = std::unique_ptr<char[]>(new char[longestOtaStatusMessage]);

What do you think?

Gulaschcowboy commented 7 years ago

(1)(3) crashes directly (2) works when b64 is set to false (4) same as (2) but b64 set to true and it crashes

So it may be the same function as you say, but it behaves differently in all mentioned steps. At least it is consistently bad with base64 :-(

Gulaschcowboy commented 7 years ago

@marvinroger you mean doing / if (info) { payload += ' '; payload += info; } / ?

marvinroger commented 7 years ago

@mrpace2 not really, I was not able to find some viable docs about the stack traces. I was also thinking about replacing the String with a char array. You don't even need a unique_ptr, just instantiate a char array in the stack in _publishOtaStatus

marvinroger commented 7 years ago

@Gulaschcowboy yes

ghost commented 7 years ago

Or, even more radicale:

uint16_t BootNormal::_publishOtaStatus(int status, const char* info) {
  /*
  String payload(status);
  if (info) {
    payload += F(" ");
    payload += info;
  }
  return Interface::get().getMqttClient().publish(_prefixMqttTopic(PSTR("/$implementation/ota/status")), 1, true, payload.c_str());
  */
  return 0;
}

uint16_t BootNormal::_publishOtaStatus_P(int status, PGM_P info) {
  // return _publishOtaStatus(status, String(info).c_str());
  return 0;
}
ghost commented 7 years ago

@marvinroger I'll work on that.

Gulaschcowboy commented 7 years ago

@marvinroger @mrpace2 Commenting out these lines works! base64 set to false:

  1. hash to same hash -> just gets ignored
  2. hash to different hash -> updates, reboots

base64 set to true:

  1. hash to same hash -> just gets ignored
  2. hash to different hash -> updates - crash
ghost commented 7 years ago

@Gulaschcowboy Thx

Gulaschcowboy commented 7 years ago

@marvinroger @mrpace2 Haha, who is supporting whom? So all thanks to both of you