nekromant / esp8266-frankenstein

Alternative firmware for ESP8266 modules
318 stars 71 forks source link

TFTP update failed #60

Open darkbyte-ru opened 9 years ago

darkbyte-ru commented 9 years ago

After updating antares and SDK make a new firmware and write it to esp via tftp, all was ok, esp bootup.

Memory Layout:
data  : 0x3ffe8000 ~ 0x3ffe8f90, len: 3984
rodata: 0x3ffe8f90 ~ 0x3ffeb4c0, len: 9520
bss   : 0x3ffeb4c0 ~ 0x3fff4118, len: 35928
heap  : 0x3fff4118 ~ 0x3fffc000, len: 32488

 Flash layout:
Firmware size is     221 KiB (0x375b0)
Filesystem starts at 224 KiB (0x38000)

Then run tftp upgrade from new firmware and it's failed:

TFTP: Downloading tftp://192.168.1.1/antares.rom, Starting flash block 56

 ########################################################
 TFTP done, 226720 bytes transferred
Committing update in 2 seconds

Commiting update, 56 sectors 226720 bytes
##########################Fatal exception (0):
epc1=0x40219a3c, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
Fatal exception (0):
epc1=0x40219318, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
Fatal exception (0):
epc1=0x40219318, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000

Last message repeated N times before esp stop responding.

The same firmware successfully flushed via XTCOM.

susisstrolch commented 9 years ago

I've never got tftp working inside a telnet session.

Last famous words: "Committing update in 2 seconds" Did reset 1-5min after the commit message, but only got trash on serial console, telnet/ping wasn't avail anymore.

Rewriting the flash via esptool works - so it doesn't look like esp config was borked by tftp...

On 03/24/2015 06:46 PM, DarkByte wrote:

After updating antares and SDK make a new firmware and write it to esp via tftp, all was ok, esp bootup.

|Memory Layout: data : 0x3ffe8000 ~ 0x3ffe8f90, len: 3984 rodata: 0x3ffe8f90 ~ 0x3ffeb4c0, len: 9520 bss : 0x3ffeb4c0 ~ 0x3fff4118, len: 35928 heap : 0x3fff4118 ~ 0x3fffc000, len: 32488

Flash layout: Firmware size is 221 KiB (0x375b0) Filesystem starts at 224 KiB (0x38000) |

Then run tftp upgrade from new firmware and it's failed:

|TFTP: Downloading tftp://192.168.1.1/antares.rom, Starting flash block 56

######################################################## TFTP done, 226720 bytes transferred Committing update in 2 seconds

Commiting update, 56 sectors 226720 bytes ##########################Fatal exception (0): epc1=0x40219a3c, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000 Fatal exception (0): epc1=0x40219318, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000 Fatal exception (0): epc1=0x40219318, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000 |

Last message repeated N times before esp stop responding.

The same firmware successfully flushed via XTCOM.

— Reply to this email directly or view it on GitHub https://github.com/nekromant/esp8266-frankenstein/issues/60.

nekromant commented 9 years ago

Hm... I think I know why it doesn't work over telnet. It's easy to fix, but you won't get any progress report whatsoever when synchronising data. @darkbyte-ru Does it work via serial? (e.g. NOT telnet)

darkbyte-ru commented 9 years ago

@nekromant, I use serial to communicate with esp.

nekromant commented 9 years ago

@darkbyte-ru Got it, hold on will fetch the board and see if it works for me.

susisstrolch commented 9 years ago

I know that progress bar isn't possible because of blocked wdog and INTs. So a short "Starting commit, rebooting in XXX sec" would be sufficient via telnet.

On 03/24/2015 07:10 PM, Andrew wrote:

Hm... I think I know why it doesn't work over telnet. It's easy to fix, but you won't get any progress report whatsoever. @darkbyte-ru https://github.com/darkbyte-ru Does it work via serial (e.g. NOT telnet)

— Reply to this email directly or view it on GitHub https://github.com/nekromant/esp8266-frankenstein/issues/60#issuecomment-85627312.

nekromant commented 9 years ago

Bad news. tftp updates crash as well as spi_wipe command. Looks like some spi flash code is now calling something that's NOT in RAM. Looks like we're in deep trouble with 1.0 SDK. Unless someone pokes Espressif to fix their blobs we're mostly out of luck with 1.0 SDK this way. Ideas welcome.

susisstrolch commented 9 years ago

That's what I see when trying the tftp via serial:

Commiting update, 54 sectors 220484 bytes #########################Fatal exception (0): epc1=0x40218250, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000 Fatal exception (0): epc1=0x40217bc4, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000 Fatal exception (0): epc1=0x40217bc4.......

The latter one repeats until the module hangs... 0x40218250 is "system_get_time" 0x40217bc4 is "system_get_boot_version"

Module must be reflashed afterwards...

By reordering the "ets_wdt_disable()" / "ets_wdt_enable()" and commenting out the hash printout I see exactly the same when updating during a telnet session. Maybe only an additional "disable ints" is required?

nekromant commented 9 years ago

@susisstrolch I already tried running os_intr_lock() before doing the actual update, doesn't help much, things are the same. What's ever weirder, epc1 addresses of exception appear to be random on my board.

susisstrolch commented 9 years ago

Also, encapsulating in system_upgrade_flag_set(UPGRADE_FLAG_START) system_upgrade_flag_set(UPGRADE_FLAG_FINISH) system_upgrade_reboot()

doesn't work. Exactly same - only differenc is, that wdt reset shows up after all...

I'll ask Espressif...

On 03/27/2015 02:45 PM, Andrew wrote:

@susisstrolch https://github.com/susisstrolch I already tried running os_intr_lock() before doing the actual update, doesn't help much, things are the same. What's ever weirder, epc1 addresses appear to be random on my board.

— Reply to this email directly or view it on GitHub https://github.com/nekromant/esp8266-frankenstein/issues/60#issuecomment-86944129.

nekromant commented 9 years ago

@susisstrolch Thanks! Hope they can give us some insight of what's happening inside the blobs. For I don't feel like spending a romantic evening with a disassembler right now.

susisstrolch commented 9 years ago

The SDK docu about "system_upgrade_flag_set" is completely misleading, because of the "If you using 'spi_flash_write' to upgrade firmware yourself, this flag need to be set to 'UPGRADE_FLAG_FINISH', then call 'system_upgrade_reboot' to reboot to run new firmware."

On 03/27/2015 03:06 PM, Andrew wrote:

@susisstrolch https://github.com/susisstrolch Thanks! Hope they can give us some insight of what's happening inside the blobs. For I don't feel like spending a romantic evening with a disassembler right now.

— Reply to this email directly or view it on GitHub https://github.com/nekromant/esp8266-frankenstein/issues/60#issuecomment-86950079.

susisstrolch commented 9 years ago

And what's about the idea to go the same route than Espressif - using the bootloader-blob 1.3 and providing two different roms (f.e. antares.rom1, antares.rom2), linked for the two (by Espressif) supported flash areas? That would at least prevent hustle with os_intr_lock() and other undocumented functions.

Maybe the os_intr_lock() should go into the loop body, just in case one of those funcs will reenable ints..

nekromant commented 9 years ago

@susisstrolch I thought about that. However it all looks pretty nasty. If those things didn't change since first versions of SDK (and we still have a flash split, where one application updates the other mirror and reboots to it) we'll be limiting ourselves to 256KiB flash (for rom1 and rom2) not saying about complicating the build process. If we manage to bring tftp update back to life, with heatshrink compression we can get ~300-400KiB of flash without sacrificing OTA update functionality.

susisstrolch commented 9 years ago

Had a quick look with "objdump -dtr libmain.a"... Guess what - spi_flash_erase_sector calls system_get_time on entrance and on leave... And system_get_time accesses the irom_text section...

nekromant commented 9 years ago

@susisstrolch Looks like someone has been doing some profiling and forgot to remove portions of that profiling code. Do you think you can ask Espressif to kindly fix that in further SDK releases?

susisstrolch commented 9 years ago

Just verifying against 0.96... For sure!

susisstrolch commented 9 years ago

Hmm... strange - sure it worked with 0.96? Because system_get_time, which is located in irom, is called by nearly any spi_flash function.

darkbyte-ru commented 9 years ago

Previous version of SDK was 0.9.6? I thought 0.9.5.

susisstrolch commented 9 years ago

So it was introduced with the 0.9.6 beta... Definitly the 0.9.5 spiflash* does NOT call system_get_time!

On 03/28/2015 12:52 PM, DarkByte wrote:

Previous version of SDK was 0.9.6? I thought 0.9.5.

— Reply to this email directly or view it on GitHub https://github.com/nekromant/esp8266-frankenstein/issues/60#issuecomment-87215967.

susisstrolch commented 9 years ago

Hmm... an unsane workaround would be to remove system_get_time out of libmain and provide an own one, until SDK is fixed...

darkbyte-ru commented 9 years ago

Maybe simple nop system_get_time calls in libmain?

susisstrolch commented 9 years ago

Jep - 'cause it only contains code w/o relocation:

40218250 : 40218250: fffe31 l32r a3, 40218248 <system_timer_reinit+0x18> ; =0x3ffee874 40218253: fffe21 l32r a2, 4021824c <system_timer_reinit+0x1c> ; =0x3ff20a00 40218256: 0338 l32i.n a3, a3, 0 40218258: 0020c0 memw 4021825b: 802222 l32i a2, a2, 0x200 4021825e: 223a add.n a2, a2, a3 40218260: f00d ret.n

On 03/28/2015 01:19 PM, DarkByte wrote:

Maybe simple nop system_get_time calls in libmain?

— Reply to this email directly or view it on GitHub https://github.com/nekromant/esp8266-frankenstein/issues/60#issuecomment-87217839.

susisstrolch commented 9 years ago

these are the bytepos in libmain to patch with a nop (0x0020f0 -> f0 20 00):

0xa29c, 0xa2be, 0xa308, 0xa32e, 0xa369, 0xa38f

ensure, that those locations contain "c0 00 00" before patching... Affected are: spi_flash_read, spi_flash_write, spi_flash_erase_sector

darkbyte-ru commented 9 years ago

@susisstrolch, seems not working

Commiting update, 57 sectors 232928 bytes
############################Fatal exception (0):
epc1=0x4021b194, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
Fatal exception (28):
epc1=0x4021aa7a, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000004, depc=0x00000000
susisstrolch commented 9 years ago

Are you using the proper libmain.a? Have a "xtensa-lx106-elf-objdump -drt images/antares.elf >tmp/antares.S". Search for "spi_flash_erase_sector>:" It should look like:

401007f8 : 401007f8: f0c112 addi a1, a1, -16 401007fb: 0129 s32i.n a2, a1, 0 401007fd: 2109 s32i.n a0, a1, 8 401007ff: fffc01 l32r a0, 401007f0 <spi_flash_write_status+0x34> 40100802: 0020f0 nop 40100805: ffc401 l32r a0, 40100718 <vPortInitialiseBlocks+0x54> 40100808: 0000c0 callx0 a0

If not, you've patched the wrong libmain.a... libmain.a is two times in esp-open-sdk:

./esp-open-sdk/xtensa-lx106-elf/xtensa-lx106-elf/sysroot/usr/lib/libmain.a ./esp-open-sdk/esp_iot_sdk_v1.0.0/lib/libmain.a

./esp-open-sdk/xtensa-lx106-elf/xtensa-lx106-elf/sysroot/usr/lib/libmain.a is the right one to patch...

darkbyte-ru commented 9 years ago
401007f8 <spi_flash_erase_sector>:
401007f8:   f0c112          addi    a1, a1, -16
401007fb:   0129        s32i.n  a2, a1, 0
401007fd:   2109        s32i.n  a0, a1, 8
401007ff:   fffc01          l32r    a0, 401007f0 <spi_flash_write_status+0x34>
40100802:   0000c0          callx0  a0

Seems the same, but "callx0 a0" it is call system_get_time. Must be "0020f0 nop". I'm really patch wrong file. At first I patch "/esp-open-sdk/esp_iot_sdk_v1.0.0/lib/libmain.a", but for linking was used "/esp-open-sdk/xtensa-lx106-elf/xtensa-lx106-elf/sysroot/usr/lib/libmain.a".

darkbyte-ru commented 9 years ago

Now update commits okay, but module does not starts after that.

Commiting update, 57 sectors 232928 bytes
#########################################################
Firmware update completed (57 sectors), rebooting

 ets Jan  8 2013,rst cause:4, boot mode:(3,6)

wdt reset
ets_main.c
susisstrolch commented 9 years ago

So it's bricked? Does it work after a pwr cycle?

darkbyte-ru commented 9 years ago

@susisstrolch, yep, bricked, after reset it send only

ets Jan  8 2013,rst cause:2, boot mode:(3,6)

ets_main.c

at baud 74880.

susisstrolch commented 9 years ago

Hmm, here it doesn't work also.. I dumped the flash after the tftp

esptool.py -p /dev/ttyUSB0 -b $(( 4 * 115200)) dump_mem 0x40100000 $((512 * 1024)) flash512-40100000.rom

and got a pretty strange result:

xxd images/antares.rom |head 0000000: e903 0000 0400 1040 0000 1040 a460 0000 .......@...@.`.. 0000010: 040e 0040 12c1 f009 0105 3000 01fd ffc0 ...@......0..... 0000020: 0000 0801 12c1 100d f000 0000 00fe ef3f ...............? 0000030: c8ae fe3f ecae fe3f cc24 0040 12c1 f009 ...?...?.$.@....

xxd flash512-40100000.rom |head 0000000: 040e 0040 12c1 f009 0105 3000 01fd ffc0 ...@......0..... 0000010: 0000 0801 12c1 100d f000 0000 00fe ef3f ...............? 0000020: c8ae fe3f ecae fe3f cc24 0040 12c1 f009 ...?...?.$.@....

Any idea how to read the whole flash with esptool? Trying to read from 0x00 doesn't work...

nekromant commented 9 years ago

@susisstrolch Something tells me they might be using system_get_time() there for a reason. Perhaps swapped busy-wait for a system_get_time(). This way if you replace system_get_time with a nop it might misbehave, e.g. timeout immediately and therefore garble the flash contents in weird way.

susisstrolch commented 9 years ago

Crosschecked again and compared spi_flash-0.9.5 vs spi_flash-1.0. Onyl difference is the additional call to system_get_time. And it's used as void function...

Other variant - instead of patching: ar -d libmain.a spi_flash.o ar -x libmain-v0.95.a spi_flash.o ar -r libmain.a spi_flash.o

Will try to call system_restart() instead of ets_wdt_enable();...

On 03/29/2015 01:35 PM, Andrew wrote:

@susisstrolch https://github.com/susisstrolch Something tells me they might be using system_get_time() there for a reason. Perhaps swapped busy-wait for a system_get_time(). This way if you replace system_get_time with a nop it might misbehave, e.g. timeout immediately and therefore garble the flash contents in weird way.

— Reply to this email directly or view it on GitHub https://github.com/nekromant/esp8266-frankenstein/issues/60#issuecomment-87399792.

darkbyte-ru commented 9 years ago

Answer from espressif team:

Erasing and rewriting in the running program area are not our recommended approach and there are potencial risks.

We recommend you to use boot.bin+user1.bin/user2.bin and you can refer to our "SDK User Manual_V1.0.0" for details.
susisstrolch commented 9 years ago

Got exactly the same one with addition "Next,we will optimize what may lead to crash problems arising in the process of use."

On 03/31/2015 08:46 AM, DarkByte wrote:

Answer from espressif team:

|Erasing and rewriting in the running program area are not our recommended approach and there are potencial risks.

We recommend you to use boot.bin+user1.bin/user2.bin and you can refer to our "SDK User Manual_V1.0.0" for details. |

— Reply to this email directly or view it on GitHub https://github.com/nekromant/esp8266-frankenstein/issues/60#issuecomment-87966071.

susisstrolch commented 9 years ago

The problem seems to be the system restart. Have a look at https://github.com/susisstrolch/esp8266-frankenstein/commit/1bd262354312189d88d117b170444e2e7d74f2e2 for my approach... It results in

frankenstein > tftp TFTP: Downloading tftp://192.168.254.20/volume1/public/tftp/antares.rom, Starting flash block 54

###################################################### TFTP done, 220484 bytes transferred Committing update in 2 seconds

Commiting update, 54 sectors 220484 bytes 111111111111111111111111111111111111111111111111111111 Firmwa▒ ets Jan 8 2013,rst cause:4, boot mode:(3,6)

wdt reset ets_main.c

So, for me it looks like the bootloader assumes a given structure, which is set by serial upload via esptool, but not by simple copy.

Unfortunately, the update related functions from SDK are based in irom, so, at least without elf patching, we can't use the suggested methods "system_upgrade_flag_set()", "system_upgrade_reboot()"

On 03/31/2015 08:46 AM, DarkByte wrote:

Answer from espressif team:

|Erasing and rewriting in the running program area are not our recommended approach and there are potencial risks.

We recommend you to use boot.bin+user1.bin/user2.bin and you can refer to our "SDK User Manual_V1.0.0" for details. |

— Reply to this email directly or view it on GitHub https://github.com/nekromant/esp8266-frankenstein/issues/60#issuecomment-87966071.

nekromant commented 9 years ago

@susisstrolch Bad news. I didn't expect Espressif to really fix the problem - they don't look like the people who care about anything but their own weird approach. I'll give their bootloader thingie a spin, may be with heatshrink it won't be a huge overhead. Basically we need a basic app to jump to that will only do the syncronisation. Regarging the .rom - it looks really weird. I remember I compared flash contents with .rom after flashing and they were byte to byte identical.

darkbyte-ru commented 9 years ago

@nekromant, some good news:

As determined by our internal engineers,this problem can not be treated as a bug for the moment.

However, we will modify related functions to support your systems and new version of SDK will be released on this Friday.

"This not a bug, but we will fix it" :D

susisstrolch commented 9 years ago

Don't go directly to "Start"... :)

On 03/31/2015 10:17 AM, DarkByte wrote:

@nekromant https://github.com/nekromant, some good news:

|As determined by our internal engineers,this problem can not be treated as a bug for the moment.

However, we will modify related functions to support your systems and new version of SDK will be released on this Friday. |

"This not a bug, but we will fix it" :D

— Reply to this email directly or view it on GitHub https://github.com/nekromant/esp8266-frankenstein/issues/60#issuecomment-87991602.

darkbyte-ru commented 9 years ago

I think this is time to make own bootloader with blackjack and tftp support. Or, maybe arduino like for a first time.

nekromant commented 9 years ago

@darkbyte-ru o_O Either they are reading this very thread or I'm just paranoid. Nevertheless - good it will be fixed. Maybe I was even wrong to bash them like that. Regarding the bootloader - I think the easiest and most portable - will be adding a linux-kernel-like piggyback decompressor to updates. e.g.

Any comments?

susisstrolch commented 9 years ago

So we're at the same point than now: the update is written to the flash, and we have to restart in a clean way. Problem is, we don't know what else has to be done after flash is written. A simple "software_reset()" isn't sufficient - it leads to

ets Jan 8 2013,rst cause:3, boot mode:(3,6)

ets_main.c

All others (system_reboot(), systemupgrade*) can't be run because we don't have the relocated addresses...

In fact, we have to find out what the rom-based loader does after flashing...

On 03/31/2015 11:53 AM, Andrew wrote:

@darkbyte-ru https://github.com/darkbyte-ru o_O Either they are reading this very thread or I'm just paranoid. Nevertheless - good it will be fixed. Maybe I was even wrong to bash them like that. Regarding the bootloader - I think the easiest will be adding a linux-kernel-like piggyback decompressor to updates. e.g.

  • We make a super-simple loader that /only/ uses stack.
  • Compile it with -fPIC,
  • Append the compressed update image to it
  • When we've loaded the update to the internal flash - we make a jump there. The loader then takes care to decompress the update and write it to the application section.

Comments?

— Reply to this email directly or view it on GitHub https://github.com/nekromant/esp8266-frankenstein/issues/60#issuecomment-88019033.

nekromant commented 9 years ago

@susisstrolch Let me try it again next weekend. I don't think the ROM loader does any super-special magic there.

susisstrolch commented 9 years ago

Ok - found the reason... recv_cb reads the image to "u->fblock * SPI_FLASH_SEC_SIZE" (which is 54 in my environment) but commit_handler reads from (64i) \ SPI_FLASH_SEC_SIZE... I'll extend the update_server struct with the start block of recv flash, so it can be handled dynamicly in commithandler...

nekromant commented 9 years ago

My bad, looks like I've left some hardcode after I've implemented the end-of-firware marker.

susisstrolch commented 9 years ago

New sdk is out: esp_iot_sdk_v1.0.1_b1 A quick look (and also the readme) shows that the unnecessary system_get_time calls are removed from spi_flash functions. So I only have to push it into the open sdk to do further tests...

dirkx commented 9 years ago

Am trying to get 1.0.1 wired in to see if that resolves the issue (getting a hang similar to above on UART). Pulling in this version seems to require two changes; moving the rom0 segment back to 0B00 and changing the include path.

That makes things compile without ado -- but on flash & reboot (though UART) a hang is observed just after it prints out the Memory map on reset/reboot.

Any suggestions as to what I may be missing ? Or where to look.

Dw.

diff -ur ./ld/eagle.app.v6.ld sdk/esp_iot_sdk_v1.0.1/ld/eagle.app.v6.ld
@@ -5,8 +5,7 @@
   dport0_0_seg :                        org = 0x3FF00000, len = 0x10
   dram0_0_seg :                         org = 0x3FFE8000, len = 0x14000
   iram1_0_seg :                         org = 0x40100000, len = 0x8000
-  irom0_0_seg :                         org = 0x40240000, len = 0x3C000
+  irom0_0_seg :                         org = 0x4020B000, len = 0x3C000 
 }
@@ -179,5 +178,4 @@
 }

 /* get ROM code address */
-INCLUDE "../ld/eagle.rom.addr.v6.ld"  
+INCLUDE "antares/src/arch/esp8266/ld/eagle.rom.addr.v6.ld"