oarcher / piotech

my custom components for esphome
18 stars 10 forks source link

ota_http crash if out of memory. Implement safe mode ? #15

Closed tomatensaus closed 10 months ago

tomatensaus commented 11 months ago

Having upgraded to the latest esphome (2023.10.6) I noticed that I am unable to get to the point after the GET

Running the old code that used to work for me is also failing at the same point.

Perhaps related to some esphome changes?

replaced my URL below

[15:27:42][D][ota_http:046]: Trying to connect to https://example.com/g.bin
[15:27:42][ 34064][V][HTTPClient.cpp:252] beginInternal(): url: https://example.com/g.bin
[15:27:42][ 34071][D][HTTPClient.cpp:263] beginInternal(): unexpected protocol: https, expected http
[15:27:42][ 34079][V][HTTPClient.cpp:252] beginInternal(): url: https://example.com/g.bin
[15:27:42][ 34088][D][HTTPClient.cpp:303] beginInternal(): protocol: https, host: example.com port: 443 url: /g.bin
[15:27:42][V][ota_http:061]: http begin successfull.
[15:27:42][VV][ota_http:065]: http client setReuse.
[15:27:42][V][ota_http:069]: http headers collected.
[15:27:42][ 34122][D][HTTPClient.cpp:598] sendRequest(): request type: 'GET' redirCount: 0
[15:27:42]
[15:27:42][ 34122][V][ssl_client.cpp:62] start_ssl_client(): Free internal heap before TLS 80468
[15:27:42][ 34129][V][ssl_client.cpp:68] start_ssl_client(): Starting socket
[15:27:42][ 34148][V][ssl_client.cpp:149] start_ssl_client(): Seeding the random number generator
[15:27:42][ 34150][V][ssl_client.cpp:158] start_ssl_client(): Setting up the SSL/TLS structure...
[15:27:42][ 34153][D][ssl_client.cpp:179] start_ssl_client(): WARNING: Skipping SSL Verification. INSECURE!
[15:27:42][ 34161][V][ssl_client.cpp:257] start_ssl_client(): Setting hostname for TLS session...
[15:27:42][ 34170][V][ssl_client.cpp:272] start_ssl_client(): Performing the SSL/TLS handshake...
[15:27:47]E (42691) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
[15:27:47]E (42691) task_wdt:  - loopTask (CPU 1)
[15:27:47]E (42691) task_wdt: Tasks currently running:
[15:27:47]E (42691) task_wdt: CPU 0: IDLE
[15:27:47]E (42691) task_wdt: CPU 1: IDLE
[15:27:47]E (42691) task_wdt: Aborting.
[15:27:47]
[15:27:47]abort() was called at PC 0x401184d4 on core 0
[15:27:47]
[15:27:47]
[15:27:47]Backtrace:0x40083a45:0x3ffbeb3c |<-CORRUPTED
WARNING Found stack trace! Trying to decode it
WARNING Decoded 0x40083a45: panic_abort at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_system/panic.c:402
tomatensaus commented 11 months ago

I commented out the web component and managed to get further. Seems like memory is the issue.

Do you have any strategy to disable some of the components when running the OTA? In reality your IOT device can shut down all services, update, and resume the work after the reboot.

tomatensaus commented 11 months ago
[16:29:24][ 72857][V][ssl_client.cpp:301] start_ssl_client(): Certificate verified.
[16:29:24][ 72860][V][ssl_client.cpp:316] start_ssl_client(): Free internal heap after TLS 59040
[16:29:24][ 72868][D][HTTPClient.cpp:1156] connect():  connected to example.com:443
[16:29:24][ 72876][V][ssl_client.cpp:369] send_ssl_data(): Writing HTTP request with 174 bytes...
[16:29:24][ 72944][V][HTTPClient.cpp:1250] handleHeaderResponse(): RX: 'HTTP/1.1 200 OK'
[16:29:24][ 72944][V][HTTPClient.cpp:1250] handleHeaderResponse(): RX: 'Date: Sun, 12 Nov 2023 14:29:19 GMT'
[16:29:24][ 72949][V][HTTPClient.cpp:1250] handleHeaderResponse(): RX: 'Server: Apache'
[16:29:24][ 72956][V][HTTPClient.cpp:1250] handleHeaderResponse(): RX: 'Last-Modified: Fri, 10 Nov 2023 14:01:04 GMT'
[16:29:24][ 72965][V][HTTPClient.cpp:1250] handleHeaderResponse(): RX: 'Accept-Ranges: bytes'
[16:29:24][ 72972][V][HTTPClient.cpp:1250] handleHeaderResponse(): RX: 'Content-Length: 1430848'
[16:29:24][ 72980][V][HTTPClient.cpp:1250] handleHeaderResponse(): RX: 'Keep-Alive: timeout=5, max=100'
[16:29:24][ 72988][V][HTTPClient.cpp:1250] handleHeaderResponse(): RX: 'Connection: Keep-Alive'
[16:29:24][ 72996][V][HTTPClient.cpp:1250] handleHeaderResponse(): RX: 'Content-Type: application/octet-stream'
[16:29:24][ 73005][V][HTTPClient.cpp:1250] handleHeaderResponse(): RX: ''
[16:29:24][ 73010][D][HTTPClient.cpp:1307] handleHeaderResponse(): code: 200
[16:29:24][ 73016][D][HTTPClient.cpp:1310] handleHeaderResponse(): size: 1430848
[16:29:24][ 73022][D][HTTPClient.cpp:628] sendRequest(): sendRequest code=200
[16:29:24]
[16:29:24][V][ota_http:141]: http GET finished.
[16:29:24][D][ota_http:150]: firmware is 1430848 bytes length.
[16:29:24][D][ota_http:044]: Using ArduinoESP32OTABackend
[16:29:24][ 73051][D][Updater.cpp:133] begin(): OTA Partition: app0
[16:29:24][VV][ota_http:163]: Got esp32 stream
[16:29:24][D][ota_http:207]: Progress: 0.1%
[16:29:25][D][ota_http:207]: Progress: 1.2%
[16:29:35][ 83860][V][WiFiGeneric.cpp:360] _arduino_event_cb(): STA Disconnected: SSID:

When I run a very basic device I can upgrade. Anything more serious the upgrade fails every time as an out-of-memory happens. Do you have any suggestions or ideas?

oarcher commented 11 months ago

What we need is a safe mode, like regular ota: when something goes wrong, the device reboot with minimal configuration, and retry ota. I will have to dig into the ota code to implement this.

So you are using the web server component on your device ? This component is also known to be very memory consuming. But if you have the web server with opened port on the firewall, you perhaps can try to push ota with the embedded ota support in the web server ?

ota_http doesn't seems to be very memory consuming by itself, but if you are using https instead of http, the underlying SSL library seems to be very consuming.

I've created a branch with some memory messages:

external_components:
    - source: github://oarcher/piotech@mem

In my case, in http, just after the GET, I got in the logs:

[22:35:43][D][ota_http:051]: Early init free heap: 213956 bytes
...
[22:35:43][V][ota_http:100]: http GET finished.
[22:35:43][D][ota_http:101]: heap delta from begining: 7568 bytes

but with SSL, I got:

[22:38:59][D][ota_http:051]: Early init free heap: 213896 bytes
...
[22:38:59][V][ota_http:100]: http GET finished.
[22:38:59][D][ota_http:101]: heap delta from begining: 46568 bytes

So https consume 40k more than http (x6) !

tomatensaus commented 11 months ago

Let me explain my usecase:

I have a bootstrap firmware that I load, it detects the make and model of the inverter and the ota_http works great for flashing the correct firmware. The web component merely serves as a tool for diagnosing problems on the hardware (and can be disabled). Nothing is port forwarded and the end user decide when an upgrade should be actioned.

I use a button to look for firmware om my website Text_template to display the status (to let the user know there is an update available or not) Another button to action the ota upgrade (if the text_template) is the correct state.

Once the firmware is running I was hoping to publish updates for the user to enhance firmware features but this runs out of memory every time. Digging through the source to understand a bit more I saw this code

void Application::run_safe_shutdown_hooks() {
  for (auto it = this->components_.rbegin(); it != this->components_.rend(); ++it) {
    (*it)->on_safe_shutdown();
  }
  for (auto it = this->components_.rbegin(); it != this->components_.rend(); ++it) {
    (*it)->on_shutdown();
  }
}

Now I wonder how possible it would be to shut down all non-essential components. Eg Web server and most entities (eg modbus, spi, i2c). Even perhaps the API feeding data to HA. Then action the firmware upgrade, once done the reboot will boot the latest upgrade.

The firmware upload from port 3232 always work for me, so I was hoping that ota_http would give me the same results. From the documentation:

ESPHome also has an “OTA safe mode”. If for some reason your node gets into a boot loop, ESPHome will
 automatically try to detect this and will go over into a safe mode after the configured unsuccessful boot attempts 
(Defaults to 10). In that mode, all components are disabled and only Serial Logging + Network(WiFi or Ethernet) + OTA 
are initialized, so that you can upload a new binary. You can trigger entering safe mode by either configuring a 
dedicated button or switch to do that or by pressing the reset button on the board for num_attempts times.

I will investigate if the issue can be solved with http

tomatensaus commented 11 months ago

My debug output from running with the mem branch

[13:26:29][D][ota_http:048]: Early init free heap: 104104
[13:26:29][D][ota_http:069]: Trying to connect to https://example.com/g.bin.bin
[13:26:29][592370][V][HTTPClient.cpp:252] beginInternal(): url: https://example.com/g.bin.bin
[13:26:29][592375][D][HTTPClient.cpp:263] beginInternal(): unexpected protocol: https, expected http
[13:26:29][592383][V][HTTPClient.cpp:252] beginInternal(): url: https://example.com/g.bin.bin
[13:26:29][592392][D][HTTPClient.cpp:303] beginInternal(): protocol: https, host: example.com port: 443 url: /g.bin.bin
[13:26:29][V][ota_http:084]: http begin successfull.
[13:26:29][D][ota_http:086]: free heap: 105800
[13:26:29][VV][ota_http:088]: http client setReuse.
[13:26:29][V][ota_http:092]: http headers collected.
[13:26:29][D][ota_http:093]: free heap: 105716
[13:26:29][592435][D][HTTPClient.cpp:598] sendRequest(): request type: 'GET' redirCount: 0
[13:26:29]
[13:26:29][592436][V][ssl_client.cpp:62] start_ssl_client(): Free internal heap before TLS 103388
[13:26:29][592442][V][ssl_client.cpp:68] start_ssl_client(): Starting socket
[13:26:29][592461][V][ssl_client.cpp:149] start_ssl_client(): Seeding the random number generator
[13:26:29][592462][V][ssl_client.cpp:158] start_ssl_client(): Setting up the SSL/TLS structure...
[13:26:29][592465][D][ssl_client.cpp:179] start_ssl_client(): WARNING: Skipping SSL Verification. INSECURE!
[13:26:29][592474][V][ssl_client.cpp:257] start_ssl_client(): Setting hostname for TLS session...
[13:26:29][592482][V][ssl_client.cpp:272] start_ssl_client(): Performing the SSL/TLS handshake...
[13:26:30][593817][V][ssl_client.cpp:293] start_ssl_client(): Verifying peer X.509 certificate...
[13:26:30][593818][V][ssl_client.cpp:301] start_ssl_client(): Certificate verified.
[13:26:30][593821][V][ssl_client.cpp:316] start_ssl_client(): Free internal heap after TLS 58516
[13:26:30][593829][D][HTTPClient.cpp:1156] connect():  connected to example.com:443
[13:26:30][593837][V][ssl_client.cpp:369] send_ssl_data(): Writing HTTP request with 174 bytes...
[13:26:29][592474][V][ssl_client.cpp:257] start_ssl_client(): Setting hostname for TLS session...
[13:26:29][592482][V][ssl_client.cpp:272] start_ssl_client(): Performing the SSL/TLS handshake...
[13:26:30][593817][V][ssl_client.cpp:293] start_ssl_client(): Verifying peer X.509 certificate...
[13:26:30][593818][V][ssl_client.cpp:301] start_ssl_client(): Certificate verified.
[13:26:30][593821][V][ssl_client.cpp:316] start_ssl_client(): Free internal heap after TLS 58516
[13:26:30][593829][D][HTTPClient.cpp:1156] connect():  connected to example.com:443
[13:26:30][593837][V][ssl_client.cpp:369] send_ssl_data(): Writing HTTP request with 174 bytes...
[13:26:30][593893][V][HTTPClient.cpp:1250] handleHeaderResponse(): RX: 'HTTP/1.1 200 OK'
[13:26:30][593894][V][HTTPClient.cpp:1250] handleHeaderResponse(): RX: 'Date: Mon, 13 Nov 2023 11:26:27 GMT'
[13:26:30][593898][V][HTTPClient.cpp:1250] handleHeaderResponse(): RX: 'Server: Apache'
[13:26:30][593905][V][HTTPClient.cpp:1250] handleHeaderResponse(): RX: 'Last-Modified: Fri, 10 Nov 2023 14:01:04 GMT'
[13:26:30][593914][V][HTTPClient.cpp:1250] handleHeaderResponse(): RX: 'Accept-Ranges: bytes'
[13:26:30][593922][V][HTTPClient.cpp:1250] handleHeaderResponse(): RX: 'Content-Length: 1430848'
[13:26:30][593929][V][HTTPClient.cpp:1250] handleHeaderResponse(): RX: 'Keep-Alive: timeout=5, max=100'
[13:26:30][593938][V][HTTPClient.cpp:1250] handleHeaderResponse(): RX: 'Connection: Keep-Alive'
[13:26:30][593945][V][HTTPClient.cpp:1250] handleHeaderResponse(): RX: 'Content-Type: application/octet-stream'
[13:26:30][593954][V][HTTPClient.cpp:1250] handleHeaderResponse(): RX: ''
[13:26:30][593959][D][HTTPClient.cpp:1307] handleHeaderResponse(): code: 200
[13:26:30][593965][D][HTTPClient.cpp:1310] handleHeaderResponse(): size: 1430848
[13:26:30][593971][D][HTTPClient.cpp:628] sendRequest(): sendRequest code=200
[13:26:30]
[13:26:30][V][ota_http:098]: http GET finished.
[13:26:30][D][ota_http:099]: free heap: 50300
[13:26:30][D][ota_http:114]: firmware is 1430848 bytes length.
[13:26:30][V][ota_http:061]: md5sum from received data initialized.
[13:26:30][594000][D][Updater.cpp:133] begin(): OTA Partition: app0
[13:26:30][V][ota_http:069]: OTA backend begin
[13:26:30][I][ota_http:093]: Progress: 0.1%
[13:26:35]E (607887) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
[13:26:35]E (607887) task_wdt:  - loopTask (CPU 1)
[13:26:35]E (607887) task_wdt: Tasks currently running:
[13:26:35]E (607887) task_wdt: CPU 0: IDLE
oarcher commented 11 months ago

I still think that the best way to solve this issue, is to implement a safe mode, to perform the flash at reboot. I will need some time to implement this, but unfortunately, I have to work on a professional project for the moment.

However, there is perhaps a 100% yaml implementation of this feature. We can just save a boolean value to flash memory, and then check it at an early boot stage, to do the real job. Hopefully, there will be more available memory.

Here is the yaml:

esphome:
  name: ota_http_test
  on_boot:
    priority: 240
    then:
      - if:
          condition:
            lambda: 'return id(ota_http_safe);'
          then:
            - logger.log: "ota http safe mode"
            - globals.set:
                id: ota_http_safe
                value: '0'
            - delay: 1s # let `preferences` component to handle this global
            - ota_http.flash:
                url: https://example.com/firmware.bin
                verify_ssl: false
            - logger.log: "This message should be not displayed(reboot)"
          else:
            - logger.log: "ota http safe mode not needed. Continuing"

globals:
  - id: ota_http_safe
    type: bool
    restore_value: yes
    initial_value: '0'

ota_http:

button:
  - platform: template
    name: "Firmware update"
    on_press:
      then:
        - ota_http.flash:
            url: https://exemple.com/firmware.bin
            verify_ssl: false
        - logger.log: "This message should be not displayed(reboot)"

  - platform: template
    name: "Firmware update (safe mode)"
    on_press:
      then:
        - globals.set:
            id: ota_http_safe
            value: '1'
        - logger.log: "Will perform ota_http update at next reboot."
        - button.press: reboot
        - logger.log: "This message should be not displayed(reboot)"

  - platform: restart
    id: reboot
tomatensaus commented 11 months ago

Thank you! That is a very creative solution! Maybe worth adding it to the documentation. There is already a "safe mode" so you should perhaps call it "flash mode"

I assume you have used this workaround already I will see if this works for my use case!

Edit: Tested and it actually worked. Module disappeared from the network as 2 minutes later it joined the network again with the new firmware.

oarcher commented 10 months ago

Cool !

There is a critical section in on_boot when the global ota_http_safe is set to false. This variable is in memory, and it's flushed onto flash every minutes.

So if something goes wrong in ota_http.flash action, the device will enter a reboot loop with ota_http_safe as true, and will crash again, until it give up and enter the regular ota safe mode (after 10 crash). So if you don't have access to the regular ota port because of a firewall, it will be a problem.

It seems that there is no option to force write to flash at the yaml level, but I think that this could be done on the C side.

oarcher commented 10 months ago

I've pushed a new commit on the mem branch to try to save globals onto flash, to prevent boot loop if ota_http have an oom.

This currently doesn't works very well, and I had to insert a delay: 1s in the previous yaml example to allow the preferences component to handle this var.

I think that I have to do that because the preferences is not completely initialized at this boot stage.

Another problem is that the boot stage is not blocking for other components, for example the web_server that is memory consuming continue to initialize, and so more time is given to it with the delay: 1s trick...

So it's a little progress, but I'd like to find a way to block other components initialisation, except for needed ones like the preferences component.

I will have to fully understand how safe_mode works in the regular ota, has they are able to have such behaviour.

tomatensaus commented 10 months ago

Thanks! I have been really busy but hopefully I can find some time to look at the source code and test on my full software stack to see if it gets stuck in a boot-loop at all.

Thank you for the effort, much appreciated!

oarcher commented 10 months ago

Wait for my next comment before testing, because I think that I am going to be able to push a new commit that reboot the device in a real blocking safe mode, and retry the flash.

This solution will be 100% C, with no yaml tricks!

oarcher commented 10 months ago

So here is the current solution to solve the problems described above:

It's a basic safe_mode implementation: If http_ota fail, il will be restarted at boot time, with minimal components loaded.

Note that I have implemented this in https://github.com/esphome/esphome/pull/5586, not in the mem branch.

external_components:
  - source: github://pr#5586
      components: [ ota_http ]

ota_http:
  safe_mode: fallback # retry at reboot after crash

button:
  - platform: template
    name: "Firmware update"
    on_press:
      then:
        - ota_http.flash:
            url: https://example.com/firmware.bin
            verify_ssl: no  # must be explicitly set to `no` if https is used.
        - logger.log: "This message should be not displayed(reboot)"
Alphaemef commented 10 months ago

did you change "Verify_ssl" to "disable_ssl"?

Also, if we template the url so it fetches from a sensor, then once the unit reboots after fail, that sensor state will be wiped. So it will fail with flashing right ? Or how did you code the behaviour ?

Seriously though: This is amazing work.

oarcher commented 10 months ago

@Alphaemef , you're right, it's verify_ssl, not disable_ssl. I've edited my yaml above to correct this. I've confused with esp8266_disable_ssl_support that is also an option. In fact, http options are the same as the htpp_request component.

And yes, currently the safe mode flash will fail if some values come from templates. To handle this, I will have to save them to flash. I will try to implement this soon.

oarcher commented 10 months ago

ok, support for templated and sensor url is now implemented.

There is currently a limitation of the url length to 128. I will have to add an url_max_length configuration option if we want to change it.

Alphaemef commented 10 months ago

I think you are safe with 128 char urls :) Should just be in docs.

Do you know if there are illegal characters ? For instance I can see that "/" ruins if its part of the password. Im guessing ":" and "@" the same. Any others I should be weary of ? Could be added to docs as well.

oarcher commented 10 months ago

Have you tried percent-encoding the password ? ie '%2F' instead of '/'. Username and password are handled by the underlying http library, and ota_http feed username/password as is.

So general percent-encoding should work: https://developer.mozilla.org/en-US/docs/Glossary/percent-encoding

Alphaemef commented 10 months ago

Yeah, but Im rrusin the password the units have to Connect to the MQTT broker, and thats saves in a substitution ${pass} and same with the username.

oarcher commented 10 months ago

ok, so I will just add this to the doc, as it's a quite specific use case.

Alphaemef commented 10 months ago

Oh yeah absolutely. Was just wondering if you knew which characters I should avoid in passwords besides / : and @.

oarcher commented 10 months ago

I've included https://en.wikipedia.org/wiki/Percent-encoding in the doc. Currently, the README.txt is out of date, as the doc is in PR https://github.com/esphome/esphome-docs/pull/3291

oarcher commented 10 months ago

I've edited my yaml above because safe_mode value is now fallback (by default), to retry the ota if it fail during action.

Other values are yes (always reboot before ota), or no, (never perform ota at boot).

@tomatensaus , @Alphaemef , are you ok if I close this issue ?

Alphaemef commented 10 months ago

yes, it became a miss mash of stuff on my behalf, thank once more and great work.

Just in case anyone stumbles on the percent-encoding conundrum later: My testing showed this: without percent-encoding, these characters work: =!-_.~()*+,; These do NOT work: : / @ {}[]? These I didn't test, but almost certainly do NOT work: ’&#$%"'

tomatensaus commented 10 months ago

I am not quite sure if I should rather be using the PR branch on esphome or this project, where do I find the most up to date code to test? It seems like you archiving this project?

oarcher commented 10 months ago

You should use the PR branch. I've synced this repo with the PR branch, and updated the README of this repo to the PR.

BTW, the last documentation is here: https://deploy-preview-3291--esphome.netlify.app/components/ota_http

tomatensaus commented 10 months ago

Wow! I love where this is going! I will test the PR branch then

I think your documentation is lacking the dependency to the http request that needs to be compiled into the project Or do you not need to define this?

http_request:
  useragent: esphome/esp32dev"
  id: http_request_data
  timeout: 5s
  follow_redirects: false

I also noticed this spelling mistake in the log message:

10:29:21    [I] [ota_http:061]  Rebotting before flashing new firmware.
10:29:21    [I] [ota_http:061]  Rebotting before flashing new firmware.
10:29:21    [I] [app:134]   
Rebooting safely...
oarcher commented 10 months ago

http_request is not an http_ota dependency, you can remove it.

The typo is fixed.

tomatensaus commented 10 months ago

One last thing, maybe mention in the documentation that if you flash the images without wifi credentials and configure them via the the below 2 options they get saved between flashes allowing a 3rd party to upgrade and keep their wifi details. Took me 2 days to find the code in the wifi source

captive_portal:

improv_serial:

Also wanted to say that the PR branch works beautifully on my ESP32. Did 8 flashes in a row trying various options and all worked perfectly. Will soon test it on my main firmware to allow in-place upgrades. Thank you for the effort in creating something very useful!