theelims / ESP32-sveltekit

A simple and extensible framework for ESP32 based IoT projects with a feature-rich, beautiful, and responsive front-end build with Sveltekit, Tailwind CSS and DaisyUI. This is a project template to get you started in no time with a fully integrated build chain.
https://theelims.github.io/ESP32-sveltekit/
Other
90 stars 15 forks source link

TLS handshake crash on esp32-dev platform github OTA update #32

Closed leokeba closed 2 months ago

leokeba commented 3 months ago

Hi, I've been trying to add OTA support to my project via github releases, but as soon as the TLS handshake starts I get a crash.

[500440][V][ssl_client.cpp:62] start_ssl_client(): Free internal heap before TLS 90848
[500441][V][ssl_client.cpp:68] start_ssl_client(): Starting socket
[500467][V][ssl_client.cpp:146] start_ssl_client(): Seeding the random number generator
[500469][V][ssl_client.cpp:155] start_ssl_client(): Setting up the SSL/TLS structure...
[500473][V][ssl_client.cpp:190] start_ssl_client(): Attaching root CA cert bundle
[500479][V][ssl_client.cpp:254] start_ssl_client(): Setting hostname for TLS session...
[500487][V][ssl_client.cpp:269] start_ssl_client(): Performing the SSL/TLS handshake...
[500744][D][esp_crt_bundle.c:108] esp_crt_verify_callback(): 48 certificates in bundle
[500747][E][esp_crt_bundle.c:74] esp_crt_check_signature(): PK verify failed with error FFFFBD70
[500750][E][esp_crt_bundle.c:147] esp_crt_verify_callback(): Failed to verify certificate
[500762][E][ssl_client.cpp:37] _handle_error(): [start_ssl_client():273]: (-12288) X509 - A fatal error occurred, eg the chain is too long or the vrfy callback failed
[500772][E][WiFiClientSecure.cpp:144] connect(): start_ssl_client: -12288
[500782][V][ssl_client.cpp:321] stop_ssl_socket(): Cleaning SSL connection.
[500786][D][HTTPClient.cpp:1163] connect(): failed connect to github.com:443
[500792][W][HTTPClient.cpp:1483] returnError(): error(-1): connection refused
[500799][E][HTTPUpdate.cpp:234] handleUpdate(): HTTP error: connection refused

[500806][D][HTTPClient.cpp:408] disconnect(): tcp is closed

[500812][V][ssl_client.cpp:321] stop_ssl_socket(): Cleaning SSL connection.
Guru Meditation Error: Core  1 panic'ed (LoadProhibited). Exception was unhandled.

Core  1 register dump:
PC      : 0x400e29f8  PS      : 0x00060430  A0      : 0x800ec2d0  A1      : 0x3fff5650  
A2      : 0x00000060  A3      : 0x3fffff70  A4      : 0x3fff576c  A5      : 0x0007a453  
A6      : 0x00000000  A7      : 0x3ffc62c0  A8      : 0x800e29f8  A9      : 0x3fff5610  
A10     : 0x3fff565c  A11     : 0xfc8d3b73  A12     : 0xfc8d3b73  A13     : 0x3ffc62c0  
A14     : 0x00000000  A15     : 0x3fff565c  SAR     : 0x00000019  EXCCAUSE: 0x0000001c  
EXCVADDR: 0x00000060  LBEG    : 0x4008a3e8  LEND    : 0x4008a3f2  LCOUNT  : 0x00000000  

Backtrace: 0x400e29f5:0x3fff5650 0x400ec2cd:0x3fff5690 0x400f2811:0x3fff56b0

  #0  0x400e29f5:0x3fff5650 in PsychicEventSource::send(char const*, char const*, unsigned int, unsigned int) at lib/PsychicHttp/src/PsychicEventSource.cpp:121
  #1  0x400ec2cd:0x3fff5690 in NotificationEvents::send(String, String, int) at lib/framework/NotificationEvents.cpp:64
  #2  0x400f2811:0x3fff56b0 in updateTask(void*) at lib/framework/DownloadFirmwareService.cpp:80 (discriminator 3)

The error FFFFBD70 seems to point to a memory issue like running out of heap space, but several facts make this a little suspicious. First, I have about 90kB of free heap available before the handshake, and the documentation suggest I need about 42kB in the worst-case scenario. Also, I once had an error clearly indicating I was running out of heap, and this is a different one. I also tried adding the flags suggested here to sdkconfig.defaults, but that did not have an effect.

I will keep investigating this issue myself, but if anybody has a clue about where to look I'll take it.

Thanks.

theelims commented 3 months ago

@leokeba you could try to comment out the line 80 in DownloadFirmwareService.cpp. I suspect, that this is causing the actual crash. You'll get a more verbose error message on the serial log if it doesn't crash with SSE.

theelims commented 3 months ago

@leokeba can you please post some more details? Which ESP are you using? I just tested it with an ESP32-S3 and it validated all 48 certificates. To me it looks like somehow the certificate creation process produced garbage. Can you please check your build log for these lines?

SSL Cert Store: Certificate bundle downloaded to: ssl_certs\cacert.pem
SSL Cert Store: Parsing certificates from ssl_certs\cacert.pem
SSL Cert Store: Successfully added 48 certificates
SSL Cert Store: Successfully added 48 certificates in total
SSL Cert Store: Successfully created src\certs\x509_crt_bundle.bin

For me the OTA process works through the complete SSL part flawless. It receives the redirect from github and starts to download the binary. You can try to just use the single certificate used by github, instead of the bundle.

image In my test SSL increased the heap by ~70kb. But still enough headroom.

However, it crashes when using SSE to notify the frontend about the progress, too. But this part is due for a rewrite anyway #29.

leokeba commented 3 months ago

I'm using a good old esp32-dev "WROOM" module. Here's the relevant build log part :

SSL Cert Store: Certificate bundle downloaded to: ssl_certs/cacert.pem
SSL Cert Store: Parsing certificates from ssl_certs/cacert.pem
SSL Cert Store: Successfully added 48 certificates
SSL Cert Store: Successfully added 48 certificates in total
SSL Cert Store: Successfully created src/certs/x509_crt_bundle.bin
CONFIGURATION: https://docs.platformio.org/page/boards/espressif32/esp32dev.html
PLATFORM: Espressif 32 (6.4.0) > Espressif ESP32 Dev Module
HARDWARE: ESP32 240MHz, 320KB RAM, 4MB Flash

You are right about the crash, I commented out line 80 and it doesn't crash anymore. Here's the relevant serial output :

[ 39947][D][HTTPClient.cpp:598] sendRequest(): request type: 'GET' redirCount: 0
[ 39959][V][ssl_client.cpp:62] start_ssl_client(): Free internal heap before TLS 88152
[ 39960][V][ssl_client.cpp:68] start_ssl_client(): Starting socket
[ 39985][V][ssl_client.cpp:146] start_ssl_client(): Seeding the random number generator
[ 39987][V][ssl_client.cpp:155] start_ssl_client(): Setting up the SSL/TLS structure...
[ 39992][V][ssl_client.cpp:190] start_ssl_client(): Attaching root CA cert bundle
[ 39997][V][ssl_client.cpp:254] start_ssl_client(): Setting hostname for TLS session...
[ 40005][V][ssl_client.cpp:269] start_ssl_client(): Performing the SSL/TLS handshake...
[ 40969][D][esp_crt_bundle.c:108] esp_crt_verify_callback(): 48 certificates in bundle
[ 40972][E][esp_crt_bundle.c:74] esp_crt_check_signature(): PK verify failed with error FFFFBD70
[ 40975][E][esp_crt_bundle.c:147] esp_crt_verify_callback(): Failed to verify certificate
[ 40988][E][ssl_client.cpp:37] _handle_error(): [start_ssl_client():273]: (-12288) X509 - A fatal error occurred, eg the chain is too long or the vrfy callback failed
[ 40997][E][WiFiClientSecure.cpp:144] connect(): start_ssl_client: -12288
[ 41007][V][ssl_client.cpp:321] stop_ssl_socket(): Cleaning SSL connection.
[ 41011][D][HTTPClient.cpp:1163] connect(): failed connect to github.com:443
[ 41017][W][HTTPClient.cpp:1483] returnError(): error(-1): connection refused
[ 41024][E][HTTPUpdate.cpp:234] handleUpdate(): HTTP error: connection refused
[ 41031][D][HTTPClient.cpp:408] disconnect(): tcp is closed
[ 41037][V][ssl_client.cpp:321] stop_ssl_socket(): Cleaning SSL connection.
[ 41044][E][DownloadFirmwareService.cpp:82] updateTask(): [Download OTA] HTTP Update failed with error (-1): HTTP error: connection refused
HTTP Update failed with error (-1): HTTP error: connection refused
[ 59591][I][PsychicHttpServer.cpp:261] closeCallback(): [psychic] Client disconnected 53

Doesn't seem to get much more verbose, would there be a way for me to validate if the certificate bundle I'm making is correct ?

theelims commented 3 months ago

You could try to change the bundle creation in platformio.ini to board_ssl_cert_source = folder.

leokeba commented 2 months ago

Hi, I had time to investigate a bit more and made some progress. The main culprit looks more and more like a memory issue, the requirements specified by the Mbed TLS documentation most likely don't take into account lots of other things happening on the MCU at the same time.

After lots of trial and error, I was able to make it work once using the latest version of template project. I had to remove all unnecessary features (security, analytics...), which helped it pass all the TLS handshakes with github, but then it would still crash during download. Finally, I commented out all notificationEvents in DownloadFirmwareService.cpp, and I had to close the browser tab very quickly after triggering the download so there was no websocket left open or any HTTP request going through during the download.

The handshake errors I'm getting still seem quite random, but there are some tendencies. I get a clear memory allocation issue if there is less than 90kb of free heap available before triggering the download. Between 90 and 120kb I either get the same error as before, something about a BigInt or the memory allocation error. Above 120kb, most of the time it gets through everything but crashes during download.

I suspect that this behaviour could be related to #39, and as you said #29 could very well improve the situation as well. I will try again as soon as there is good progress on these fronts.

In the end, my goal here would be to find stable minimum requirements, and maybe implement a check triggering a warning or an error if those requirements are not met.

theelims commented 2 months ago

Have you tried: board_ssl_cert_source = folder in platformio.ini?

Could maybe ease the memory demand if only one certificate needs to be loaded.

If it aborts during download it could be related to task priorities. Have you running anything in the loop()-function or any other task on core 1?

leokeba commented 2 months ago

I'm using the base template from the current main branch for most of my testing around this issue so no loop task. I just change the repo url so I can get a firmware to download for the right board version. In this context, I was able to get it to work reliably by commenting out all _notificationEvents->send calls in DownloadFirmwareService.cpp.

On my personal project fork it took me a little more research, but after a few rabbit holes and dead ends I found out all I had to do was to upgrade platformio to the latest espressif32 platform release (6.6.0) and it suddenly worked. Maybe that's because it also somehow made appear almost 50kb of free heap, but still nice bonus if not. I still have to leave the notificationEvents commented out though.

Changing board_ssl_cert_source did not seem to have a significant effect in my testing by the way.

theelims commented 2 months ago

Good point. I'll fix the platform version in the next round. Seems I'm still on 6.4.0.

Anyway, problem is solved. The bug with the notification events will be address by #29 anyway, soon.