homieiot / homie-esp8266

💡 ESP8266 framework for Homie, a lightweight MQTT convention for the IoT
http://homieiot.github.io/homie-esp8266
MIT License
1.36k stars 308 forks source link

[develop-v3] OTA failing #644

Closed nemidiy closed 4 years ago

nemidiy commented 4 years ago

Hi people,

I've been hitting my head against the wall trying to OTA update an ESP8266. The code is minimalistic

#include "Arduino.h"
#include <Homie.h>

HomieNode sensorNode("sensor", "Sensor", "sensor");

unsigned long lastSent = 0;

void loopHandler() {
  unsigned long m = millis();
  if ( m - lastSent >= 10 * 1000UL || lastSent == 0) {
    Homie.getLogger() << "Fake value: " << m << endl;
    sensorNode.setProperty("value").send(String(m));
    lastSent = millis();
  }
}

void setup() {
  Serial.begin(115200);
  Homie.getLogger() << "kkkkkkkkkkkkkkkkkkkkkk" << endl;
  Homie.getMqttClient().setKeepAlive(120);
  Homie_setFirmware("test", "1.0.0");
  Homie.setLoopFunction(loopHandler);
  sensorNode.advertise("value").setName("Value").setDatatype("double");
  Homie.setup();
}

void loop(){
  Homie.loop();
}

The platformio.ini file is :

[env:nodemcuv2]
platform = espressif8266
board = nodemcuv2
framework = arduino
build_flags =
    -D HOMIE_MDNS=0
    -D PIO_FRAMEWORK_ARDUINO_LWIP2_LOW_MEMORY
upload_speed = 115200
lib_ldf_mode = deep
lib_deps =
    ArduinoJson@6.13.0
    https://github.com/homieiot/homie-esp8266.git#develop-v3

I also repartitioned the mcu to make sure there is enough space:

# Name,   Type, SubType, Offset,  Size, Flags
nvs,      data, nvs,     0x9000,  0x6000,
phy_init, data, phy,     0xf000,  0x1000,
factory,  app,  factory, 0x10000, 3M,

Here are my libs :

Dependency Graph
|-- <ArduinoJson> 6.13.0
|-- <Homie> 3.0.0 #7818c73
|   |-- <ArduinoJson> 6.13.0
|   |-- <AsyncMqttClient> 0.8.2
|   |   |-- <ESPAsyncTCP> 1.2.2
|   |   |   |-- <ESP8266WiFi> 1.0
|   |-- <Bounce2> 2.52
|   |-- <ESP Async WebServer> 1.2.3
|   |   |-- <ESPAsyncTCP> 1.2.2
|   |   |   |-- <ESP8266WiFi> 1.0
|   |   |-- <Hash> 1.0
|   |   |-- <ESP8266WiFi> 1.0
|   |   |-- <ArduinoJson> 6.13.0
|   |-- <ESP8266mDNS> 1.2
|   |   |-- <ESP8266WiFi> 1.0
|   |-- <ESP8266WiFi> 1.0
|   |-- <DNSServer> 1.1.1
|   |   |-- <ESP8266WiFi> 1.0
|   |-- <ESPAsyncTCP> 1.2.2
|   |   |-- <ESP8266WiFi> 1.0
|   |-- <Ticker> 1.0
|   |-- <ESP8266HTTPClient> 1.2
|   |   |-- <ESP8266WiFi> 1.0

Also for some reason, not sure why, platform IO is not recognizing the size of my partitions ...

DATA:    [=====     ]  48.9% (used 40048 bytes from 81920 bytes)
PROGRAM: [=====     ]  49.8% (used 520632 bytes from 1044464 bytes)

Shouldn't it read 3M and not 1MB (1044464 bytes)

Anyways, then what I do is modify the sketch to change the logging line after initializing Serial. Then build and then invoke :

$ python ota_updater.py -l 192.168.0.227 -p 1883 -t devices/ -i testbox /home/nemi/workspace/green-jac/homie_i2c/.pio/build/nodemcuv2/firmware.bin
Connecting to mqtt broker 192.168.0.227 on port 1883
Connected with result code 0
Waiting for device to come online...
Waiting for device info...
Publishing new firmware with checksum 3373df653b869cf431689d254d3b1e37
[                              ] 989/545344Waiting for device info...
[                              ] 989/545344Expecting checksum 3373df653b869cf431689d254d3b1e37, got b6b11dc49ab3c6400b1995025dc996f1, update failed!

Sometimes the serial output shows and blocks:

Receiving OTA payload
↕ OTA started
Triggering OTA_STARTED event...
Firmware is binary
✖ OTA failed (500 INTERNAL_ERROR 11)
Triggering OTA_FAILED event...

ERROR 11 according to this means UPDATE_ERROR_BAD_ARGUMENT

Some other times I get :

Receiving OTA payload
↕ OTA started
Triggering OTA_STARTED event...
Firmware is binary
Receiving OTA firmware (453/545344)...
Receiving OTA firmware (989/545344)...

 ets Jan  8 2013,rst cause:4, boot mode:(3,6)

wdt reset
load 0x4010f000, len 1384, room 16 
tail 8
chksum 0x2d
csum 0x2d
v951aeffa
~ld

Here is what gets published :

------- SOMETHING THAT SEEMS TO BE THE BINARY BEING SENT OVER MQTT --------
devices/testbox/$implementation/ota/status 202
devices/testbox/$implementation/ota/status 206 453/545344
devices/testbox/$implementation/ota/status 206 989/545344
------- RESET -------
devices/testbox/$state init
devices/testbox/$homie 3.0.1
devices/testbox/$name Test
....

As u see in the output the board restarts but with the old firmware, it's like it starts getting the binary but then decides (?) to restart before it got it all. I read through this other issue https://github.com/homieiot/homie-esp8266/issues/613 and tried the keep alive thing but did not work either

Then I also added the flag but it makes it even worse

-D HOMIE_CONFIG=0

I get the following crash if I flash the board with a firmware built with that flag and then OTA:

0x40225144: operator delete[](void*) at ../../../../../dl/gcc-xtensa/libstdc++-v3/libsupc++/del_opv.cc line 33
0x40225174: operator new(unsigned int) at ../../../../../dl/gcc-xtensa/libstdc++-v3/libsupc++/new_op.cc line 45
0x40225135: std::set_terminate(void (*)()) at ../../../../../dl/gcc-xtensa/libstdc++-v3/libsupc++/eh_terminate.cc line 70
0x40225120: __fpclassifyd at ../../../../../../dl/newlib-xtensa/newlib/libm/common/s_fpclassify.c line 26
0x402252b1: glue2esp_linkoutput at glue-esp/lwip-esp.c line 239
0x40225533: new_linkoutput at glue-lwip/lwip-git.c line 216
0x40225174: operator new(unsigned int) at ../../../../../dl/gcc-xtensa/libstdc++-v3/libsupc++/new_op.cc line 45
0x4022ce14: etharp_output_to_arp_index at core/ipv4/etharp.c line 764
0x4022e6da: inet_chksum_pseudo at core/inet_chksum.c line 266
0x4022dba2: ip4_output_if_opt at core/ipv4/ip4.c line 820
0x4022e717: inet_chksum_pseudo at core/inet_chksum.c line 327
0x4022a8a2: tcp_output at core/tcp_out.c line 1312
0x40228ba4: tcp_input at core/tcp_in.c line 487
0x4022d919: ip4_input at core/ipv4/ip4.c line 696
0x4022589d: ethernet_input_LWIP2 at netif/ethernet.c line 170
0x402256d3: esp2glue_alloc_for_recv at glue-lwip/lwip-git.c line 419
0x4024a51e: ethernet_input at glue-esp/lwip-esp.c line 353
0x4024a52f: ethernet_input at glue-esp/lwip-esp.c line 355
0x40225135: std::set_terminate(void (*)()) at ../../../../../dl/gcc-xtensa/libstdc++-v3/libsupc++/eh_terminate.cc line 70
0x40225120: __fpclassifyd at ../../../../../../dl/newlib-xtensa/newlib/libm/common/s_fpclassify.c line 26
0x40225135: std::set_terminate(void (*)()) at ../../../../../dl/gcc-xtensa/libstdc++-v3/libsupc++/eh_terminate.cc line 70
0x40225174: operator new(unsigned int) at ../../../../../dl/gcc-xtensa/libstdc++-v3/libsupc++/new_op.cc line 45

Any ideas what might be wrong or what else to try before diving into the code? I still have to try with ESP32 to see if the same thing happens. Thanks!

MajorTwip commented 4 years ago

Could it be linked to #613?

nemidiy commented 4 years ago

Hi @MajorTwip It looks like it is, although I read through it and tried setting the keep alive to 120 as you did but did not change anything. In #641 @sradner13 is saying that the same commit that broke the booting for ESP32 also compromised OTA. So I am going to try out what he says and come back.

sradner13 commented 4 years ago

Thanks for looking into this @nemidiy I can test OTA changes on esp8266 fairly easily.

When I use the latest library OTA crashes, usually on a watchdog timer exception. Using the exception decoder, the error is almost never in the same place, and is very deep in the AsyncMQTT library. If I back out the change everything works fine and OTA is reliable. I'm guessing the code to copy the topic is slowing down the calls just enough to hit the watchdog timer. Maybe there is a more efficient way to copy the topic?

nemidiy commented 4 years ago

I just verified what @sradner13 says, rolled back homie to commit 71a766b and OTA worked. As a note, I do get a crash after the FW has been flashed (not sure if that's supposed to happen) but then reboots to the new FW just fine

Receiving OTA firmware (540516/545264)...
Receiving OTA firmware (541052/545264)...
Receiving OTA firmware (541588/545264)...
Receiving OTA firmware (542124/545264)...
Receiving OTA firmware (542660/545264)...
Receiving OTA firmware (543196/545264)...
Receiving OTA firmware (543732/545264)...
Receiving OTA firmware (544268/545264)...
Receiving OTA firmware (544804/545264)...
Receiving OTA firmware (545264/545264)...
✔ OTA succeeded
Triggering OTA_SUCCESSFUL event...
Device is idle
↻ Rebooting...
✖ Wi-Fi disconnected, reason: 8
Triggering WIFI_DISCONNECTED event...
↕ Attempting to connect to Wi-Fi...

Exception (28):
epc1=0x4000df2f epc2=0x00000000 epc3=0x00000000 excvaddr=0x00000033 depc=0x00000000

>>>stack>>>

ctx: sys
sp: 3fffea40 end: 3fffffb0 offset: 01a0
3fffebe0:  402564f7 00000001 00000000 40264ef2  
3fffebf0:  00000033 00000010 40256e0d 3ffee754  
....
<<<stack<<<

 ets Jan  8 2013,rst cause:2, boot mode:(3,7)

load 0x4010f000, len 1384, room 16 
tail 8
chksum 0x2d
csum 0x2d
v951aeffa
@cp:0
ld
💡 Firmware test (1.0.0)
🔌 Booting into normal mode 🔌
{} Stored configuration
  • Hardware device ID: cc50e33ca8c5
  • Device ID: testbox

The decoded ex :


Exception 28: LoadProhibited: A load referenced a page mapped with an attribute that does not permit loads
PC: 0x4000df2f
EXCVADDR: 0x00000033

Decoding stack results
0x4024cfdc: udp_sendto_if_src at core/udp.c line 893
0x4024d033: udp_sendto_if at core/udp.c line 692
0x4024fa34: dhcp_renew at core/ipv4/dhcp.c line 1183
0x40236e31: operator delete(void*) at ../../../../../dl/gcc-xtensa/libstdc++-v3/libsupc++/del_op.cc line 48
0x40236e1c: operator delete[](void*) at ../../../../../dl/gcc-xtensa/libstdc++-v3/libsupc++/del_opv.cc line 33
0x402487e9: netif_sta_status_callback at glue-lwip/lwip-git.c line 283
0x40236e31: operator delete(void*) at ../../../../../dl/gcc-xtensa/libstdc++-v3/libsupc++/del_op.cc line 48
0x40236e70: operator new(unsigned int) at ../../../../../dl/gcc-xtensa/libstdc++-v3/libsupc++/new_op.cc line 52

@sradner13 I would be really surprised about std::unique_ptr + memcpy slowing things down to the point where the wd timer is triggered. Do you have one of the decoded exceptions you are seeing to share ?

luebbe commented 4 years ago

@nemidiy I thought the same. Arduino OTA works fine for me with the PR. I honestly never use homie OTA via MQTT, but this sends large chunks of data, so it could become a problem when it is copied around. It runs asynchronous and IIRC somewhere in the homie docs it is said that you shouldn't do "expensive" stuff in the MQTT callbacks. The crash when rebooting after OTA happens all the time. I guess that something isn't cleaned up properly.

nemidiy commented 4 years ago

@luebbe @sradner13 thanks for the input. the memcpy code is copying the mqtt topic, not the payload. https://github.com/homieiot/homie-esp8266/blob/develop-v3/src/Homie/Boot/BootNormal.cpp#L786 The topic should be string no more that 50 chars. Anyways, since I am the rookie here I will time that piece of code to see how much time it takes. Regardless there is clearly something wrong on the copying that's causing the issue. I will do some debugging and come back. My end goal is having OTA over SSL working eventually.

luebbe commented 4 years ago

@nemidiy your support is greatly appreciated!

nemidiy commented 4 years ago

Hey guys, I found the problem and learned a few things along the way.

The hardware WD timer was being hit as direct consequence of a memory bug. I timed the memcpy + unique_ptr piece of code to be aprox at 50 microsecs.

Here is what's wrong : When you get to ota topic that has as payload the binary firmware it first goes through

void BootNormal::_onMqttMessage(char* topic, char* payload, AsyncMqttClientMessageProperties properties, size_t len, size_t index, size_t total) {
  if (total == 0) return;  // no empty message possible

  size_t topicLength = strlen(topic);
  _mqttTopicCopy = std::unique_ptr<char[]>(new char[topicLength+1]);
  memcpy(_mqttTopicCopy.get(), topic, topicLength);
  _mqttTopicCopy.get()[topicLength] = '\0';

  // split topic on each "/"
  if (index == 0) {
    __splitTopic(_mqttTopicCopy.get());
  }

  // 1. Handle OTA firmware (not copied to payload buffer)
  if (__handleOTAUpdates(_mqttTopicCopy.get(), payload, properties, len, index, total))
    return;
  1. copies the topic into _mqttTopicCopy
  2. since index == 0 calls splitTopic. That function uses the internal object (char*) of _mqttTopicCopy to tokenize the topic and then store all the tokens in _mqttTopicLevels.

The bug resides in that the tokens are pieces of the internal buffer copied into _mqttTopicCopy. When doing OTA, the _onMqttMessage is "artificially" invoked for the chunked FW (here), every time a new call to _onMqttMessage is done kills the previous instance of the _mqttTopicCopy (here) and thus the pointers that were stored on _mqttTopicLevels become invalid, leading to undefined behavior and eventually a timeout.

I'll work on a fix and send the PR.

Thanks !

mkfrey commented 4 years ago

Thanks for your effort! I'm excited to see your fix for this.

In my opinion the copying of the topic should be removed. The reason it was introduced was to be able to receive messages directly via the MQTT client without the topics being modified by Homie. But since Homie already does the connection housekeeping and does provide a topic structure I think it would make more sense to add an API to send/receive messages through Homie instead of directly calling AsyncMqttClient.

nemidiy commented 4 years ago

hey @codefrey thanks for giving me the context behind that piece of code. To be honest I just worked on the fix assuming that copying the mqtt topic was good (again I am new here and lack the big picture). Here is what I was going for : https://github.com/nemidiy/homie-esp8266/commit/1a8d2e9d1b441d082877ad222f5f8408d51fb08b

I partially understand what you are saying in the sense that I see in that the AsyncMqttClient get's invoked across the Homie code instead of having Homie itself a bit more abstracted from the MQTT cli and maybe have a few methods (API) in Homie encapsulating the AsyncMqttClient to send and receive (so far so good?).

What I don't entirely get is how does the API you are suggesting ties back to the topic copying code. My intuition says that is about actually owning as an entity the topic buffer and thus being able to do with it what you want and not worrying about breaking someone else's buffer (in this case AsyncMqttClient's). Could you elaborate a bit on that ?

Also to be honest the change you are suggesting seems to be a bit more foundational than what I would be comfortable doing now given I am new to the project. Would gladly help out but will need guidance.

mkfrey commented 4 years ago

@nemidiy The change adding the topic copy was just recently introduced (see #526). The ability to modify the topic buffer is indended by the author of AsyncMqttClient, which is signalled by the topic being char* instead of const char*.

The change added quite some overhead, since every single call to _onMqttMessage() would cause a memory allocation and subsequent free to happen which is imo just pure insanity from a performance standpoint. Also the added complexity will lead to errors like the one in this issue.

Your commit will further add complexity. Your proposed changes could also be simplified by just moving the copying into the index==0 condition right before the call to __splitTopic(), which would have the same effect while at the same time providing performance benefits.

Now to your last question: My proposal to introduce the API would cover the original use case in #524, and therefore eliminate the need to directly access the MQTT client. This would make the changes in #526 unnecessary and spare us of this whole ordeal.

Aside from this topic, the current code seems to assume that AsyncMqttClient calls its message callback functions sequentially for messages with different topics. Is this actually the case?

mkfrey commented 4 years ago

@nemidiy Those 50 microseconds add up, especially with larger/splitted payloads such as OTA and take the time from other software components such as the Wi-Fi stack.

But why would we have to replace the call to __splitTopic(_mqttTopicCopy.get())? If we replaced it by __splitTopic(topic) then we would get the effect described in #524, because we would start splitting the original message buffer again.

nemidiy commented 4 years ago

@codefrey sorry for the last comment I submited, it was unfinished and I was actually reviewing it, and it seems not only I closed the issue but also hit the comment button (both unwillingly), this week my brain has been somewhere else and pretty much fried. I deleted the comment and reopened the issue. I will take some time this weekend to look at it and get back to you. I have two cats (who love to nap on my keyboard) and two toddler sons that also seem to love the keyboard so not quite sure what happened there. Sorry for the confusion.

nemidiy commented 4 years ago

hey @codefrey, thanks for the input. I compiled your last two comments here.

@nemidiy The change adding the topic copy was just recently introduced (see #526). The ability to modify the topic buffer is indended by the author of AsyncMqttClient, which is signalled by the topic being char* instead of const char*.

yes. to be honest, I'm not sure why, I might be lacking the context, but why would you allow to modify the topic on the callback that is supposed to handle it ? The only thing I can think of is so the layer on top of the MQTT client (Homie/BootNormal in this case) could split it given the Homie author and the AsyncMqttClient author are the same and that way they could avoid copying it. Anyways, just a guess. In my head that is kind of noisy, since the owner of the topic buffer is AsyncMqttClient after all and not Homie. Maybe this is a better practice for FW to avoid overhead ?

The change added quite some overhead, since every single call to _onMqttMessage() would cause a memory allocation and subsequent free to happen which is imo just pure insanity from a performance standpoint. Also the added complexity will lead to errors like the one in this issue.

I kind of agree with you, so .. yes... agreed about overhead, pure insanity not so sure. I am not a FW developer, this is my first approach, but that copying code is 50 microsecs aprox does not seem like a lot.

Those 50 microseconds add up, especially with larger/splitted payloads such as OTA and take the time from other software components such as the Wi-Fi stack.

Yes, overhead is there and I guess it is about the battle between ownership of things vs overhead. Overhead seems to win here.

Your commit will further add complexity. Your proposed changes could also be simplified by just moving the copying into the index==0 condition right before the call to __splitTopic(), which would have the same effect while at the same time providing performance benefits.

You are right :) , thanks for pointing that out. Although we would also have to replace

__splitTopic(_mqttTopicCopy.get());

by

__splitTopic(topic);

otherwise you would have the same truncating effect that is described in #524 after calling __splitTopic

But why would we have to replace the call to __splitTopic(_mqttTopicCopy.get())? If we replaced it by __splitTopic(topic) then we would get the effect described in #524, because we would start splitting the original message buffer again.

Yes, you would, but what also happens with __splitTopic(_mqttTopicCopy.get()) is that every hanlder after that :

__handleOTAUpdates
__fillPayloadBuffer
__handleBroadcasts ...

will get the truncated version of the topic since you are doing the split on the internal mqttTopicCopy buffer. It works at least for OTA. Not sure for the rest of the handlers. My suggestion was splitting using the *topic but then use the _mqttTopicCopy onwards.

Now to your last question: My proposal to introduce the API would cover the original use case in #524, and therefore eliminate the need to directly access the MQTT client. This would make the changes in #526 unnecessary and spare us of this whole ordeal.

I am still struggling with this, if the callback signature defined by AsyncMqttClient presents topic as a char*, how are you going to be able to not mess around with the buffer you get from AsyncMqttClient if it is not by copying ?

Aside from this topic, the current code seems to assume that AsyncMqttClient calls its message callback functions sequentially for messages with different topics. Is this actually the case?

I am really not sure.

I also found that if you flash though USB and the you try to do OTA you get this :

ERROR[11]: Invalid bootstrapping state, reset ESP8266 before updating
✖ OTA failed (500 INTERNAL_ERROR 11)

if u then reset then it works. Not sure why is it not ok to do OTA after flashing without resetting. Sorry if I am being a pain here but this is helping me a lot. Thanks.

mkfrey commented 4 years ago

@nemidy If we start splitting the topic again, then we would need to reopen #524, which is the sole reason the copying was introduced in the first place. The handlers don't really matter, since they neither use the topic copy nor the original topic. They only need the topic levels.

I am still struggling with this, if the callback signature defined by AsyncMqttClient presents topic as a char*, how are you going to be able to not mess around with the buffer you get from AsyncMqttClient if it is not by copying ?

You would do it by directly modifying the buffer, which is no problem if you know you are the only receiver of the message callbacks. In Homie this could be archieved by hiding the native client and providing an API for custom messages, leaving Homie as a single receiver.

nemidiy commented 4 years ago

thanks @codefrey got it. I'll close the issues as soon as your PR gets merged.

luebbe commented 4 years ago

So shall I merge #645? Originally I wanted to have more eyes on it before merging.

kleini commented 4 years ago

@nemidiy My device is a Sonoff Basic (the old first one with 1M flash). The firmware is the following https://github.com/kleini/sonoff-basic/. What other information do you need to be able to reproduce?

nemidiy commented 4 years ago

@kleini that's good , your platform.ini has a bunch of settings I was not using so I'll try your settings. I don't have a sonoff basic so I am just going to try it out on my nodemcuv2 and let you know.

nemidiy commented 4 years ago

So shall I merge #645? Originally I wanted to have more eyes on it before merging.

From a basic perspective the fix works. Maybe you can wait to merge until i come back with the tests using @kleini's firmware ? up to you, not really sure about the project policies.

kleini commented 4 years ago

@nemidiy What settings are you using in platform.ini?

nemidiy commented 4 years ago

@kleini I tested your repo and OTA worked fine on nodemcuv2. I did comment out the board params:

[env:kleini]
platform = espressif8266@2.3.1

board = nodemcuv2
;board_build.flash_mode = dout
;board_build.ldscript = eagle.flash.1m64.ld

framework = arduino

build_flags =
  -D PIO_FRAMEWORK_ARDUINO_LWIP2_HIGHER_BANDWIDTH
  -D ASYNC_TCP_SSL_ENABLED=1
  -D HOMIE_MDNS=0
  -D HOMIE_CONFIG=0

lib_compat_mode = strict
;lib_ldf_mode = off
;lib_extra_dirs = /home/marcus/source/esp8266/lib
lib_deps =
  https://github.com/bblanchon/ArduinoJson.git#062c1c13b5c3219ab5fc1364781c584a676c8def
  https://github.com/kleini/ESPAsyncTCP#49b497569045d063c817b5c7ae4baa8eaf8313bf
  https://github.com/kleini/async-mqtt-client.git#f1b42054815ad82fed4519bce7febb7f1601560f
  https://github.com/kleini/homie-esp8266.git#35419c27ada5af4f1fc44fb19f339b355149599f

;monitor_speed = 115200
;upload_port = /dev/ttyUSB0

@nemidiy What settings are you using in platform.ini?

This are my settings

[env:nodemcuv2]
platform = espressif8266
board = nodemcuv2
framework = arduino
build_flags =
    ;-D HOMIE_MDNS=0
    -D DEBUG_UPDATER
    -D PIO_FRAMEWORK_ARDUINO_LWIP2_LOW_MEMORY
upload_speed = 115200
lib_ldf_mode = deep
lib_deps =
    ArduinoJson@6.13.0
    /home/nemi/workspace/green-jac/homie-esp8266

I am also setting the gw, static ip, dns1 and mask in the config.

Can you add -D DEBUG_UPDATER and see what you get on the serial output when triggering the OTA ?

luebbe commented 4 years ago

So shall I merge #645? Originally I wanted to have more eyes on it before merging.

From a basic perspective the fix works. Maybe you can wait to merge until i come back with the tests using @kleini's firmware ? up to you, not really sure about the project policies.

Project policy is quite simple. The original Author doesn't maintain it anymore. There's no official maintainer and I'm afraid that the project will die. So I'm trying to moderate without having the big picture of the code...

kleini commented 4 years ago

@nemidiy I think the big difference is, I am using an encrypted MQTT connection, which requires more memory. The flag PIO_FRAMEWORK_ARDUINO_LWIP2_LOW_MEMORY will not work with enabled SSL.

nemidiy commented 4 years ago

@kleini that's probably the issue, none of my tests were done with SSL. I still have not reached that point. I wanted first to have OTA working (it currently is with #645) and then do OTA over SSL. The #645 PR is the first step to hit that goal.

Could you try to OTA update over a non encrypted MQTT connection to see if it works ? Thanks.

nemidiy commented 4 years ago

So shall I merge #645? Originally I wanted to have more eyes on it before merging.

From a basic perspective the fix works. Maybe you can wait to merge until i come back with the tests using @kleini's firmware ? up to you, not really sure about the project policies.

Project policy is quite simple. The original Author doesn't maintain it anymore. There's no official maintainer and I'm afraid that the project will die. So I'm trying to moderate without having the big picture of the code...

Got it @luebbe and thanks for doing this. About the PR #645 IMO at this point I would merge it. The memory bug is there and the fix solves it.

luebbe commented 4 years ago

Merged #645 (Keeping my fingers crossed) :)