marvinroger / async-mqtt-client

📶 An Arduino for ESP8266 asynchronous MQTT client implementation
MIT License

Discussion: #develop branch #219

Closed bertmelis closed 3 years ago

bertmelis commented 3 years ago

I've been doing some work on the https://github.com/marvinroger/async-mqtt-client/tree/develop branch lately.

It should solve lots of the current issues. The main improvement is that outgoing messages are queued and their memory is only freed once the TCP ack allows it.
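The queueing idea can be pictured roughly as follows. This is only a sketch, not the code on the develop branch: `OutQueue`, `push` and `onAck` are hypothetical names, and the real implementation may track acknowledgements differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical illustration: a packet stays in memory until the TCP layer
// has acknowledged all of its bytes.
class OutQueue {
 public:
  // Queue a serialized packet; the bytes are copied and owned by the queue.
  void push(const uint8_t* data, size_t len) {
    _packets.emplace_back(data, data + len);
  }

  // Called with the number of bytes the TCP layer reports as acknowledged.
  // Only fully acknowledged packets are released.
  void onAck(size_t ackedBytes) {
    _acked += ackedBytes;
    while (!_packets.empty() && _acked >= _packets.front().size()) {
      _acked -= _packets.front().size();
      _packets.pop_front();  // memory is freed here, not at send time
    }
  }

  // Number of packets still held in memory.
  size_t pending() const { return _packets.size(); }

 private:
  std::deque<std::vector<uint8_t>> _packets;
  size_t _acked = 0;  // acknowledged bytes not yet matched to a whole packet
};
```

A partially acknowledged packet stays queued, so its buffer remains valid if TCP has to retransmit.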

:exclamation: : although the API remained unchanged, the exclamation mark denotes a change in behaviour.

I'm looking forward to feedback. I do like positive feedback :smile:, but I'm more interested in the negative ones. I still have to go through the list of mandatory statements of the MQTT specification. TLS is also untested but there has been no change in that part of the code.

OldGreyCells commented 3 years ago

I'm not sure why master didn't highlight my embarrassing "Doh!" but this now sorts my dodgy code out:

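    // The payload is NOT null-terminated: copy len bytes and terminate manually.
    // Note: a variable-length array is a GCC extension, not standard C++.
    char payloadc[len + 1];
    memcpy(payloadc, payload, len);  // memcpy is declared in <string.h>
    payloadc[len] = '\0';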

As an aside, is there a way to identify the payload 'type', e.g. whether it is a binary stream (such as an image) or text/JSON?

bertmelis commented 3 years ago

There probably is, but that's a bit out of scope for this lib imho. The MQTT protocol doesn't specify anything about the payload format.

An enhancement could be to create a separate file/class with helper methods. It's somewhere on the to do, but without guarantees nor deadline.
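Since MQTT 3.1.1 carries no payload type, any detection can only be a heuristic on the bytes themselves. A minimal sketch of such a helper, assuming "text" means printable ASCII plus common whitespace; `looksLikeText` is a hypothetical name and not part of the library:

```cpp
#include <cstddef>

// Hypothetical helper: returns true when every byte is printable ASCII
// or tab / line feed / carriage return. An empty payload counts as text.
bool looksLikeText(const char* payload, size_t len) {
  for (size_t i = 0; i < len; ++i) {
    const unsigned char c = static_cast<unsigned char>(payload[i]);
    const bool whitespace = (c == '\t' || c == '\n' || c == '\r');
    const bool printable = (c >= 0x20 && c <= 0x7E);
    if (!whitespace && !printable) return false;  // treat as binary
  }
  return true;
}
```

The check is deliberately strict: UTF-8 text containing non-ASCII characters would be classified as binary, so a real helper would need a proper UTF-8 validator.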

OldGreyCells commented 3 years ago

Thank you for your patience Bert. A comment on the example(s) onMqttMessage() might help. I kinda just looked at char* topic and char* payload in the middle of the night and thought: "They're just regular char arrays" - although I did wonder why the payload wasn't logged. RTFM as bedtime reading is underrated.

bertmelis commented 3 years ago

Will do.

I'm after another bug now:

Exception (9):
epc1=0x4020205e epc2=0x00000000 epc3=0x00000000 excvaddr=0x000002d9 depc=0x00000000

LoadStoreAlignmentCause: Load or store to an unaligned address
  epc1=0x4020205e in AsyncMqttClient::publish(char const*, unsigned char, bool, char const*, unsigned int, bool, unsigned short) at ??:?

OldGreyCells commented 3 years ago

Eeek! That looks way above my pay grade!

proddy commented 3 years ago

I'm getting regular crashes too on the ESP32 so getting my esp-prog debugger out to find where it's bombing.

Pablo2048 commented 3 years ago

Same problem here - I've just replaced "old" working develop with the new one, and got periodic:

Exception (9):
epc1=0x40225596 epc2=0x00000000 epc3=0x00000000 excvaddr=0x000006bf depc=0x00000000

LoadStoreAlignmentCause: Load or store to an unaligned address
  epc1=0x40225596 in AsyncMqttClient::publish(char const*, unsigned char, bool, char const*, unsigned int, bool, unsigned short) at lib/async-mqtt-client-develop/src/AsyncMqttClient.cpp:744

crashes...

luebbe commented 3 years ago

Wouldn't it make sense to convert all the char payloads into byte to avoid the confusion that we are dealing with strings?

bertmelis commented 3 years ago

> Wouldn't it make sense to convert all the char payloads into byte to avoid the confusion that we are dealing with strings?

That would indeed make sense.

bertmelis commented 3 years ago

> Same problem here - I've just replaced "old" working develop with the new one, and got periodic:
>
> Exception (9):
> epc1=0x40225596 epc2=0x00000000 epc3=0x00000000 excvaddr=0x000006bf depc=0x00000000
>
> LoadStoreAlignmentCause: Load or store to an unaligned address
>   epc1=0x40225596 in AsyncMqttClient::publish(char const*, unsigned char, bool, char const*, unsigned int, bool, unsigned short) at lib/async-mqtt-client-develop/src/AsyncMqttClient.cpp:744
>
> crashes...

Thank you. On my side the line number isn't displayed. Furthermore, when I enable logging, the error is gone.

Now, if the line number is correct, it is on the packetId() call. It's a virtual call which returns the value _packetId when the pointer refers to a Publish packet, or zero when the method isn't overridden. Mind that the base class does not have a member variable _packetId. Am I doing something wrong here?
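The pattern described here would look roughly like this. The class names are hypothetical and the real packet classes in the library may be organised differently; the sketch only shows the virtual-call mechanism.

```cpp
#include <cstdint>

// Hypothetical class names, for illustration only.
class OutPacket {
 public:
  virtual ~OutPacket() = default;
  // Base implementation: packets without an id report 0.
  virtual uint16_t packetId() const { return 0; }
};

class PublishPacket : public OutPacket {
 public:
  explicit PublishPacket(uint16_t id) : _packetId(id) {}
  // Overrides the base method and returns the stored id.
  uint16_t packetId() const override { return _packetId; }

 private:
  uint16_t _packetId;  // exists only in the derived class
};

// Does not override packetId(), so it inherits the base version returning 0.
class PingPacket : public OutPacket {};
```

On its own this design is fine. The call is only safe, however, while the pointer refers to a live, fully constructed object.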

Pablo2048 commented 3 years ago

Well the complete stack trace is:

0x40225548 in AsyncMqttClient::publish(char const*, unsigned char, bool, char const*, unsigned int, bool, unsigned short) at lib/async-mqtt-client-develop/src/AsyncMqttClient.cpp:744
0x40260b90 in mqtt_if_output(netif*, pbuf*, ip4_addr const*) at lib/MQTTVPN/src/MQTTVPN.cpp:72
0x4025a69d in cnx_update_bss_more at ??:?
0x40260bc4 in mqtt_if_output(netif*, pbuf*, ip4_addr const*) at lib/MQTTVPN/src/MQTTVPN.cpp:78
0x40100e3f in umm_free_core at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/umm_malloc/umm_malloc.cpp:351
0x40101108 in free at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/umm_malloc/umm_malloc.cpp:398
0x402501d4 in mem_free at /local/users/gauchard/arduino/arduino_esp8266/esp8266-lwip/tools/sdk/lwip2/builder/lwip2-src/src/core/mem.c:237
0x40254459 in ieee80211_parse_beacon at ??:?
0x4024f6a8 in ip4_output_if_opt_src at /local/users/gauchard/arduino/arduino_esp8266/esp8266-lwip/tools/sdk/lwip2/builder/lwip2-src/src/core/ipv4/ip4.c:1764
0x40258de7 in sta_input at ??:?
0x402393ee in AsyncClient::_s_recv(void*, tcp_pcb*, pbuf*, long) at .pio/libdeps/esp12e/ESPAsyncTCP/src/ESPAsyncTCP.cpp:741
0x4024f6f0 in ip4_output_if_opt at /local/users/gauchard/arduino/arduino_esp8266/esp8266-lwip/tools/sdk/lwip2/builder/lwip2-src/src/core/ipv4/ip4.c:1577
0x402501d4 in mem_free at /local/users/gauchard/arduino/arduino_esp8266/esp8266-lwip/tools/sdk/lwip2/builder/lwip2-src/src/core/mem.c:237
0x4024f716 in ip4_output_if at /local/users/gauchard/arduino/arduino_esp8266/esp8266-lwip/tools/sdk/lwip2/builder/lwip2-src/src/core/ipv4/ip4.c:1550
0x4024fd1d in icmp_input at /local/users/gauchard/arduino/arduino_esp8266/esp8266-lwip/tools/sdk/lwip2/builder/lwip2-src/src/core/ipv4/icmp.c:291
0x40100e3f in umm_free_core at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/umm_malloc/umm_malloc.cpp:351
0x4024f451 in ip4_input at /local/users/gauchard/arduino/arduino_esp8266/esp8266-lwip/tools/sdk/lwip2/builder/lwip2-src/src/core/ipv4/ip4.c:1473
0x40246090 in esp2glue_ethernet_input at /local/users/gauchard/arduino/arduino_esp8266/esp8266-lwip/tools/sdk/lwip2/builder/glue-lwip/lwip-git.c:469
0x4026cb33 in ethernet_input at /local/users/gauchard/arduino/arduino_esp8266/esp8266-lwip/tools/sdk/lwip2/builder/glue-esp/lwip-esp.c:373
0x40260b25 in mqtt_if_Task(ETSEventTag*) at lib/MQTTVPN/src/MQTTVPN.cpp:182
0x40105029 in call_user_start_local at ??:?
0x4010502f in call_user_start_local at ??:?
0x4010000d in call_user_start at ??:?
0x401007c8 in ets_post at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:177
0x40101143 in malloc at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/umm_malloc/umm_malloc.cpp:552
0x4022ddeb in esp8266::MDNSImplementation::MDNSResponder::stcMDNSSendParameter::addDomainCacheItem(void const*, bool, unsigned short) at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/libraries/ESP8266mDNS/src/LEAmDNS_Structs.cpp:2441 (discriminator 1)
0x40235300 in operator new(unsigned int) at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/abi.cpp:39
0x4022ddeb in esp8266::MDNSImplementation::MDNSResponder::stcMDNSSendParameter::addDomainCacheItem(void const*, bool, unsigned short) at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/libraries/ESP8266mDNS/src/LEAmDNS_Structs.cpp:2441 (discriminator 1)
0x4022eb8d in esp8266::MDNSImplementation::MDNSResponder::_writeMDNSRRDomain(esp8266::MDNSImplementation::MDNSResponder::stcMDNS_RRDomain const&, esp8266::MDNSImplementation::MDNSResponder::stcMDNSSendParameter&) at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/libraries/ESP8266mDNS/src/LEAmDNS_Transfer.cpp:1342
0x4022ed71 in esp8266::MDNSImplementation::MDNSResponder::_writeMDNSServiceDomain(esp8266::MDNSImplementation::MDNSResponder::stcMDNSService const&, bool, bool, esp8266::MDNSImplementation::MDNSResponder::stcMDNSSendParameter&) at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/libraries/ESP8266mDNS/src/LEAmDNS_Transfer.cpp:1431
0x4022e882 in esp8266::MDNSImplementation::MDNSResponder::_udpAppend8(unsigned char) at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/libraries/ESP8266mDNS/src/LEAmDNS_Transfer.cpp:1112
0x4022e9e9 in esp8266::MDNSImplementation::MDNSResponder::_write8(unsigned char, esp8266::MDNSImplementation::MDNSResponder::stcMDNSSendParameter&) at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/libraries/ESP8266mDNS/src/LEAmDNS_Transfer.cpp:1254
0x4022ed20 in esp8266::MDNSImplementation::MDNSResponder::_writeMDNSServiceDomain(esp8266::MDNSImplementation::MDNSResponder::stcMDNSService const&, bool, bool, esp8266::MDNSImplementation::MDNSResponder::stcMDNSSendParameter&) at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/libraries/ESP8266mDNS/src/LEAmDNS_Transfer.cpp:1425
0x40101143 in malloc at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/umm_malloc/umm_malloc.cpp:552
0x402501b4 in mem_malloc at /local/users/gauchard/arduino/arduino_esp8266/esp8266-lwip/tools/sdk/lwip2/builder/lwip2-src/src/core/mem.c:210
0x4022e882 in esp8266::MDNSImplementation::MDNSResponder::_udpAppend8(unsigned char) at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/libraries/ESP8266mDNS/src/LEAmDNS_Transfer.cpp:1112
0x402633f7 in pp_attach at ??:?
0x40263446 in pp_attach at ??:?
0x401007c8 in ets_post at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:177
0x401057c9 in ets_timer_disarm at ??:?
0x401007c8 in ets_post at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:177
0x401027bc in rcReachRetryLimit at ??:?
0x401057c9 in ets_timer_disarm at ??:?
0x40103d29 in lmacIsIdle at ??:?
0x40103d29 in lmacIsIdle at ??:?
0x401007c8 in ets_post at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:177
0x401007c8 in ets_post at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:177
0x40101c6c in pp_post at ??:?
0x40104f23 in lmacRxDone at ??:?
0x40101c6c in pp_post at ??:?
0x40102807 in rcReachRetryLimit at ??:?
0x40101c6c in pp_post at ??:?
0x401007c8 in ets_post at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:177
0x401007c8 in ets_post at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:177
0x40104f0b in lmacTxFrame at ??:?
0x40101c6c in pp_post at ??:?
0x40104f23 in lmacRxDone at ??:?
0x40102807 in rcReachRetryLimit at ??:?
0x401029e8 in rcReachRetryLimit at ??:?
0x401029e8 in rcReachRetryLimit at ??:?
0x40102eaa in wDev_ProcessFiq at ??:?
0x401007c8 in ets_post at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:177
0x40101c6c in pp_post at ??:?
0x40104f23 in lmacRxDone at ??:?
0x40102807 in rcReachRetryLimit at ??:?
0x401029e8 in rcReachRetryLimit at ??:?
0x40102eaa in wDev_ProcessFiq at ??:?
0x40102bcc in wDev_ProcessFiq at ??:?
0x401007c8 in ets_post at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:177
0x401008c2 in millis at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_wiring.cpp:188 (discriminator 3)
0x40100e3f in umm_free_core at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/umm_malloc/umm_malloc.cpp:351
0x401008c2 in millis at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_wiring.cpp:188 (discriminator 3)
0x401007c8 in ets_post at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:177
0x401007e9 in esp_schedule at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:125
0x40235740 in loop_wrapper() at /home/pablo2048/.platformio/packages/framework-arduinoespressif8266/cores/esp8266/core_esp8266_main.cpp:19

but - because it's from my MQTTVPN I don't think that it will help...

bertmelis commented 3 years ago

Got it. I thought I could be smart but I am not. Will push in a sec.

Strangely, on my stack trace, the line numbers don't show up. It could have helped me A LOT.

proddy commented 3 years ago

> Strangely, on my stack trace, the line numbers don't show up. It could have helped me A LOT.

You need to compile with -g or set build_type = debug in platformio.ini.

Pablo2048 commented 3 years ago

So with the new develop the crash is gone. Unfortunately it disconnects from the broker every ~24.9 seconds. [Screenshot from 2021-02-25 17-14-51] Just to make it clear: this message is sent in the onConnect callback.

bertmelis commented 3 years ago

Your keepalive is probably the default, 15 secs?

Pablo2048 commented 3 years ago

Yes, I didn't modify it.
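For context: MQTT 3.1.1 lets the broker close the connection when it receives nothing from the client within one and a half times the keep-alive interval. A small sketch of that arithmetic; the function names are hypothetical and this is not the library's code:

```cpp
#include <cstdint>

// MQTT 3.1.1: the server may drop a client it has not heard from within
// 1.5 times the keep-alive interval.
constexpr uint32_t brokerDeadlineMs(uint16_t keepAliveSeconds) {
  return static_cast<uint32_t>(keepAliveSeconds) * 1500u;
}

// Hypothetical check: has a full keep-alive interval passed since the last
// packet was sent, so that a PINGREQ is due? Unsigned subtraction keeps the
// comparison valid when a millis()-style counter wraps around.
constexpr bool pingDue(uint32_t nowMs, uint32_t lastSentMs,
                       uint16_t keepAliveSeconds) {
  return nowMs - lastSentMs >=
         static_cast<uint32_t>(keepAliveSeconds) * 1000u;
}
```

With a 15 second keepalive the broker deadline is 22.5 seconds, which is in the same ballpark as the interval reported above, although the thread does not confirm that a missed ping was the cause.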

Pablo2048 commented 3 years ago

The reason reported in onDisconnect callback is always 0.

Pablo2048 commented 3 years ago

I do have some other observations. In the onConnect callback I publish to some topics like this:

    String info;

    info.reserve(512);
    info = F("{\"hw\":\"");
    info.concat(F(HW_NAME "-" HW_VARIANT));
    info.concat(F("\",\"app\":\""));
    info.concat(F(APP_NAME));
    info.concat(F("\",\"appversion\":\""));
    info.concat(F(APP_VERSION));
    info.concat(F("\",\"ip\":\""));
    IPAddress ip = mqtt_if->netif.ip_addr;
    info.concat(ip.toString().c_str());
    info.concat(F("\"}"));
    sprintf_P(buff, PSTR("%sinfo"), mqtt_if->service);
    mqtt_if->mqttcl.publish(buff, 0, true, info.c_str());
    sprintf_P(buff, PSTR("%sresetreason"), mqtt_if->service);
    mqtt_if->mqttcl.publish(buff, 0, true, ESP.getResetReason().c_str());
    mqtt_if->mqttcl.publish(mqtt_if->will, 1, true, online);

Publishing to the /info and /resetreason topics seems to be OK (QoS 0), but publishing to the mqtt_if->will topic (it's my last will and testament topic) with QoS 1 NEVER succeeds. MQTT Explorer just logs periodic '0' writes to the last will topic, which probably come from the broker itself. [Screenshot from 2021-02-25 17-36-19]

bertmelis commented 3 years ago

what is this online variable?

Pablo2048 commented 3 years ago

Just character string:

static const char online[] = "1";
static const char offline[] = "0";

bertmelis commented 3 years ago

Can't reproduce unfortunately. Is your code public?

Pablo2048 commented 3 years ago

Unfortunately not yet :-( . Maybe I can publish it at the weekend (at least some version...).

bertmelis commented 3 years ago

I messed up again with the packetId. Apparently you can have the same member variable in a derived class as in a base class without any compiler warning?
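Yes, that is legal C++: a member declared in a derived class hides the base member of the same name, and the two are separate variables. Some compilers can flag it (Clang has `-Wshadow-field`), but it is not diagnosed by default. A minimal sketch with hypothetical struct names:

```cpp
#include <cstdint>

struct Base {
  uint16_t _packetId = 0;
  // Reads Base::_packetId, never the derived copy.
  uint16_t baseId() const { return _packetId; }
};

struct Derived : Base {
  uint16_t _packetId = 0;  // hides Base::_packetId; compiles cleanly
  // Writes Derived::_packetId only; Base::_packetId stays untouched.
  void setId(uint16_t id) { _packetId = id; }
};
```

Code in the base class keeps seeing its own, never-updated copy, which is how an id can silently read as zero.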

Pablo2048 commented 3 years ago

I didn't notice that - I'll check through the code. Fortunately the last push did the trick - everything seems to be working again. Including the web access - didn't test the OTA so far. Edit: So the OTA is also working... Thank You.

bertmelis commented 3 years ago

Strangely enough the packetId errors didn't show up before I simplified the code. The error was there though.

Anyway, done for this week. Gotta do some other things now. Next week I'll be trying to speed things up by having more messages qos>0 in-flight.

proddy commented 3 years ago

> I'm getting regular crashes too on the ESP32 so getting my esp-prog debugger out to find where it's bombing.

Got it all working. I had to switch to a different ESP32 partition configuration. Anyway, latest develop has been running flawlessly on one of my projects for the last few days now with around 50K of topics per day on QOS0. Tested large payloads too. Works wonderfully, nice work Bert!

bertmelis commented 3 years ago

Thanks. It's not over yet. The to do:

When that's done, I propose to merge the improvements to master, adjust the documentation and release a new version.

proddy commented 3 years ago

Nice. There's probably a few more things we can iron out too. I was looking at Phil's comments on AsyncMQTTClient that triggered him to create Pangolin. They are still valid concerns and it would be good to address them and mark them down as 'enhancements' in the GitHub so at least we have each feature logged.

bertmelis commented 3 years ago

Do you happen to "know" which remarks are left? I'll also dig up his old rant and see what's left (I remember some of his remarks were as cryptic as the code in his mqtt client).

proddy commented 3 years ago

Most of the original remarks were removed but you can trace it back here, hidden in the now removed files. Short summary:

It's a shame for Marvin's library, which is widely used and works fine for most people's needs, to get slammed. The whole point of open source is to collaborate and improve each other's code, instead of spinning off a competitive project and asking people to donate $.

bertmelis commented 3 years ago

  1. ~Spontaneous DCX/CNX.~ Done
  2. ~Will Topic bug Prevents sketch from starting with non-static input.~ Goes into docs, alternatives?
  3. ~Bad Subscribe Invalid topic causes DCX/CNX~ I'm not going to implement input validation
  4. ~No topic validation for subscribe (see above)~
  5. Discarded messages 1: what's this 1?
  6. Discarded messages 2: what's this 2?
  7. Total Message Loss: what's this 3?
  8. No error handling callback: TODO
  9. ~"Killer Packet" inbound~ Stays as it is. Payload sizes > RAM need to be enabled for OTA over MQTT
  10. ~QoS1 Protocol Violation~ Done
  11. Fragment Failure: TCP does the reassembling, right? So what's this 5?
  12. Numerous API errors - sufficient for their own document: Seriously? What's this 6?
  13. QoS 1/2 protocol violation - no message resend: TODO, implement retry together with inflight > 1
  14. ~QoS 1/2 protocol violation - no session recovery~ Done
  15. ~QoS 1 protocol violation - breach of delivery promise~ Done
  16. ~QoS 2 protocol violation - breach of delivery promise~ Done

My extra thoughts:

mcspr commented 3 years ago

> Fragment Failure TCP does the reassembling, right? So what's this 5?

I wonder if this is related to this sort of issue? https://github.com/xoseperez/espurna/issues/2166#issuecomment-596126080 Single packet payload is allowed to be spread across multiple onmessage calls. Meaning, user needs to track 'total' and 'length', so idk if this is really a lib issue. Maybe another possible helper / example.
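Tracking the chunks could be sketched like this, using the `len`, `index` and `total` values that the library's onMessage callback passes along with the payload. `PayloadAssembler` is a hypothetical helper and not part of the library; it assumes the chunks of one message arrive in order and are not interleaved with another message.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper that collects the chunks of one message.
class PayloadAssembler {
 public:
  // Feed one chunk; returns true once `total` bytes have been collected.
  bool feed(const char* payload, size_t len, size_t index, size_t total) {
    if (index == 0) {  // first chunk of a new message
      _buffer.clear();
      _buffer.reserve(total);
    }
    if (index != _buffer.size()) {  // unexpected offset: drop the partial data
      _buffer.clear();
      return false;
    }
    _buffer.append(payload, len);
    return _buffer.size() == total;
  }

  // The bytes collected so far (the full payload once feed() returned true).
  const std::string& data() const { return _buffer; }

 private:
  std::string _buffer;
};
```

Buffering the whole payload only works when it fits in RAM. For larger transfers each chunk would have to be processed as it arrives.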

Other buffer / discard points seem to be related to add() / send() usage, since it is possible to get stuck when network buffers are full and 'lose' outgoing messages

Thx for fixing keepalive btw :) (will try asap with some more load, only tried some basic services for like 10minutes)

bertmelis commented 3 years ago

> Fragment Failure TCP does the reassembling, right? So what's this 5?
>
> I wonder if this is related to this sort of issue? xoseperez/espurna#2166 (comment) Single packet payload is allowed to be spread across multiple onmessage calls. Meaning, user needs to track 'total' and 'length', so idk if this is really a lib issue. Maybe another possible helper / example.

Something like this: https://github.com/marvinroger/async-mqtt-client/issues/234

bertmelis commented 3 years ago

To be clear, this is how the develop branch works:

What I'm thinking of now is

bertmelis commented 3 years ago

Just a sign of life. I'm still alive 😂

I'm just very low on computer time.

epiller commented 3 years ago

Sorry that I might be spamming as I have nothing to add, but I just wanted to give a huge thanks for the initiative to continue this project as it is the best one out there. Kudos @bertmelis, kudos.

bertmelis commented 3 years ago

Merged and released. Closing.