marvinroger / async-mqtt-client

📶 An Arduino for ESP8266 asynchronous MQTT client implementation
MIT License
834 stars 266 forks source link

Discussion: #develop branch #219

Closed bertmelis closed 3 years ago

bertmelis commented 3 years ago

I'm been doing some work on the https://github.com/marvinroger/async-mqtt-client/tree/develop branch lately.

It should solve lots of the current issues. The main improvement is that outgoing messages are queued and the memory is only freed after the TCP ack allowes to do so.

:exclamation: : although the API remained unchanged, the exlamation mark denotes a change in behaviour.

I'm looking forward to feedback. I do like positive feedback :smile:, but I'm more interested in the negative ones. I still have to go through the list of mandatory statements of the MQTT specification. TLS is also untested but there has been no change in that part of the code.

bertmelis commented 3 years ago

A sample sketch with the development code:

RAM:   [====      ]  36.5% (used 29936 bytes from 81920 bytes)
Flash: [===       ]  28.1% (used 292984 bytes from 1044464 bytes)

Master branch:

RAM:   [====      ]  36.7% (used 30104 bytes from 81920 bytes)
Flash: [===       ]  27.6% (used 288428 bytes from 1044464 bytes)

A similar library whose owner is quite patronizing (API differs a bit, so sketch is not 100% the same):

RAM:   [====      ]  36.5% (used 29880 bytes from 81920 bytes)
Flash: [===       ]  28.2% (used 294636 bytes from 1044464 bytes)
luebbe commented 3 years ago

I added https://github.com/marvinroger/async-mqtt-client.git#develop to the platformio.ini of my https://github.com/luebbe/homie-node-collection compiled and uploaded the sensor-nodes demo to a d1 mini.

This one contains a lot of nodes and many mqtt messages (84 topics) have to be sent upon startup. I'm running mosquitto on a raspberry Pi 1 and MQTT Explorer on Windows for monitoring.

The only notable difference when the device is reset is that it seems to take longer until all the initial messages have been received by MQTT explorer. I guess this is to be expected since they are queued up. Perhaps before messages were lost when the buffer was full.

I have also added "freeheap" to $stats in my fork of homie-esp2866, so I can monitor if anything runs away there.

The graph looks like this (around 15 minutes of history) grafik

There is a significant drop directly after the start (expected) and then it is stable at around 30KB. You can also see the drop at the five minute intervals when sensor data is sent.

It is also interesting to see that when the $stats are sent, they now trickle in over the course of 1-3 seconds. With the previous version they came in the same second.

bertmelis commented 3 years ago

In which qos are you sending?

One notable difference in sending qos > 0 is that "the next message" is only sent when the previous has been acknowledged. I didn't want the logic to be too complicated to ensure proper message ordering (as per mqtt spec). In qos 0 this should not matter, but I might be wrong. I changed the logic a few times.

luebbe commented 3 years ago

Homie default. I didn't change anything. MQTT Explorer says qos 0?

bertmelis commented 3 years ago

Homie default. I didn't change anything. MQTT Explorer says qos 0?

qos subscribing might differ from qos publishing.

Stats are sent qos 1 (https://github.com/homieiot/homie-esp8266/blob/9cd83972f27b394eab8a5e3e2baef20eea1b5408/src/Homie/Boot/BootNormal.cpp#L180). So yes, this is slower than before.

luebbe commented 3 years ago

The homie convention states https://homieiot.github.io/specification/#qos-and-retained-messages that it uses qos 1, but I have to check the source to confirm. I don't know from where MQTT Explorer gets the qos 0 info.

luebbe commented 3 years ago

qos subscribing might differ from qos publishing.

guess so

Stats are sent qos 1 (https://github.com/homieiot/homie-esp8266/blob/9cd83972f27b394eab8a5e3e2baef20eea1b5408/src/Homie/Boot/BootNormal.cpp#L180). So yes, this is slower than before.

Fine by me. Better slow & certain than quick & dirty

I'll leave my dev board and the monitor running at least until I go to sleep, but I'm already pretty confident that there's no memory leak.

bertmelis commented 3 years ago

I'll anyhow add speed up to the wishlist.

luebbe commented 3 years ago

Just FYI, the sensor-demo has 84 subtopics. This is the second largest amount of my homie based applications.

kleini commented 3 years ago

For me this results in a crash. Please see attached serial output from my relays-device. I am using SSL encrypted connection. crash.log

bertmelis commented 3 years ago

Do you run your ESP on 160MHz? A soft watchdog reset on the sys stack during connecting is probably caused by the handshake taking too much time. Off course I can be wrong.

Edit: for the record, I am setting up TLS on my broker to test myself.

kleini commented 3 years ago

It is a standard ESP12-F. This time with decoded crash: crash.log

bertmelis commented 3 years ago

gotta look into that. But still, it's a watchdog reset, so something is taking too long. I workaround could be to not send immediately after connecting but on the first onPoll. You could try with commenting out line 382. I don't know how long a broker waits to receive a CONNECT before disconnecting.

kleini commented 3 years ago

I think, the connection to the MQTT broker does not work at all:

mosquitto    | 1612816096: New connection from 192.168.168.196 on port 1883.
mosquitto    | 1612816099: Client <unknown> has exceeded timeout, disconnecting.
mosquitto    | 1612816099: Socket error on client <unknown>, disconnecting.

And with some debug enabled, I see

✔ Wi-Fi connected, IP: 192.168.168.196
Triggering WIFI_CONNECTED event...
↕ Attempting to connect to MQTT...
CONNECTING
TCP conn, MQTT CONNECT
new front #1
snd #1: 165/100

--------------- CUT HERE FOR EXCEPTION DECODER ---------------

Soft WDT reset

So, is somehow the connect string not sent completely? Will try to sniff network traffic.

bertmelis commented 3 years ago

It's been a while since I've used TLS on the esp8266. How did you solve this error:

ESPAsyncTCP/src/ESPAsyncTCP.cpp:1325:31: error: no matching function for call to 'AsyncClient::_recv(tcp_pcb*&, pbuf*&, int)'           
         c->_recv(pcb, p->pb, 0);  
luebbe commented 3 years ago

Still running fine after 3,5 hours. FreeHeap still at 30k, goes down to 29,4k .. 29,6k when values are sent and always goes back to the same max value. No SSL.

kleini commented 3 years ago

It's been a while since I've used TLS on the esp8266. How did you solve this error:

ESPAsyncTCP/src/ESPAsyncTCP.cpp:1325:31: error: no matching function for call to 'AsyncClient::_recv(tcp_pcb*&, pbuf*&, int)'           
         c->_recv(pcb, p->pb, 0);  

https://github.com/me-no-dev/ESPAsyncTCP/pull/151 <- that requires a fix from @mhightower83, which is not merged for a 3/4 year now.

Sorry, enough for me for today.

bertmelis commented 3 years ago

Shoot me, with logging enable it gives this output:

CONNECTING              <-- connect call by the user
TCP conn, MQTT CONNECT  <-- TCP is connected (TLS handshake done)
new front #1            <-- CONNECT packet is added to the queue
snd #1: 101/40          <-- TCP sends 101 bytes whereas the CONNECT was only 40.

So the TCP lib is not returning the actual written payload when using encryption???

kleini commented 3 years ago

Should TCP library normally return exact the amount of bytes, that have been sent? I saw the similar behaviour some comments above.

bertmelis commented 3 years ago

It does, but only the non-tls. I guess it returns the encrypted TCP payload size. If you ask me, this a bad API design.

bertmelis commented 3 years ago

@luebbe To speed up Homie you can use set_tcp_nodelay to true if you use mosquitto as broker. But I added "inflight messages > 1" to the TODO.

bertmelis commented 3 years ago

I'm thinking about a workaround for TLS. It wont be pretty as it is impossible to know how much data actually has been sent: We can wait to send for the entire TCP buffer to be empty. Then copy a fixed amount to LWIP and "ack" our own data.

EDIT: Thinking about it... I might keep 2 sets of counters: one from tcp and one of our own and keep them in sync. I will work it out in my head while clearing the snow from the driveway.

bertmelis commented 3 years ago

I've added the hacky solution for TLS. I've tested it with the example in the repo + added a 2000 byte message for chuncked sending.

Did it like described above.

kleini commented 3 years ago

It works! works.log Reading your pull request, I see the _acked is already increased when sending the TLS data as it is unclear how log the data will be. The returned written length seems to be somewhere deep inside axTLS of ESPAsyncTCP. Pretty hard to just read and understand that code.

bertmelis commented 3 years ago

Yes, there is room for improvement. I didn't want to touch the existing code too much. _acked is keeping track of the acked data in the mqtt client. But since we're copying the data to TCP, we can already consider it acked. On non-tls versions, _acked is only incremented when the real zck comes in.

I also realized I can use client->space() instead of the fixed number.

bertmelis commented 3 years ago

I'm going to give it a rest for a week or two (covid restrictions a eased a bit, whoohoo). I'll keep an eye on the issue and PR tracker and read whatever is going on. I'll keep supporting!

Meanwhile I'll think about possible improvements:

kleini commented 3 years ago

Awesome work so far, I must say! Thank you very much @bertmelis !

luebbe commented 3 years ago

I'll go through old issues backwards and check if I can reproduce them. If not I'll tag them with "to be closed" and we can review together. Won't test TLS related stuff though. I didn't want to jump through that hoop yet.

kleini commented 3 years ago

@luebbe I can do the TLS related stuff.

bertmelis commented 3 years ago

I can do the TLS related stuff.

There might be a serious issue with TLS though. The last thing I was testing was to destroy the client. On non-TLS is was not problem. On TLS it crashed. Now because of a bug in the ESPAsyncTCP library, I had to have the tcp client on the heap to be able to destroy and recreate on each (re)connect. Should destroying the mqttclient's crash be related to the destruction of the tcp client, it will also crash when the mqtt client disconnects. I didn't test this.

It is solveable because the bug is related to acking data and we bypass this when using TLS. However, it involves a lot of preprocesser stuf...

bertmelis commented 3 years ago

Somebody pointed me to the fact that on both ESP8266 and ESP32, data is copied to LWIP instead of the data being referenced. So there never were dangling pointers. This means I can simplify the develop code as I don't need to keep data only for acking (It is still needed for retries etc).

see: https://git.savannah.nongnu.org/cgit/lwip.git/tree/src/core/tcp_out.c?h=STABLE-2_1_2_RELEASE#n422 https://github.com/d-a-v/esp82xx-nonos-linklayer/blob/7421258237b7c8f61629226961af498a0a6e0096/glue-lwip/arduino/lwipopts.h#L1665 https://github.com/espressif/esp-idf/blob/6e776946d01ec0d081d09000c36d23ec1d318c06/components/lwip/port/esp32/include/lwipopts.h#L449 The COPY flag is set by default with the configuration LWIP is built for ESP8266 and ESP32 and cannot be changed. If we would keep the data, it will in fact be kept twice.

Hence, I'll simplify the code and keep only the data that has to be acked by the MQTT protocol.

luebbe commented 3 years ago

Thanks @bertmelis

My test rig runs stable with develop, but there's no heavy load.

Two weeks ago I wrote:

I'll go through old issues backwards and check if I can reproduce them. If not I'll tag them with "to be closed" and we can review together.

I did this and tagged a handful of tickets. If someone could review these: https://github.com/marvinroger/async-mqtt-client/labels/to%20be%20closed and close where appropriate?

bertmelis commented 3 years ago

:+1: I'm currently trying to implement the "don't wait for confirmation before sending the next". This does require some next level thinking. Will take a few days depending on how much time I can spend.

kleini commented 3 years ago

For me everything works as well. Even OTA over MQTT over TLS improved and works better now.

bertmelis commented 3 years ago

lovely to hear :smile:

Pablo2048 commented 3 years ago

I also want to report - I have develop branch in 5 devices in two different locations, all of them connected to my MQTT VPN. So far so good - uptime is about 4 days now. VPN network interface statistics on my PC is: Snímek z 2021-02-24 09-18-03 Everything seems to be working smoothly - OTA from PIO, OTA via web page, web page access, ping, ... So I can say great job and big thank You...

OldGreyCells commented 3 years ago

Hi Bert Thanks for doing this. Only issue I've found so far is the char* payload onMessage is often over length. eg 43 chars instead of 32:

Received MQTT topic: homeassistant/localtime 
payload* length: 43 
2021-02-24T20:28:20.236189+00:0�?w␂d␂�� @

This is from the master branch:

Received MQTT topic: homeassistant/localtime 
payload* length: 32 
2021-02-24T20:33:05.467993+00:00

All the best

bertmelis commented 3 years ago

Strange... I didn't touch that part of the code (not that I can remember, that is :wink:)

However, I did spot some other bugs. On disconnect with a session present, it might block and crash by the watchdog... Fixing it as we speak.

I will look into your issue also!

OldGreyCells commented 3 years ago

Thank you Bert - apologies for no test script (recovering from Covid, so energy levels are low) but it is a very trivial MQTT payload that doesn't change between async-mqtt-client builds. I'm only publishing and subscribing, so not testing all the features!

bertmelis commented 3 years ago

no worries, although it would be easier if you could test again with logging enabled. A question though: it is always over length or only sometimes?

OldGreyCells commented 3 years ago

It appears to always be the same over length (for this particular payload). Forgive my laziness but how do I enable logging? I see log_i and log_w but how to turn it on?

bertmelis commented 3 years ago

Do you use Arduino IDE or something else (VSCode/Platformio)?

OldGreyCells commented 3 years ago

I'm using platformio for the first time via VSCode on Ubuntu.

bertmelis commented 3 years ago

my platformio.ini:

[env:esp8266]
platform = espressif8266
board = d1_mini
build_flags =
  -Tesp8266.flash.4m1m.ld
  -D PIO_FRAMEWORK_ARDUINO_LWIP2_HIGHER_BANDWIDTH
  ;-D ASYNC_TCP_SSL_ENABLED
  -D DEBUG_ESP_PORT=Serial
  -D DEBUG_ASYNC_MQTT_CLIENT
monitor_filters = esp8266_exception_decoder
monitor_speed = 115200

Debug output goes to Serial.

proddy commented 3 years ago

FYI I've started testing on the ESP32 - will report back any issues here

OldGreyCells commented 3 years ago

my platformio.ini: Thank you.

OldGreyCells commented 3 years ago

I hope this helps (have deleted some chars as they were the bulk of the mqtt broker password!) A different wrong length with this build.

Subscribing to: homeassistant/localtime... SUBSCRIBE
new back #8
snd #8: 30/30
success
PUBLISH
new back #3
ack 30 (30)
data rcv (5)
rcv SUBACK
SUBACK
SUB released
rmv #8
snd #3: 35/35
p #3 rel
ack 35 (35)
data rcv (59)
rcv PUBLISH
Received MQTT topic: homeassistant/localtime 
payload* length: 61 
str::payload length: 61 
2021-02-24T21:56:09.727588+00:00alt����reqũ�� <!!!part of mqtt password!!!>
setLocaltime(): 2021-02-24T21:56:09.727588+00:00alt����reqũ�� <!!!part of mqtt password!!!> len: 61 
 pm open,type:2 0
PING
new back #12
snd #12: 2/2
bertmelis commented 3 years ago

I updated the develop code again to simplify and iron out some bugs.

@OldGreyCells You receive data with a total length 59 (that is: including MQTT headers), how do you manage to get a payload of 61 out of it. Could you share your callback code? Mind that the payload is binary and may not include a 0-terminator for c-strings. So you cannot do strlen(payload).

luebbe commented 3 years ago

Mind that the payload is binary and may not include a 0-terminator for c-strings. So you cannot do strlen(payload).

Exactly my thoughts.

OldGreyCells commented 3 years ago

I updated the develop code again to simplify and iron out some bugs.

@OldGreyCells You receive data with a total length 59 (that is: including MQTT headers), how do you manage to get a payload of 61 out of it. Could you share your callback code? Mind that the payload is binary and may not include a 0-terminator for c-strings. So you cannot do strlen(payload).

Aha! That may well be the cause - I am doing a strlen(payload). My apologies if I've wasted your time - I've been a little* sub-par.

*understatement - this is my 'distraction'.