martin-ger / MQTT_VPN

IP over MQTT for ESP controllers and Linux
136 stars 28 forks source link

Add ESP32 mqtt_if_add_reading_topic() and fix pbuf copy issue #16

Closed bwjohns4 closed 1 year ago

bwjohns4 commented 1 year ago

I wanted to be able to setup napt similar to the Arduino ESP8266 example so I added the same mqtt_if_add_reading_topic() concepts to the ESP32 file. I also found an error that when trying it without the encryption (for speed) it did not work at all because of this: pub_data->data_len = pbuf_copy_partial(p, pub_data->data_buf, sizeof(pub_data->data_buf), 0); It would only copy 4 bytes of the packet since 'sizeof(pub_data->data_buf)' was 4. I changed it to: pub_data->data_len = pbuf_copy_partial(p, pub_data->data_buf, 2048, 0); and it worked fine from that point forward.

There's one standing issue where when I tried to access a particular png file over the vpn connection, it appears to just die and stop receiving the MQTT messages. I'm not sure if it just drops the connection or what. I enabled debugging and could never find the MQTT_EVENT_DISCONNECTED log coming through. Once it would silently stop receiving the messages, then nothing would work over the MQTT VPN (the ESP didn't crash, it kept running other things but the MQTT VPN just stopped working) until I restart the chip and then everything works fine again. Is there a way to check and re-initialize the MQTT connection periodically within this?

bwjohns4 commented 1 year ago

Regarding the MQTT client dying, here are some logs from verbose dbg level. Even when there are no MQTT messages, the MQTT client still logs that it's alive in verbose logging:

14:40:10.353 > I (174688) MQTTIF: Send topic mqttipKC/10.0.1.1 received - len: 60 14:40:10.424 > I (174774) MQTTIF: BWJ PBUF: 60 bytes 14:40:10.465 > D (174776) MQTT_CLIENT: deliver_publish: msg_topic_len=zu 14:40:10.527 > I (174888) MQTTIF: BWJ PBUF Cont: E bytes 14:40:10.573 > D (174949) MQTT_CLIENT: Get data len= zu, topic len=zu, total_data: 60 offset: zu 14:40:10.659 > D (175081) MQTTIF: MQTT_EVENT_DATA 14:40:10.696 > I (175119) MQTTIF: Received mqttipKC/192.168.1.145 - 60 bytes 14:40:10.762 > I (175184) MQTTIF: topic_len: 22 14:40:10.798 > I (175220) MQTTIF: calculated receive topic_len: 17 14:40:10.854 > I (175276) MQTTIF: data->receive_topic: mqttipKC/10.0.1.2 14:40:10.915 > I (175337) MQTTIF: topic: mqttipKC/192.168.1.145E 14:40:10.968 > I (175391) MQTTIF: BWJ Made it here 14:40:11.007 > I (175429) MQTTIF: BWJ unencrypted take 14:40:11.050 > D (175472) MQTTIF: Buffer post - len: 60 tot_len: 60 14:40:11.106 > D (175532) event: running post MQTTVPN_EVENTS:1 with handler 0x400db9c0 and context 0x3ffd5940 on loop 0x3ffd45e4 14:40:11.243 > ␛[33m=> 0x400db9c0: packet_send_handler at lib/MQTT_VPN_ESP32/mqttif.c:149␛[0m 14:40:11.244 > D (175532) MQTT_CLIENT: mqtt_message_receive: first byte: 0x30 14:40:11.292 > I (175648) MQTTIF: Send topic mqttipKC/10.0.1.1 received - len: 60 14:40:11.363 > D (175715) MQTT_CLIENT: mqtt_message_receive: read "remaining length" byte: 0x4c 14:40:11.449 > D (175871) MQTT_CLIENT: mqtt_message_receive: total message length: 78 (already read: zu) 14:40:11.543 > D (175966) MQTT_CLIENT: mqtt_message_receive: read_len=76 14:40:11.605 > D (176027) MQTT_CLIENT: mqtt_message_receive: transport_read():zu zu 14:40:11.678 > D (176100) MQTT_CLIENT: msg_type=3, msg_id=0 14:40:11.726 > D (176148) MQTT_CLIENT: deliver_publish, message_length_read=zu, message_length=zu 14:40:11.813 > D (176235) MQTT_CLIENT: deliver_publish: msg_topic_len=zu 14:40:11.875 > D (176297) MQTT_CLIENT: Get data len= zu, topic len=zu, total_data: 52 offset: zu 14:40:11.961 > D (176383) MQTTIF: MQTT_EVENT_DATA 14:40:11.999 > I (176421) MQTTIF: Received mqttipKC/192.168.1.145 - 52 bytes 14:40:12.064 > I (176486) MQTTIF: topic_len: 22 14:40:12.100 > I (176522) MQTTIF: calculated receive topic_len: 17 14:40:12.156 > I (176578) MQTTIF: data->receive_topic: mqttipKC/10.0.1.2 14:40:12.217 > I (176640) MQTTIF: topic: mqttipKC/192.168.1.145E 14:40:12.270 > I (176693) MQTTIF: BWJ Made it here 14:40:12.309 > I (176731) MQTTIF: BWJ unencrypted take 14:40:12.352 > D (176774) MQTTIF: Buffer post - len: 52 tot_len: 52 14:40:12.408 > I (176789) MQTTIF: BWJ PBUF: 60 bytes 14:40:12.450 > I (176872) MQTTIF: BWJ PBUF Cont: E bytes 14:40:12.495 > D (176918) MQTT_CLIENT: mqtt_message_receive: first byte: 0x30 14:40:12.563 > D (176984) MQTT_CLIENT: mqtt_message_receive: read "remaining length" byte: 0xcb 14:40:12.647 > D (177070) MQTT_CLIENT: mqtt_message_receive: read "remaining length" byte: 0x3 14:40:12.732 > D (177154) MQTT_CLIENT: mqtt_message_receive: total message length: 462 (already read: zu) 14:40:12.828 > D (177252) MQTT_CLIENT: mqtt_message_receive: read_len=459 14:40:12.890 > D (177312) MQTT_CLIENT: mqtt_message_receive: transport_read():zu zu 14:40:12.965 > D (177385) MQTT_CLIENT: msg_type=3, msg_id=0 14:40:13.011 > D (177433) MQTT_CLIENT: deliver_publish, message_length_read=zu, message_length=zu 14:40:13.099 > D (177521) MQTT_CLIENT: deliver_publish: msg_topic_len=zu 14:40:13.160 > D (177582) MQTT_CLIENT: Get data len= zu, topic len=zu, total_data: 435 offset: zu 14:40:13.248 > D (177670) MQTTIF: MQTT_EVENT_DATA 14:40:13.285 > I (177707) MQTTIF: Received mqttipKC/192.168.1.145 - 435 bytes 14:40:13.352 > I (177774) MQTTIF: topic_len: 22 14:40:13.387 > I (177809) MQTTIF: calculated receive topic_len: 17 14:40:13.444 > I (177866) MQTTIF: data->receive_topic: mqttipKC/10.0.1.2 14:40:13.505 > I (177927) MQTTIF: topic: mqttipKC/192.168.1.145E 14:40:13.558 > I (177980) MQTTIF: BWJ Made it here 14:40:13.596 > I (178019) MQTTIF: BWJ unencrypted take 14:40:13.639 > D (178061) MQTTIF: Buffer post - len: 435 tot_len: 435 14:40:13.698 > o clients subscribed
14:40:13.722 > Ensure WiFi 14:40:13.736 > no clients subscribed
14:40:13.789 > I (178345) MQTTIF: BWJ PBUF: 60 bytes 14:40:13.831 > I (178345) MQTTIF: BWJ PBUF Cont: E bytes

<------------ All the logs below here are from my sketch that's doing other things -------------------> <------------The MQTT client appears to have just died and never re-initialized ---------------------->

14:40:18.731 > no clients subscribed
14:40:23.749 > Reminder that Reset Button Monitor Task is running: 0 14:40:23.806 > BWJ Stack Watermark: no clients subscribed
14:40:23.854 > 3028 remaining of 8192 14:40:28.767 > no clients subscribed
14:40:33.733 > Ensure WiFi 14:40:33.785 > no clients subscribed
14:40:38.803 > no clients subscribed
14:40:43.821 > no clients subscribed
14:40:44.770 > Reminder that Reset Button Monitor Task is running: 0 14:40:44.826 > BWJ Stack Watermark: 3028 remaining of 8192 14:40:48.839 > no clients subscribed

bwjohns4 commented 1 year ago

I tried some more on and realized that the TCP mechanism appears to block indefinitely after a few packets when sending the get requesting the 10k png. I changed stack sizes and lots of things, but the only thing that appears to help is reducing the MTU on the mqttif to less than 1500. I tried both 500 and 1000 and the crash block stopped happening. I still can't get the file to actually go through. Is there anything that could help improve the throughout to something more useable in a real life use case?

bwjohns4 commented 1 year ago

I tried everything imaginable to trace where this problem was occurring (lots of LWIP changes, stack sizes, etc) and found that simply increasing the queue size fixed the problem and allowed reliable connection. Not sure if 50 is too many, but I also tried increasing to 10 and still observed the same situation where the TCP stack stopped responding completely resulting in all tasks making calls to it to get stuck blocking forever.

bwjohns4 commented 1 year ago

@martin-ger , I've noticed that the tcp blocking problem happens when the event loop gets full. I think the problem is that the ESP32 mqtt interface is sending pbufs which are not getting freed. I think they get freed after the event loop processes/sends them and no longer has a reference to the pbuf object (am I correct). Would it be better to track how many messages are in the event loop and if it's full then just do free_pbuf(p) without actually posting the pbuf to the processing loop (which then just fall back on tcp retransmissions to get the data through). Something like:

if event loop is full
drop packet silently by calling free_pbuf(p) and skip posting to event loop
else
post to event loop

within this code section.

pub_data->data_len = pbuf_copy_partial(p, pub_data->data_buf, 2048, 0);
        //pub_data->data_len = pbuf_copy_partial(p, pub_data->data_buf, sizeof(pub_data->data_buf), 0);

        //struct ip_hdr *iph; = (struct ip_hdr *)data->buf;
        //printf("packet %d, buf %x\r\n", len, p);
        //printf("to: " IPSTR " from: " IPSTR " via " IPSTR "\r\n", IP2STR(&iph->dest), IP2STR(&iph->src), IP2STR(ipaddr));

        esp_event_post_to(loop_with_task, MQTTVPN_EVENTS, PACKET_SEND_EVENT, &pub_data, sizeof(&pub_data), portMAX_DELAY);

Does that make sense that if the event loop gets full then the pbufs just never get freed and if all the TX pbuf allocation is full then the TCP stack becomes unresponive? Do you know exactly what mechanism would make the TCP stack no longer work. I just observe that my main Arduino task goes into permanent blocking waiting on sending a small request at this point.