ESP32 - network.PPP - uMQTT - Does not give OSERROR on dropped connection.

trip96 commented 4 years ago

Good morning,

I have noticed that network.PPP does not behave in the same way as the Wi-Fi network module in handling the TCP portion of uMQTT.simple. Specifically when an MQTT message is sent and there is no response from TCP socket I do not get the OSERROR I would see when using Wi-Fi.

The setup that contributes to this issue:

ESP-32
Simcom7000g Cellular modem.
ESP32 over serial to simcom7000g.
network.PPP for handling the connection through serial through the simcom7000g.

This works perfectly for long periods of time. BUT, if the data call is dropped or I remove the antenna micropython still returns ppp.isconnected() as True. I think this is expected because the connection was never terminated but rather it 'dropped' (kind of a half open situation). The problem is when MQTT publishes a message there is no error given. Usually this is the OSERROR. I would expect there to be no TCP Acknowledgment on the sent MQTT message either (due to no data connection). This is what should cause the OSERROR (to my understanding).

Eventually the program will spiral into a loop, printing -113 over and over then eventually crash.

I have tried modifying the MQTT.simple to have a timeout on the socket. The socket is also currently wrapped in SSL with no additional SSL params (I know this reduces security, we are testing data consumption currently). I also have modified the MQTT.simple to print a message on an exception. Both of these modifications have not resolved this strange issue or given further insight.
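For reference, here is a minimal sketch of what a socket timeout is expected to do, written for CPython so it can be run on a PC. The helper name wait_for_reply is hypothetical and this is not the umqtt.simple code. It only shows that a read with a timeout set raises an OSError when the peer stays silent. Whether the same holds on the ESP32 with the socket wrapped in SSL over PPP is exactly the open question here.

```python
import socket


def wait_for_reply(sock, nbytes, timeout_s):
    """Return the received bytes, or None if nothing arrives in time."""
    sock.settimeout(timeout_s)
    try:
        data = sock.recv(nbytes)
    except OSError:
        # In CPython a socket timeout is raised as a subclass of OSError.
        return None
    return data or None  # b"" means the peer closed the connection


# A connected pair of local sockets stands in for client and broker.
client, broker = socket.socketpair()
silent = wait_for_reply(client, 4, 0.2)    # broker never answers
broker.send(b"\x40\x02\x00\x01")           # an MQTT PUBACK for packet id 1
answered = wait_for_reply(client, 4, 0.2)  # now there is something to read
client.close()
broker.close()
```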

Of course this is all dependent on the IDF as I believe PPP is handled via the Espressif IDF. So perhaps a new IDF revision fixes this for me as well.

Any ideas?

Thank you all in advance.

peterhinch commented 4 years ago

You might like to read this notably sections 1.1-1.2. tl;dr The official MQTT libraries don't reliably recover from WiFi outages: perhaps your observations are unsurprising. I have no experience with PPP but it's possible that the resilient library can be adapted for PPP.

trip96 commented 4 years ago

@peterhinch

As usual thank you for being the first to reply to github and forum issues. It is greatly appreciated.

In response to your comments:

If I do not receive a TCP acknowledgment for the MQTT publish, I would like some kind of error or feedback to be issued. My recovery process is machine.reset() in this case.

Thank you again for the work you do for micropython community.

tve commented 4 years ago

Dang, I'm 2 hours late to beat Peter! :-) :-)

While I second Peter's suggestion to use mqtt_as, there is something wrong. You should get an error due to TCP ACK timeout and from looking at it I suspect code in the ESP-IDF (e.g. LwIP). You didn't say which version you're using, have you tried V1.12 with ESP-IDF v4? Other suggestions I have are to try (1) without SSL to see whether that swallows the error and (2) using Wifi to see whether you get the expected error, and perhaps (3) try your test code in the unix version of MP or CPython to see what happens there. These experiments would allow you to narrow down the source of the problem.

NB: Looks like we have another case of a "well-documented" feature (PPP)... Sigh.

trip96 commented 4 years ago

@tve

I agree with the strange behavior with the TCP ACK timeout not being caught in the usual way. I also figured the IDF would be the first place to start testing.

I will start testing your suggestions and report back as the information comes in.

Thank you again for your reply and significant amount of information in regards to this issue.

peterhinch commented 4 years ago

People have commented previously about the TCP ACK timeout.

The approach in mqtt_as is to work at the MQTT level using the MQTT keepalive concept to keep the link open and to hold off the last will from the broker. Publications with qos==1 are subject to a timeout. If the timeout elapses without a response packet the link is assumed to be down: the socket is closed, the link is downed and a reconnect task is set in progress. Inevitably this can take an arbitrarily long time but the qos==1 guarantee is maintained when connectivity revives.
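The timeout idea can be sketched roughly as follows. This is not the mqtt_as source. The class name PendingPublishes and its methods are made up for illustration, and the clock is injectable only so the logic can be exercised without waiting.

```python
import time


class PendingPublishes:
    """Track qos==1 publications that are still waiting for a PUBACK."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock
        self._deadlines = {}  # packet id -> time by which the PUBACK must arrive

    def sent(self, pid):
        """Record that a publication with this packet id has gone out."""
        self._deadlines[pid] = self._clock() + self._timeout_s

    def acked(self, pid):
        """Record that the broker acknowledged this packet id."""
        self._deadlines.pop(pid, None)

    def link_down(self):
        """True if any publication has outlived its timeout."""
        now = self._clock()
        return any(now > deadline for deadline in self._deadlines.values())
```

When link_down() returns True the application would close the socket, down the link and start reconnecting, then republish whatever is still pending.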

The official solutions assume perfect connectivity without network or broker outages: remedying this proved non-trivial. Options are either to modify mqtt_as for PPP or to do a ground-up rewrite.

trip96 commented 4 years ago

@peterhinch

Thank you for continuing discussing this. I will take some time to do some testing suggested by @tve and then perhaps we can move forward. I will add mqtt_as into the testing schedule. This will take a little while to do testing as I am in the middle of a data consumption test that will take 24 hours. I also have limited time to test each stage.

I will report back.

MrSurly commented 4 years ago

PPP is data layer, and cannot detect problems with the physical layer. Specifically, it cannot detect that the modem has sent NO CARRIER and is in command mode.

trip96 commented 4 years ago

@MrSurly

This is a great addition to the conversation. As a functional work around to my current issue I could periodically send 'AT' to ping the modem. If it responds we know the data connection has been lost because the modem has returned to command mode (as evident by it accepting the AT command).
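A rough sketch of that probe, assuming a UART-like object with write() and a non-blocking read() that returns None when nothing is waiting (as machine.UART does). The function name modem_answers_at and its parameters are hypothetical, and it assumes the PPP code is not consuming the serial data at the same time, which may not be true while PPP is active.

```python
import time


def modem_answers_at(uart, attempts=3, delay_s=0.2, sleep=time.sleep):
    """Send a bare AT and report whether the modem answered OK."""
    uart.write(b"AT\r\n")
    reply = b""
    for _ in range(attempts):
        sleep(delay_s)            # give the modem time to respond
        chunk = uart.read()       # None when no bytes are available
        if chunk:
            reply += chunk
        if b"OK" in reply:
            return True           # modem is in command mode, data call is gone
    return False                  # no answer: still in data mode, or not listening
```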

I also can understand that PPP is the data layer and why if the connection is dropped at the cellular physical layer that PPP.isconnected() would still return True.

But why can't the socket determine that it is closed / interrupted when it does not receive a response (TCP ACK)?

Do you have any thoughts on this particular portion of the problem?

Thank you for the contribution to this discussion.

peterhinch commented 4 years ago

@trip96

As a functional work around to my current issue I could periodically send 'AT' to ping the modem.

You can also do this at MQTT level with an MQTT ping packet (awaiting PINGRESP) - as done in mqtt_as. But your approach may be faster.
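At the wire level the MQTT ping is tiny: PINGREQ is the two bytes 0xC0 0x00 and PINGRESP is 0xD0 0x00. A minimal sketch, written against a plain CPython socket; mqtt_ping is a hypothetical helper, not the mqtt_as implementation, and it assumes no other packet (for example an incoming PUBLISH) arrives ahead of the response.

```python
import socket

PINGREQ = b"\xc0\x00"   # MQTT control packet type 12, remaining length 0
PINGRESP = b"\xd0\x00"  # MQTT control packet type 13, remaining length 0


def mqtt_ping(sock, timeout_s):
    """Send PINGREQ and report whether PINGRESP came back in time."""
    sock.settimeout(timeout_s)
    try:
        sock.sendall(PINGREQ)
        reply = sock.recv(2)
    except OSError:
        return False  # timeout or socket error: treat the link as down
    return reply == PINGRESP
```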

trip96 commented 4 years ago

@peterhinch

I know this is super frugal. But the main reason I want to use the 'AT' command to test the modem's current state is to use as little data over the network as possible.

Of course, our main goal is a robust connection so a Keep-Alive of some sort MAY be required.

Thank you for pointing out this feature on 'mqtt_as' as well. It is invaluable to have the developer of a module to come into a forum and relay great ideas in order to fix a problem.

MrSurly commented 4 years ago

But why can't the socket determine that it is closed / interrupted when it does not receive a response (TCP ACK)?

If it's at the TCP level, then it should work the same over PPP.

As a functional work around to my current issue I could periodically send 'AT' to ping the modem. If it responds we know the data connection has been lost because the modem has returned to command mode (as evident by it accepting the AT command).

This will be difficult, since many modems (e.g. BG96) will return to command mode upon PPP termination, and you'll have to terminate PPP to see any response from the modem, since the PPP code launches a task to intercept all the serial port data. The PPP code could easily be modified to "pause" listening to the port to allow what you propose, but there's no guarantee that sending AT when PPP is still active won't mess up the session. I don't know enough about the PPP protocol to say.

One thing you could try is to use your modem with a more robust PPPD (i.e. a PC). This is how I debugged my PPP chat scripts for use in MP.

MrSurly commented 4 years ago

Also something to think of is does your modem firmware allow toggling an external pin when the connection drops? If so, this could be wired up in the same fashion as the old RS-232 DCD or DSR signalling.

tve commented 4 years ago

More input... @trip96 wrote:

I have noticed that network.PPP does not behave in the same way as the Wi-Fi network module in handling the TCP portion of uMQTT.simple. Specifically when an MQTT message is sent and there is no response from TCP socket I do not get the OSERROR I would see when using Wi-Fi.

I do not believe that Wifi and PPP work the same. When Wifi drops, I believe the interface is torn down, which closes all TCP sockets, so any operation on them will error. When your modem drops the connection, nothing happens other than that there is no more input coming into PPP.

Snippet from http://man7.org/linux/man-pages/man7/tcp.7.html:

tcp_retries2 (integer; default: 15; since Linux 2.2)
              The maximum number of times a TCP packet is retransmitted in
              established state before giving up.  The default value is 15,
              which corresponds to a duration of approximately between 13 to
              30 minutes, depending on the retransmission timeout.  The
              RFC 1122 specified minimum limit of 100 seconds is typically
              deemed too short.

Obviously Linux != LwIP, but I wouldn't bank on the timing to be that different... You could investigate what the LwIP TCP parameters are.
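To see where a figure in that range comes from, here is a back-of-the-envelope sketch of exponential backoff. The 0.2 s starting timeout and 120 s cap are the Linux minimum and maximum retransmission timeouts and are assumptions for this illustration only. A real connection starts from an RTT-based timeout, so this is a lower bound, and the LwIP values may well differ.

```python
def total_retransmit_time(initial_rto_s, max_rto_s, retries):
    """Sum the waits for the original send plus each retransmission."""
    total = 0.0
    rto = initial_rto_s
    for _ in range(retries + 1):
        total += min(rto, max_rto_s)  # each wait is capped at max_rto_s
        rto *= 2                      # the timeout doubles after every failure
    return total


seconds = total_retransmit_time(0.2, 120.0, 15)
print(round(seconds, 1), "s =", round(seconds / 60, 1), "min")  # 924.6 s = 15.4 min
```

Plugging in the retry count and timeouts that LwIP is actually built with would give the equivalent estimate for the ESP32.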

trip96 commented 4 years ago

@tve

Your latest information is in alignment with what I have seen so far.

I did have time for only one test. I removed the antenna from the modem and it took approx 25min for the connection to actually be dropped (application mqtt socket - not - physical layer obviously).

I believe the LwIP information you posted is extremely helpful in understanding what is going on and further testing and adjustments.

I guess one would need to adjust the amount of TCP retries and timeout within the ESP-IDF LwIP code in order to shorten this period to say 10 minutes of retries.

Thank you so much @tve for taking the time to post this. I think we are close to the heart of the issue here.

chtinnes commented 4 years ago

Also something to think of is does your modem firmware allow toggling an external pin when the connection drops? If so, this could be wired up in the same fashion as the old RS-232 DCD or DSR signalling.

My modem has a DCD pin but unfortunately, I can not access it on the module I am using.

I guess one would need to adjust the amount of TCP retries and timeout within the ESP-IDF LwIP code in order to shorten this period to say 10 minutes of retries.

I am not sure if this is the right way to do it. When the connection has dropped (and the modem already knows about it), even waiting another 10 min seems wrong. Also, for a mobile connection it could happen that there is no data for a period that long. I think handling this at the TCP level would rather be a workaround, because your modem does not tell you that it has lost the connection and will not recover without intervention.

IMHO, the correct way would be to receive information from the modem when the connection has been dropped. Is there some "standard" way to check connection state via serial with a modem (maybe even part of PPP?), or should this be handled by using the DCD pin? In the latter case, I could not use my current module :'(

tve commented 4 years ago

I believe there are two cases to consider:

The device sends a message at relatively long intervals, the messages need to be delivered reliably, and it is essential to absolutely minimize the data volume over the cellular connection (e.g. for cost reasons). In this case, I believe TCP won't cut it, you have to use UDP and do your own end-to-end ACK and then decide what, how, when, and how often you retry. Given carrier time-outs you will be firing up a fresh connection for each message.
You're not strapped to the minimum as above. In that case, use an end-to-end ACK, e.g. QoS=1 in the MQTT case. Give the ACK a reasonable time-out for the network technology (2 minutes for cellular to allow for connection re-establishment?). If the ack times out, wrangle the modem explicitly.
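For the first case, a do-it-yourself end-to-end ACK over UDP could look roughly like this. The wire format (a 2-byte sequence number in front of the payload, echoed back by the server as the ACK) is invented for this sketch, as is the name send_with_ack, and a matching server would be needed.

```python
import socket
import struct


def send_with_ack(sock, addr, seq, payload, timeout_s=2.0, retries=3):
    """Send one datagram and wait for the server to echo its sequence number.

    Returns the number of attempts used, or 0 if every attempt timed out.
    """
    packet = struct.pack("!H", seq) + payload
    sock.settimeout(timeout_s)
    for attempt in range(1, retries + 2):  # the first send plus the retries
        sock.sendto(packet, addr)
        try:
            data, _ = sock.recvfrom(16)
        except OSError:
            continue  # no ACK within the timeout: send again
        if len(data) >= 2 and struct.unpack("!H", data[:2])[0] == seq:
            return attempt
    return 0
```

A return value of 0 is the point at which the application decides what to do next: wrangle the modem, queue the reading, or give up.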

chtinnes commented 4 years ago

@tve At the moment, I also see these two options for my case, but I consider them only a workaround. The thing is that the modem already knows about the dropped connection, so why should I not use this information? As you wrote above, with WiFi the interface is torn down on connection failure. From an application (or socket user) point of view, there should be no difference between WiFi and mobile. I had a quick look at the PPP specification and it says about the dead link phase:

Implementation Note:

 Typically, a link will return to this phase automatically after
 the disconnection of a modem.

So maybe this could be used in the PPP implementation, to tear down the interface?

EDIT: Another solution which might work (not verified yet, as @MrSurly pointed out this could mess up the session), is to periodically leave the data mode and check status and reconnect if necessary. My modem supports leaving the data mode by sending "+++".
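A sketch of that escape sequence, unverified like the idea itself. It assumes the usual Hayes convention of about one second of silence before and after "+++" with no line terminator, a UART-like object with write() and read(), and that nothing else (such as the PPP task) is reading the port at the same time. The function name is hypothetical.

```python
import time


def escape_to_command_mode(uart, guard_s=1.0, sleep=time.sleep):
    """Try to switch the modem from data mode to command mode."""
    sleep(guard_s)        # guard time: nothing may be sent just before the escape
    uart.write(b"+++")    # no CR/LF, otherwise it is treated as ordinary data
    sleep(guard_s)        # guard time after the escape
    reply = uart.read() or b""
    return b"OK" in reply
```

After checking status, the standard ATO command is the usual way to return to data mode, provided the session survived.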

trip96 commented 4 years ago

Since this is active again I thought I could share where I am currently in this. I have not completely solved the situation but I have made some useful modifications.

Here is my basic hardware:

ESP32 with a Simcom 7000g modem. Only Rx and Tx connected with a power pin as well. No DCD pin or anything like that.

The behavior:

Simcom7000g and cell carrier drop connection sporadically - perhaps every 48 hours or so. However I am currently on a 5 day 100% up time streak right now. I will update if I continue to see improvements in uptime.

Anyways, let's say every 48 hours I would lose the connection between the simcom7000g and the carrier, noted by the change in LED blinking rate. PPP connected with simcom7000g produces a rapid blink. When signal is lost the blinking is much slower. (As a note, I am wondering why this drops. Is it the carrier? The modem? The environment? Or is it something I am doing via the python or micropython / ESP-IDF? LwIP? I would love suggestions on how to pinpoint the problem of losing a connection.)

Previously I would have to wait approx 45-minutes for my micropython scripts to recognize the connection drop. I now have it down to 13 minutes.

The Setup

Since I am using cellular data payloads are a major factor in cost. We need the smallest bandwidth possible. Right now I am using MQTT and pushing temperature readings every 10 minutes. This also acts as a really long keep-alive.

Latest Attempts at Fixing

So, I now use blocking MQTT at QoS 1, because if the connection is dropped the publish blocks and the watchdog triggers a restart. Definitely not the ideal way to do this, I know. Anyway, it does work, and with the 10 minute MQTT push and a 3 minute watchdog we get a 13 minute downtime when the connection drops. This is totally usable and has been reliable for months.
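In outline the scheme looks like this. It is a sketch rather than my actual code: client is assumed to behave like umqtt.simple's MQTTClient (publish() with qos=1 blocks until the PUBACK arrives), wdt like machine.WDT, the function names are made up, and the main loop is assumed to keep feeding the watchdog between publishes.

```python
def publish_or_reset(client, wdt, topic, payload):
    """Publish at QoS 1; if the PUBACK never arrives, the watchdog resets the board."""
    wdt.feed()                             # start the publish with a fresh watchdog
    client.publish(topic, payload, qos=1)  # blocks here while waiting for PUBACK
    wdt.feed()                             # only reached once the broker acknowledged


def worst_case_downtime_s(publish_interval_s, wdt_timeout_s):
    """Link drops right after a publish; the next publish hangs until the watchdog fires."""
    return publish_interval_s + wdt_timeout_s


print(worst_case_downtime_s(10 * 60, 3 * 60) / 60, "minutes")  # 13.0 minutes
```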

What Didn't Work - But Maybe I did it Wrong?

When micropython does ppp.connect() the serial port communication is modified to handle the ppp. So, I tried a couple of things to see whether a dropped ppp connection could be detected faster than by using a non-responding TCP socket every 10 minutes (MQTT sends the message, blocks the script while waiting for ACK, ACK never comes through, watchdog restarts after not being fed for 3 minutes).

I created a function that got called every 3 seconds that sent an 'AT' command directly to the simcom. The idea was that if the modem responded then it had lost the connection. If it didn't, then the connection was still active. Perhaps I coded it poorly, but it did not work. Even when the modem connection to the cellular carrier dropped I still would not receive the 'AT-OK' response. I believe this is because ppp.active() was still True for micropython, so this function will never work. If you call ppp.active(False) then the connection is dropped anyways.
I tried asking for the IP from micropython every 3 seconds (ppp.ifconfig()). This also would return the IP even when the cellular connection dropped.
I tried asking micropython if the ppp.active() was True. It will always return True. In my case the ppp from micropython won't recognize that the simcom has dropped connection to cellular provider.
I used the MQTT functionality of the Simcom7000g and did everything via AT commands. This actually worked alright, but callbacks into micropython were a bit messy and sometimes the serial info got a bit garbled. The real reason I abandoned it was because the keep alive has a max setting of 180 seconds. This added significant overhead.

Things I didn't try but might work - and why I didn't try them

I would imagine that calling ppp.active(False) after each successful send and then calling ppp.active(True) and ppp.connect() would work better for 100% sending success of messages, but this is costly for data overhead upon connecting / reconnecting to MQTT over TLS. Trying to keep cost down.... Also, I wish to send commands to this device over MQTT, and disconnecting would obviously cause a problem when I want as close to 0 latency as possible. I want to unlock a door, for example. I do not want to wait up to 10 minutes for that command to go through because it is sitting in the MQTT server waiting for the ESP-32 and simcom to connect again.
UDP - As funny as it sounds I just never tried this but on TVE's suggestion I may start to play around with this. I have 0 experience with UDP so this might be fun.
Reconnecting through the simcom7000g. Ideally, if the connection was lost between the modem and cell carrier, making another data call (AT+CGDATA="PPP",1) would be great. However, due to time I haven't yet explored this.

What's Next for me?

I do want to get to the bottom of why the ppp on micropython can not determine when the simcom is disconnected.

I have made progress with longer uptime by using the latest uasyncio with the event scheduling to lock the UART while sending / receiving ppp data through the simcom. Perhaps my problems were in UART / serial handling all along?

I have this idea (probably isn't correct but) that the ppp from micropython should be checking the status of the modem periodically. Once the modem drops out of the connection with the cellular carrier it will respond to regular AT-commands. Perhaps there is a way to modify the ppp framework to test whether the modem is still connected? I honestly don't know. I am not that versed in these frameworks yet. I am slowly making progress though so maybe in the future I'll have that answer.

Thank you everyone for discussing this. I hope my recent additions shed some light on this.

Looking forward to replies and questions.

tve commented 4 years ago

It sounds like you're looking strictly at outbound communication, if you're also interested in "spontaneous" inbound (to the device) messages then beware that there is a pretty short NAT timeout for UDP on most cellular carriers.

chtinnes commented 4 years ago

I worked around the issue in my case by using a timeout for the check_msg, which works for my use case. What seems strange to me is that I can now observe connection resets (at TCP level) every 150 s. I don't know who or what is resetting the connection, but it looks like it is not my backend. @tve Might this be the NAT timeout you are talking about? Can this problem be solved by setting the keep alive option for the socket? Btw.: Do I see correctly that this option is currently not implemented for the ESP32 port?

tve commented 4 years ago

socket keep-alive probably helps, but it would result in extra packets sent periodically ($$), not sure that's what you want...
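For comparison, this is how TCP keep-alive is switched on in CPython on a PC. Whether the ESP32 port exposes the same options is the open question above, so treat this purely as a desktop sketch. The helper name and the 120 s idle time (chosen to sit below the 150 s resets observed) are assumptions.

```python
import socket


def enable_keepalive(sock, idle_s=120, interval_s=30, probes=4):
    """Turn on TCP keep-alive and tune it where the platform allows."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning options are platform specific, so set only those that exist.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(s)
enabled = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
s.close()
```

Each probe and its reply are small, but on a metered cellular link they still add up over a day.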

jonnor commented 4 days ago

There have been several updates to the esp-idf, which provides the PPP code on ESP32. There also seem to be some reports above of getting things to mostly work. Is there still a specific problem to be fixed here, present on a recent MicroPython version?
