mongoose-os-libs / pppos

PPPoS / cellullar modem support
Other
27 stars 23 forks source link

Core Dump after failing to get IP address on "latest" #20

Closed DrBomb closed 3 years ago

DrBomb commented 4 years ago

Platform: ESP32

Issue:

If the PPPoS session fails to get an IP address. This line forces the PPPoS stack to reinit. After reaching the same state it finally core dumps twice (not leaving a coredump file behind) and reboots.

How to reproduce:

Hook up a modem to the ESP32. I used no flow control lines, TX and RX connected to the ESP32. I'm using UART 2 for this.

Set up the APN settings incorrectly. Make sure the modem is on before restarting the firmware.

The stack will reach the part where the PPP link is established, and due to the wrong APN settings it will fail to get an IP address. Afterwards the firmware will restart the PPP stack and fail to get the IP address and the firmware will crash.

More Info

The following logs have been acquired running latest on both multicore and unicore. Because there are no core dump files, I symbolized the backtraces and placed them at the end of the logs.

Unicore

Multicore

Running on 2.17.0 with the 2.17.0 version of the PPPoS library does not make the firmware crash. After awhile of the PPP link staying in connected a Error 9 (phase 0), reconnect error is shown and the process is restarted.

However, I did build the latest PPPoS library on 2.17.0 (release) mOS, compiling the sources directly and can confirm the crash happens as well, leaving the device for some reason on a unrecoverable state where the bootloader is unable to boot mOS again until a hard reset

Some remote network conditions mean that the PPPoS link might not be able to get an IP address the first time around, so this could affect many devices of ours deployed on the field.

rojer commented 4 years ago

ok, i tried to reproduce this today and couldn't. it seems that my operators just don't care what AP i give, they completely ignore it.

DrBomb commented 4 years ago

If your module is 4G I believe the whole tower session includes the data session so it does not matter if you have an incorrect APN.

Perhaps trying setting your module to 3G only, or using a 3G sim card

rojer commented 4 years ago

i'm using SIM800L modem, which i believe is GPRS only

DrBomb commented 3 years ago

Reviving this issue as the same is still happening. I've been avoiding rebasing the library due to this. But I decided doing one this week to see if I missed any new features and to consolidate my library fork only to be met with it again.

[Apr 22 01:45:09.508] mgos_pppos.c:408        Connected to mobile network (home)
[Apr 22 01:45:09.913] mgos_pppos.c:459        Operator: 73404,movistar Movistar
[Apr 22 01:45:10.005] mgos_pppos.c:468        RSSI: -67
[Apr 22 01:45:10.214] mgos_pppos.c:816        Starting PPP, user ''
[Apr 22 01:45:10.221] mgos_mongoose.c:66      New heap free LWM: 225964
[Apr 22 01:45:10.226] mgos_net.c:90           PPP: connected
[Apr 22 01:45:10.238] mgos_mongoose.c:66      New heap free LWM: 223016
[Apr 22 01:45:10.251] mgos_mongoose.c:66      New heap free LWM: 222740
[Apr 22 01:45:40.242] mgos_pppos.c:854        Failed to acquire IP
[Apr 22 01:45:40.246] mgos_pppos.c:857        Phase before close 9
[Apr 22 01:45:40.251] mgos_pppos.c:858        Close return value 0
[Apr 22 01:45:40.253] mgos_pppos.c:859        Phase after close 6
[Apr 22 01:45:40.259] mgos_pppos.c:860        Free return value -11
[Apr 22 01:45:40.264] mgos_net.c:86           PPP: connecting
[Apr 22 01:45:42.712] mgos_pppos.c:668        Connecting (UART2, APN 'test')...
[Apr 22 01:45:42.718] mgos_pppos.c:711        Automatic operator selection
[Apr 22 01:45:43.193] mgos_pppos.c:279        Manufacturer: SIMCOM INCORPORATED
[Apr 22 01:45:43.193] Model: SIMCOM_SIM7500A
[Apr 22 01:45:43.193] Revision: SIM7500A_V1.5
[Apr 22 01:45:43.193] SVN: 04
[Apr 22 01:45:43.193] IMEI: SNIP
[Apr 22 01:45:43.198] +GCAP: +CGSM,+DS,+ES, IMEI: SNIP
[Apr 22 01:45:43.202] mgos_stmi2c.c:476       STM Version 0x27
[Apr 22 01:45:43.509] mgos_pppos.c:349        SIM is ready, IMSI: SNIP, ICCID: SNIP
[Apr 22 01:45:43.806] mgos_pppos.c:408        Connected to mobile network (home)
[Apr 22 01:45:44.211] mgos_pppos.c:459        Operator: 73404,movistar Movistar
[Apr 22 01:45:44.304] mgos_pppos.c:468        RSSI: -67
[Apr 22 01:45:44.513] mgos_pppos.c:816        Starting PPP, user ''
[Apr 22 01:45:44.526] assertion "netif already added" failed: file "/opt/Espressif/esp-idf/components/lwip/lwip/src/core/netif.c", line 405, function: netif_add
[Apr 22 01:45:44.571]
[Apr 22 01:45:44.571] abort() was called at PC 0x40113683 on core 0
[Apr 22 01:45:44.571]
[Apr 22 01:45:44.571] Backtrace:0x400878ba 0x40087e85 0x4008e402 0x40113683 0x4016ee97 0x40177fac 0x40178645 0x40177cee 0x4016a165

So, the crash is right now happening because there's an assertion that tells us "netif already added". Digging deeper and adding a few more log statements I track the phase state of the PPP process and seems like when calling pppapi_close the phase doesn't change to DEAD which would result on the pppapi_free to return 0 and then remove the netif, instead it returns -11 which is ERR_CONN and the netif is never removed, leading to the assert failure on the next time the interface is added.

How do we enable lwip logging? I'm looking for the reason why the phase doesn't go to DEAD

For reproduction steps, I remember you telling me choosing a wrong APN doesn't work. Perhaps you could lower the deadline for acquiring the IP address, that would trip the section of the code and hopefully, you should get the same crash as I do.

DrBomb commented 3 years ago

Alright, I've got the final results of testing.

When calling pppapi_close the liberation of the protocol is not instant, rather it takes some time while I assume lwip finishes their own processes. Afterwards the mgos_pppos_status_cb callback is executed with the PPPERR_USER code which is what calls pppapi_free and wraps up everything for the next connection try.

The problem is that PPPOS_INIT is entered immediately after calling pppapi_close resulting on a new interface being added to lwip before the old one is freed.

I propose adding a PPPOS_CLEANUP state to the main pppos lib that will not allow PPPOS_INIT to start before PPPERR_USER is emitted and the interface is freed.

I already suspect it might introduce more issues. As the library code at least for me is somewhat complex, I might be screwing up other things such as user disconnections or running AT commands while on cleanup. I will create a preliminary PR but I think this needs a more expert hand to properly address.

rojer commented 3 years ago

thanks for investigating it. what you say makes sense, adding cleanup state before init is the right way to deal with this, i think. let's call it PPPOS_CLOSING

DrBomb commented 3 years ago

Added PR, please check it out when you can :+1:

rojer commented 3 years ago

looks good, just clean up the style and i'll merge