pycom / pycom-micropython-sigfox

A fork of MicroPython with the ESP32 port customized to run on Pycom's IoT multi-network modules.

Lopy4 WLAN connection problem + isconnected() == true but own IP #369

Closed: pascalschaefer closed this issue 4 years ago

pascalschaefer commented 4 years ago

Dear Pycom Forum

We are running 1.20.1.r1 with frozen modules. (sysname='LoPy4', nodename='LoPy4', release='1.20.1.r1', version='3138a13d-dirty on 2019-11-05', machine='LoPy4 with ESP32', lorawan='1.0.2', sigfox='1.0.1')

Most of our devices have been running quite stably for the last few weeks within our internal WLAN networks. In other networks they often can't reconnect after a reboot, or they lose the connection during the day. Specifically, devices are sometimes unable to connect to the WLAN after restarting the device, restarting the router, or changing the DHCP configuration on the router. Once a device is in the state described below, it doesn't recover, even after restarting it or changing the target WLAN.

I read in other topics that isconnected() also returns true if a valid IP is configured.

We really need to understand under which conditions that happens and how the device can recover, so that it reconnects to the configured SSID and gets its IP via DHCP.

Additionally, I am interested in whether there is any further information about long-term connectivity in WLAN networks and best practices for setting them up so that the Pycom microcontroller works at its best.

We tried to find the cause, and it seems that the device sometimes returns true for wlan.isconnected() while it is actually connected only to its own WLAN. This causes [Errno 202] EAI_FAIL, because the device can't resolve the DNS name of the target URL when sending the request.

  1. The device sets up its own WLAN via mode STA_AP.
  2. The device connects to the target WLAN configured via the configuration file.

Can you please let us know what we can do about this problem?

Code

log("Connecting to WIFI", options['wifi_ssid'])
wlan.init(mode=WLAN.STA_AP, ssid=self.ssid, auth=self.auth)
wlan.connect(ssid=options['wifi_ssid'], auth=(WLAN.WPA2, options['wifi_password']))
while not wlan.isconnected():
  # Flash the LED red
  led.error()
  time.sleep(0.05)
  led.off()

  if (config.WDT_MAIN_TIMEOUT > 0):
    wdt.feed()
    time.sleep(2)

log("Connected to WIFI network:", wlan.ifconfig())

Normally: the device connects successfully and gets an IP from the network via DHCP

TRACE [Main] Connecting to WIFI HZN244224071
TRACE [Main] Connected to WIFI network: ('192.168.192.30', '255.255.255.0', '192.168.192.1', '192.168.192.1')
TRACE [Clock] Starting time sync
TRACE [WebUplink] Sending TIME request for event
TRACE [WebUplink] POST request
TRACE [WebUplink] success

Log: device in problematic state

TRACE [Main] Connecting to WIFI HZN244224071
TRACE [Main] Connected to WIFI network: ('192.168.4.1', '255.255.255.0', '192.168.4.1', '8.8.8.8')
TRACE [Clock] Starting time sync
TRACE [WebUplink] Sending TIME request for event
TRACE [WebUplink] POST request
> socket failed: 202 [Errno 202] EAI_FAIL
TRACE [WebUplink] ERROR Unable to make API request [Errno 202] EAI_FAIL
TRACE [WebUplink] Received None
TRACE [WebUplink] Sending TIME request for event
TRACE [WebUplink] POST request
> socket failed: 202 [Errno 202] EAI_FAIL
TRACE [WebUplink] ERROR Unable to make API request [Errno 202] EAI_FAIL
TRACE [WebUplink] Received None
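
The two logs above differ only in the tuple printed after "Connected to WIFI network". A minimal sketch of a check on that tuple follows. The helper name has_usable_station_ip and the constants are hypothetical, and treating 192.168.4.1 as the device's own AP address is an assumption taken from the problematic log, not a documented guarantee.

```python
# Sketch only: the helper name and both constants are illustrative assumptions.
UNSET_IP = "0.0.0.0"
DEFAULT_AP_IP = "192.168.4.1"  # address seen in the problematic log above


def has_usable_station_ip(ifconfig, ap_ip=DEFAULT_AP_IP):
    """Return True if the tuple looks like an address on the target network.

    `ifconfig` is the (ip, netmask, gateway, dns) tuple as printed in the logs.
    """
    ip, _netmask, gateway, _dns = ifconfig
    # No address assigned at all.
    if ip == UNSET_IP:
        return False
    # The device reports its own access-point address, so it has no
    # address on the target network.
    if ip == ap_ip:
        return False
    # A gateway equal to our own address means the device is its own router.
    if gateway == ip:
        return False
    return True
```

With the tuples from the logs, the first one (192.168.192.30) passes the check and the second one (192.168.4.1) does not, so the request could be skipped instead of failing later with EAI_FAIL.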

Many thanks in advance. Best regards, Pascal

amotl commented 4 years ago

Dear @pascalschaefer,

as you closed this issue right away, I am assuming you resolved this already? If not, I might be able to look into that. I believe we worked around similar issues within [1] already.

With kind regards, Andreas.

[1] https://github.com/hiveeyes/hiveeyes-micropython-firmware

pascalschaefer commented 4 years ago

Dear @amotl

Thank you so much for your comment and your reference WLAN code. Your code was very helpful. The issue mentioned above is resolved. It is embarrassing: we used to prepare and test some target SSID/password configurations for some devices via an iOS hotspot. After the 9th device, we got socket failed: 202 [Errno 202] EAI_FAIL. Then we realized that the maximum number of IP addresses handed out by the hotspot's DHCP had maybe been reached. I was playing around and set the IP statically to the default device IP configuration to check the config etc. in different states, and forgot to remove it, because we use WLAN.STA_AP.

But this is only half of the story, because many things related to the WLAN and its stability are still not clear to me. If the device doesn't get a valid IP from the connected WLAN, wlan.isconnected() will still be true after wlan.connect(). The WLAN or hotspot doesn't provide an IP, so the device will somehow still have an IP configuration (0.0.0.0, 0.0.0.0, 0.0.0.0, 0.0.0.0). So later, when using the SSL socket, we get socket failed: 202 [Errno 202] EAI_FAIL, because there is no DNS configured, I assume. I see that case handled in your code, so thanks again. To close the line on this specific problem: we understood that if the device is in WLAN.STA_AP mode, connects to the WLAN, and doesn't get an IP from the AP, we need to detect that to prevent the later socket failure. This issue was resolved in other networks by increasing the limit of DHCP devices in the network.

I am basically searching for all possible causes of a "network connection lost" for our devices. We ran the devices in our internal network for around 2 weeks in an automated test, sending 40 messages every 10 minutes, which worked very well.

Then we installed around 10 devices in each of 3 other networks, and there things are not very stable. Apart from sporadic GURU Exceptions, where I still don't understand all the circumstances under which they happen, the devices randomly lose the connection during the day. After several failed retries we reboot, and then the device sometimes can't reconnect. After some time, when rebooting again, it works.
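
The "several failed retries, then reboot" policy described here can be sketched as a small bounded-retry helper. Everything in it is an assumption for illustration: the name connect_with_retries, its parameters, and the return convention are hypothetical and not taken from the firmware discussed in this thread.

```python
def connect_with_retries(connect, is_up, max_attempts=3, on_give_up=None):
    """Try to connect up to `max_attempts` times, then give up.

    connect:    callable that starts one connection attempt
    is_up:      callable returning True once the link is usable
    on_give_up: optional callable run when all attempts failed
                (on a device this could be a reset function)

    Returns the number of the attempt that succeeded, or 0 on failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            connect()
        except Exception:
            # A raised error counts as a failed attempt; keep trying.
            pass
        if is_up():
            return attempt
    if on_give_up is not None:
        on_give_up()
    return 0
```

On a device, connect might wrap wlan.connect(...) and on_give_up might be machine.reset, so that the reboot only happens after a fixed number of attempts. Logging the returned attempt number could also help to tell networks that need several attempts apart from those that connect at once.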

So on the one hand, restarting the device once a day seems to help a lot with long-term stability with respect to crashes from thread-related issues (memory/stack size?), but on the other hand, with respect to WLAN, the chance that a device can't reconnect is somehow bigger? It also seems to be not only device related; maybe different router types play a role. Today someone told me he will change the WLAN channel across all routers to the same fixed channel, since a channel change during the day could maybe be a reason. I am curious whether this helps in this specific network.

I don't know whether there is any further information about best practices for setting up the WLAN properly so that the devices run at their best.

I see in your code that you call connect two times. Did this help with maybe similar problems when rebooting?

Many thanks for your valuable information. Best Regards, Pascal

amotl commented 4 years ago

> I see in your code that you use connect two times. does this helped related to maybe similar problems when rebooting?

Yes, it seems to work for us when rebooting. The root cause is still unclear.

> Because if the device doesn't get a valid IP from the connected WLAN, wlan.isconnected() will still be true after wlan.connect(). The WLAN or Hotspot doesn't provide an IP, so the device will somehow still have an IP configuration (0.0.0.0, 0.0.0.0, 0.0.0.0, 0.0.0.0). So later when using the SSL socket, we get socket failed: 202 [Errno 202] EAI_FAIL because there is no DNS configured, I assume. That case I see handled in your code. So thanks again.

While the documentation promises that the low-level station.isconnected() should only return true after the device has obtained a valid IP address, we found that this is not always the case. It might be related to running in dual STA_AP mode; we are unsure about this.

To work around that, we added a custom is_connected() method which checks for this condition. It is used from within wait_for_connection().
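
A minimal sketch of what such a pair of helpers could look like is given below. It is not the code from the firmware referenced above: only the names is_connected() and wait_for_connection() come from this comment, while the bodies, the parameters, and the timeout values are assumptions. The wlan argument is assumed to offer isconnected() and ifconfig() as used in the code earlier in this thread.

```python
import time

UNSET_IP = "0.0.0.0"


def is_connected(wlan):
    """Stricter check: link reported up AND a non-zero IP address."""
    try:
        if not wlan.isconnected():
            return False
        ip = wlan.ifconfig()[0]
    except Exception:
        # Any driver error is treated as "not connected".
        return False
    return ip != UNSET_IP


def wait_for_connection(wlan, timeout=15.0, poll_interval=0.25, sleep=time.sleep):
    """Poll is_connected() until it succeeds or `timeout` seconds have passed.

    `sleep` is injectable so a watchdog feed or a test stub can be used.
    """
    waited = 0.0
    while waited < timeout:
        if is_connected(wlan):
            return True
        sleep(poll_interval)
        waited += poll_interval
    # One last check after the final sleep.
    return is_connected(wlan)
```

Passing a sleep function that also feeds the watchdog would keep the loop compatible with a setup like the one in the original post, where wdt.feed() is called while waiting.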

We are looking forward to receiving stability and robustness updates, which will probably come in through ESP-IDF v3.3 and v4.0 in the future.

Good luck with further development on these and other aspects!

pascalschaefer commented 4 years ago

Hi @amotl

Thank you so much for your feedback and the explanation. I will try out these adjustments.

Also to you all the best and good luck in your development.

Best Regards, Pascal

amotl commented 4 years ago

Dear @pascalschaefer,

> Apart from sporadic GURU Exceptions...

You might want to try our "dragonfly" series of custom builds [1], which might bring additional robustness against these issues. However, we are still evaluating them, and you could help in doing so. Maybe you can run them on a few devices for testing purposes and report the outcome back to us.

Thanks already and with kind regards, Andreas.

[1] https://community.hiveeyes.org/t/testing-the-custom-dragonfly-builds-on-pycom-devices/2746

pascalschaefer commented 4 years ago

Dear @amotl

Thank you so much for your effort in improving the stability. I will certainly try it out and let you know about stability improvements. I have a stupid question: with Pymakr in Visual Studio or Atom, we were not able to upload our firmware with the new Pycom release because of timeouts / too many files?

Therefore we bundled everything together via frozen modules into the Pycom firmware and overrode the main.py startup.

Is there any source package reference, like the one from Pycom, that would let us build the vanilla-dragonfly package ourselves and integrate our code via frozen modules? Is the ESP32 package code also modified?

Thanks in advance and Best Regards, Pascal

amotl commented 4 years ago

> I have a stupid question: With pymakr in Visual Studio or Atom we were not able to upload our firmware with the new pycom release because of timeouts / too many files?

We have been running into similar issues. With the "dragonfly" builds, we also observe more robustness here.

> Is there any source package reference like the one from pycom to build it our self for the vanilla-dragonfly package and integrate our code via frozen modules?

Unfortunately, not yet. Many things are involved here, as outlined within [1] and beyond, and we just ran out of resources [2] for upstreaming and mainlining them.

I will be traveling for the next two weeks or so starting on Saturday and will be happy to come back to this when I'm back at the keyboard.

[1] https://community.hiveeyes.org/t/investigating-core-panics-with-ble-on-pycom-devices/2715
[2] Saying this, I will be happy if someone who might be able to support our work on that reaches out to me at andreas[at]terkin.org.

amotl commented 4 years ago

Dear Pascal,

while we already got some feedback from the community that this will really make things more robust [1] and while we are also trying to get in touch with the folks at Pycom, we are still in the phase of confirming it with more people.

So, if you will have the chance to test these builds on your hardware, it will help us tremendously to raise more confidence that our findings and mitigations for the different things outlined within [2] and beyond are not dreamed up in any way and really will make things more stable on different hardware and within different environments.

As the core panics are pretty much nondeterministic and debugging them is quite difficult, we need more feedback from the community to reassure ourselves in this regard.

Thanks already and with kind regards, Andreas.

[1] https://github.com/pycom/pycom-micropython-sigfox/issues/361#issuecomment-553415760
[2] https://community.hiveeyes.org/t/investigating-core-panics-with-ble-on-pycom-devices/2715

amotl commented 4 years ago

Using the most recent 1.20.2.rc3 release seemed to help @pascalschaefer here in order to mitigate the issues he was observing.

-- https://community.hiveeyes.org/t/squirrel-firmware-for-pycom-esp32/2960/2