micropython / micropython-esp32

Old port of MicroPython to the ESP32 -- new port is at https://github.com/micropython/micropython
MIT License
673 stars 216 forks source link

Write to nonblocking sockets fails #166

Open peterhinch opened 7 years ago

peterhinch commented 7 years ago

In the course of updating this asynchronous MQTT driver I want to make it compatible with ESP32. The code uses a nonblocking socket. I get OSError 119 when writing to the socket. The following test runs on ESP8266 but fails on ESP32.

import network
import socket
from time import sleep
import uerrno
SERVER = '192.168.0.9'  # can use 'iot.eclipse.org'
s = network.WLAN(network.STA_IF)
s.active(True)

def connect():
    s.disconnect()
    sleep(1)
    s.connect('SSID', 'PASSWORD')  # Redacted EDIT this
    print('Awaiting conection')
    while not s.isconnected():
        sleep(1)
    print('Got conection, pausing')  # Just to make sure
    sleep(10)
    print('conection done')

def foo(n):
    print('Attempt:', n)
    connect()
    addr = socket.getaddrinfo(SERVER, 1883)[0][-1]  # MQTT
    s = socket.socket()
    s.setblocking(False)
    try:
        s.connect(addr)
    except OSError as e:
        if e.args[0] not in [uerrno.EINPROGRESS, uerrno.ETIMEDOUT]:
            print('Error connecting', e)
    try:
        s.write(b"\xc0\0") # MQTT ping
        print('Write succesful.')
    except OSError as e:
        print('**** ERROR writing ****', e)
    s.close()

foo(1)

Note that I issue disconnect() before connecting. This is because my driver reconnects after a network outage. It also ensures consistent behaviour if an application is started with a network connection already active.

MrSurly commented 7 years ago

Looking into this ...

MrSurly commented 7 years ago

Why not check for EINPROGRESS when attempting to write? Code below works for me

import network
import socket
from time import sleep
import uerrno
SERVER = '192.168.0.9'  # can use 'iot.eclipse.org'
SERVER = 'iot.eclipse.org'
s = network.WLAN(network.STA_IF)
s.active(True)

def connect():
    s.disconnect()
    sleep(1)
    s.connect('SSID', 'PASSWORD')  # Redacted EDIT this
    print('Awaiting conection')
    while not s.isconnected():
        sleep(1)
    print('Got conection, pausing')  # Just to make sure
    sleep(3)
    print('conection done')

def foo(n):
    print('Attempt:', n)
    connect()
    addr = socket.getaddrinfo(SERVER, 1883)[0][-1]  # MQTT
    s = socket.socket()
    s.setblocking(False)
    try:
        s.connect(addr)
    except OSError as e:
        if e.args[0] not in [uerrno.EINPROGRESS, uerrno.ETIMEDOUT]:
            print('Error connecting', e)
            return

    attempts = 3
    while attempts:
        attempts -= 1
        try:
            s.write(b"\xc0\0") # MQTT ping
            print('Write succesful.')
            break
        except OSError as e:
            if e.args[0] not in [uerrno.EINPROGRESS, uerrno.ETIMEDOUT]:
                print('**** ERROR writing ****', e)
        sleep(.5)
    s.close()

foo(1)
peterhinch commented 7 years ago

Here it works only by ignoring the 119 error and trying again. I've never seen the ESP8266 raise 119 and I'm not sure of its significance - perhaps you could advise. Is it always legitimate to trap it and re-send the entire buffer (which in the general case might be long)?

Running your code with only the WiFi credentials changed:

>>> import rats1
I (179751) wifi: wifi firmware version: f092575
I (179751) wifi: config NVS flash: enabled
I (179751) wifi: config nano formating: disabled
I (179751) system_api: Base MAC address is not set, read default base MAC address from BLK0 of EFUSE
I (179761) system_api: Base MAC address is not set, read default base MAC address from BLK0 of EFUSE
I (179781) wifi: Init dynamic tx buffer num: 32
I (179781) wifi: Init dynamic rx buffer num: 64
I (179781) wifi: wifi driver task: 3ffd926c, prio:23, stack:4096
I (179781) wifi: Init static rx buffer num: 10
I (179791) wifi: Init dynamic rx buffer num: 0
I (179791) wifi: Init rx ampdu len mblock:7
I (179801) wifi: Init lldesc rx ampdu entry mblock:4
I (179801) wifi: wifi power manager task: 0x3ffde558 prio: 21 stack: 2560
I (179811) wifi: wifi timer task: 3ffdf5d4, prio:22, stack:3584
I (179831) phy: phy_version: 355.1, 59464c5, Jun 14 2017, 20:25:06, 0, 0
I (179831) wifi: mode : null
I (179841) modsocket: Initializing
I (179851) wifi: mode : sta (24:0a:c4:05:e5:58)
I (179851) wifi: STA_START
Attempt: 1
Awaiting conection
I (181211) wifi: n:3 0, o:1 0, ap:255 255, sta:3 0, prof:1
I (181771) wifi: state: init -> auth (b0)
I (181781) wifi: state: auth -> assoc (0)
I (181781) wifi: state: assoc -> run (10)
I (182681) wifi: connected with misspiggy, channel 3
I (182681) wifi: event 4
I (183261) event: ip: 192.168.0.33, mask: 255.255.255.0, gw: 192.168.0.1
I (183261) wifi: GOT_IP
Got conection, pausing
conection done
**** ERROR writing **** 119
Write succesful.
>>> I (191781) wifi: pm start, type:0

>>>  
MrSurly commented 7 years ago

The ESP8266 and ESP32 use two different versions of LWIP (ESP8266 version is from 2012, ESP32 from 2016).

The ESP8266 version should throw an EINPROGRESS if it doesn't connect immediately when timeout is set to 0. The fact that it doesn't suggests the underlying LWIP implementation is different. The ESP32 version doesn't seem to implement timeouts for connect at all, unless one of the underlying LWIP timeout options also applies to connect. But it fails on write -- it's almost as if the write triggers something in LWIP to complete the connection process.

peterhinch commented 7 years ago

Which I'd understand if it threw EINPROGRESS or ETIMEDOUT on the first call to write(). The remedy would be to delay and retry. But error 119 is ENAVAIL "No XENIX semaphores available" which seems unexpected.

Throwing an exception on connect wouldn't help the user unless there were a way subsequently to poll the connection status before using the socket. In the absence of such a method the only way seems to be to use the socket, trap the exception, and retry until at least one character is written; I guess this conforms to modern Python coding practices.

I performed a quick test where I trapped the 119, including tests of writing a longer string. It seems that it only fails on the first pass. The second pass correctly writes the string.

So perhaps this boils down to the LWIP raising a "wrong" exception number in the event of a pending connected status. If you think we're on the right track I'll proceed on that basis.

nickzoic commented 7 years ago

I turned some debugging on ... lwip_send is getting error -5 back from netconn_write_partly and then translating that to EINPROGRESS. The trouble is that EINPROGRESS isn't consistently defined:

esp-idf/components$ egrep -r '#define\s+EINPROGRESS\s'
lwip/include/lwip/lwip/arch.h:#define  EINPROGRESS    115  /* Operation now in progress */
newlib/include/sys/errno.h:#define EINPROGRESS 119      /* Connection already in progress */

LWIP seems to be picking up on 119 definition, whereas lwip would normally define it as 115 and that's what py/mperrno.h things too ...

dpgeorge commented 7 years ago

LWIP seems to be picking up on 119 definition, whereas lwip would normally define it as 115 and that's what py/mperrno.h things too ...

Nice digging! There's already code in esp32/modsocket.c to convert EINPROGRESS to MP_EINPROGRESS, so the fix should be to make sure that the EINPROGRESS in that file is the same one that lwip uses. It's strange that the newlib errno.h is being used though.... how is that happening, is it on purpose?

peterhinch commented 7 years ago

I should add that I've also seen 118 in my testing. My code treats EINPROGRESS, ETIMEDOUT, 118 and 119 as indicating that the connection is not yet established. So far it seems reliable.

nickzoic commented 6 years ago

Yeah, I didn't quite get to the bottom of it but there's a LWIP_PROVIDE_ERRNO flag there which isn't being set, instead it is using newlib's one. Not sure why we define our own MP_E* as well ...

dpgeorge commented 6 years ago

Not sure why we define our own MP_E* as well ...

Because some toolchains use non-standard errno codes and in the past was confusing that some ports would report certain error numbers, while others would report other numbers for the same thing. It was decided to standardise around one set of errno codes so that it was easier to work out what the code corresponded to (for those codes that don't have a human-readable string associated with them in the uerrno module).