micropython / micropython-esp32

Old port of MicroPython to the ESP32 -- new port is at https://github.com/micropython/micropython
MIT License
673 stars 216 forks source link

SSL handshake interruption causes hard fault #209

Closed 0xDBFB7 closed 5 years ago

0xDBFB7 commented 6 years ago

Somewhat related to #193, if a socket times out about 20 times the board hard faults (unrecoverable).

Socket Open
Socket has timed out: [Errno 104] ECONNRESET
Socket Close

Socket Open
assertion "0" failed: file "../../py/../extmod/modussl_mbedtls.c", line 151, function: socket_new
abort() was called at PC 0x40131ad3 on core 0

Backtrace: 0x400873fb:0x3ffce020 0x40087427:0x3ffce040 0x40131ad3:0x3ffce060 0x400ecf2b:0x3ffce090 0x400df716:0x3ffce0e0 0x400dbbc1:0x3ffce110 0x400dbc29:0x3ffce130 0x400e8c09:0x3ffce150 0x400df7a4:0x3ffce1f0 0x400dbbc1:0x3ffce220 0x400e8b7f:0x3ffce240 0x400df7a4:0x3ffce2e0 0x400dbbc1:0x3ffce360 0x400e8b7f:0x3ffce380 0x400df7a4:0x3ffce420 0x400dbbc1:0x3ffce450 0x400dbbee:0x3ffce470 0x400f7352:0x3ffce490 0x400f75b5:0x3ffce530 0x400eeea7:0x3ffce550

Rebooting...
���x����xx�������

The relevant code:

def verify_server(socket):
    certHash = binascii.hexlify(
        hashlib.sha256(socket.getpeercert(True)).digest())
    print("Certificate Hash: " + str(certHash))
    for server in server_dict:
        if certHash in server['fingerprints']:
            return True
    return False

def send_to_server(data, url, server_index=0):
    wdt.feed()
    print("Sending " + str(data) + "to server " + url)
    s = socket.socket()
    print("Socket Open")
    try:
        s.settimeout(30)
        # server redundancy should be implemented. #FIXME: This will probably
        # do a dns lookup every time - probably shouldn't.
        ai = socket.getaddrinfo(server_dict[server_index]['address'], 443)
        addr = ai[0][-1]  # cache this
        s.connect(addr)
        s = ssl.wrap_socket(s, cert=sump_cert_2.CERT, key=sump_cert_2.KEY)

        if not verify_server(s):
            print("Certificate doesn't match.")
            s.close()
            return None

        server = server_dict[server_index]
        http_headers = '\r\n'.join([
            "POST {} HTTP/1.1".format(url),
            "Connection: Keep-Alive",
            "Accept: */*",
            "Host: {}".format(server['address']),
            "User-Agent: SafeSumpSeldon",
            "Content-Type: application/x-msgpack",
            "Content-Length: {}".format(len(data)),
            "\r\n"])
        s.write(http_headers)
        s.write(data)
        response = s.read(2048)
    except OSError as e:
        print("Socket has timed out: {}".format(e))
    except Exception as e:
        print("Server connection failed: {}".format(e))
        store_in_backlog(data)
    else:
        return response
        # process_backlog()
    finally:
        s.close()
        print("Socket Close")
    return None

We've been having numerous issues with sockets hardfaulting recently - it's a pretty serious issue for us.

I'm running the latest esp-idf (I got constant #193 errors otherwise).

Thanks!

MrSurly commented 6 years ago

@danicampora That's usually a memory allocation error, if the mbedtls documentation is to be believed.

0xDBFB7 commented 6 years ago

@MrSurly The plot thickens.

Tried running

for i in range(0,500):
    try:
        send_to_server()
    except:
        pass

If the server's off, I get:

Sending test to server /update
Socket Open
Socket has timed out: [Errno 104] ECONNRESET
OSError: 

Interesting that it never gets to s.close()? Weird. Anyhow, gc.mem_free() counts down by 0.5k every loop, and after about 60 cycles I get a string of OSErrors. Adding gc.collect() fixes that.

Even weirder, though, is what happens if the ssl handshake gets interrupted (server gets killed, wifi drops, etc):

Socket Open
mbedtls_ssl_handshake error: -68
Socket has timed out: [Errno 5] EIO
gc.mem_free(): 45552

...

Socket Open
Socket has timed out: [Errno 104] ECONNRESET
42000
Sending test to server 
Socket Open
assertion "0" failed: file "../../py/../extmod/modussl_mbedtls.c", line 151, function: socket_new
abort() was called at PC 0x40131a87 on core 0

Backtrace: 0x400873fb:0x3ffce0f0 0x40087427:0x3ffce110 0x40131a87:0x3ffce130 0x400ecf2b:0x3ffce160 0x400df716:0x3ffce1b0 0x400dbbc1:0x3ffce1e0 0x400dbc29:0x3ffce200 0x400e8c09:0x3ffce220 0x400df7a4:0x3ffce2c0 0x400dbbc1:0x3ffce2f0 0x400e8b7f:0x3ffce310 0x400df7a4:0x3ffce3b0 0x400dbbc1:0x3ffce430 0x400dbbee:0x3ffce450 0x400f7352:0x3ffce470 0x400f759c:0x3ffce510 0x400eeed8:0x3ffce550

Rebooting...
MrSurly commented 6 years ago

Interesting that it never gets to s.close()? Weird. Anyhow, gc.mem_free() counts down by 0.5k every loop, and after about 60 cycles I get a string of OSErrors. Adding gc.collect() fixes that.

These are all related; close de-allocates the memory associated with the mbedtls objects internal to the μPy SSL object, but if it's never called, then there's a memory leak, but one that's outside of the MicroPython heap.

The μPy garbage collector will destroy the μPy SSL object, but the internal mbedtls SSL objects don't get free-ed, because there's no finalizer for μPy SSL objects.

There are two problems to be fixed:

MrSurly commented 6 years ago

I modified your code a bit so I could run it here, and it's working for me. It errors out, but s.close() is being called.

#!/usr/bin/env python3
import socket
import ssl
import binascii
import sys

def send_to_server(data, url, server_index=0):
    print("Sending " + str(data) + "to server " + url)
    s = socket.socket()
    print("Socket Open")
    address = '127.0.0.1'
    try:
        s.settimeout(30)
        # server redundancy should be implemented. #FIXME: This will probably
        # do a dns lookup every time - probably shouldn't.
        ai = socket.getaddrinfo(address, 443)
        addr = ai[0][-1]  # cache this
        s.connect(addr)
        s = ssl.wrap_socket(s, cert=sump_cert_2.CERT, key=sump_cert_2.KEY)

        http_headers = '\r\n'.join([
            "POST {} HTTP/1.1".format(url),
            "Connection: Keep-Alive",
            "Accept: */*",
            "Host: {}".format(address),
            "User-Agent: SafeSumpSeldon",
            "Content-Type: application/x-msgpack",
            "Content-Length: {}".format(len(data)),
            "\r\n"])
        s.write(http_headers)
        s.write(data)
        response = s.read(2048)
    except OSError as e:
        print("Socket has timed out: {}".format(e))
        sys.print_exception(e)
    except Exception as e:
        print("Server connection failed: {}".format(e))
        sys.print_exception(e)
    else:
        return response
        # process_backlog()
    finally:
        s.close()
        print("Socket Close")
    return None

def test():
    for i in range(0, 500):
        try:
            send_to_server('hello', 'strawberry.local')
        except Exception as e:
            print ('outer exception', e)
            sys.print_exception(e)
dpgeorge commented 6 years ago

I've updated the ESP IDF to the latest version (note that it requires updating the xtensa toolchain), so the behaviour of this bug may have changed.

0xDBFB7 commented 6 years ago

@dpgeorge Yep:

I'm not getting that weird socket close behavior, all timeout mem-leaks are gone, however, if the SSL connection is interrupted for some reason, the thing still crashes:

Socket Open
Socket has timed out: [Errno 104] ECONNRESET
Socket Close
129
Sending test to server /update
Socket Open
assertion "0" failed: file "../../py/../extmod/modussl_mbedtls.c", line 151, function: socket_new
abort() was called at PC 0x401323ff on core 0

Rebooting...

Interestingly, it's not unrecoverable anymore - the esp completes rebooting and functions fine again. I think I'm okay with that behavior for now.

Thanks!

dpgeorge commented 6 years ago

@BasedOnTechnology the assertion failing on line 151 looks like it can only happen due to an out-of-memory condition in mbedtls and is not related to the actual socket connection. How exactly do you trigger this error? Is it reproducible?

MrSurly commented 6 years ago

@dpgeorge What do you think about adding finalizer to SSL sockets? Regular sockets already have one. I think that that could definitely be one cause of this, since close for SSL is the only thing that releases mbedtls RAM, and it's not called during GC.

dpgeorge commented 6 years ago

What do you think about adding finalizer to SSL sockets?

Yes it makes sense to do that.

MrSurly commented 6 years ago

I'll add it to mainline, though I don't have HW to test.

0xDBFB7 commented 6 years ago

@dpgeorge Yep, the error's completely reproducible. I have control over the server the board is connecting to - if I kill it (simulating a drop in wifi, say), the board always resets.

dpgeorge commented 6 years ago

Finaliser support for ussl objects was added in a6566fc2e15e8efd856b39c7ee2132fcdad3450c

0xDBFB7 commented 6 years ago

Sorry guys, but that didn't seem to do it.

[I kill the server]

Socket Open
mbedtls_ssl_handshake error: -7280
Socket has timed out: [Errno 5] EIO
Socket Close

[that's normal.]

[I kill the server in the middle of a handshake]

Socket Open
mbedtls_ssl_handshake error: -68
Socket Close

Socket Open
mbedtls_ssl_handshake error: -10
Socket has timed out: [Errno 5] EIO
Socket Close
Traceback (most recent call last):
  File "main.py", line 550, in <module>
  File "main.py", line 527, in process_data
AttributeError: 'NoneType' object has no attribute 'decode'
I (303267) wifi: bcn_timout,ap_probe_send_start
I (305767) wifi: ap_probe_send over, resett wifi status to disassoc
I (305767) wifi: state: run -> init (1)
I (305767) wifi: n:5 0, o:5 0, ap:255 255, sta:5 0, prof:1
I (305767) wifi: pm stop, total sleep time: 0/297598933

[this isn't unrelated, I never get these messages otherwise]

I (305767) wifi: STA_DISCONNECTED, reason:200
beacon timeout
I (308187) wifi: STA_DISCONNECTED, reason:201
no AP found
I (310597) wifi: STA_DISCONNECTED, reason:201
no AP found
I (313007) wifi: STA_DISCONNECTED, reason:201
no AP found
I (315417) wifi: STA_DISCONNECTED, reason:201
no AP found

[server's back up]

Socket Open
mbedtls_ssl_handshake error: -4290
Socket has timed out: [Errno 5] EIO
Socket Close

[and now I can never get a handshake again]

Socket Open
Socket has timed out: 118
Socket Close
0xDBFB7 commented 6 years ago

Here's the bare minimum code that still reproduces this issue.

Make an ssl server with openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem -days 365 -nodes openssl s_server -key key.pem -cert cert.pem -accept 0.0.0.0:8000 -www

With the latest daily build from micropython.org, run

import network
import socket
import ussl
import utime
wlan = network.WLAN(network.STA_IF)
wlan.active(True)
wlan.connect(wifinetwork, password)
utime.sleep(10)
while(True):
    try:
        s = socket.socket()
        s.settimeout(30)
        ai = socket.getaddrinfo('192.168.0.42', 8000)
        addr = ai[0][-1]
        s.connect(addr)
        s = ussl.wrap_socket(s)
        s.close()
    except:
                s.close()
        print('Timeout')

Depending on if you kill the server with Ctrl-C after the 10 second timeout, restart the server, disconnect wifi, or just leave it running, you'll see a bunch of different behaviors.

If I just leave it running with the server off, for instance, eventually I get a string of

I (89158) wifi: bcn_timout,ap_probe_send_start
I (91658) wifi: ap_probe_send over, resett wifi status to disassoc
I (91658) wifi: state: run -> init (1)
I (91668) wifi: n:5 0, o:5 0, ap:255 255, sta:5 0, prof:1
I (91668) wifi: pm stop, total sleep time: 0/87161565

I (91668) wifi: STA_DISCONNECTED, reason:200
beacon timeout
I (94088) wifi: STA_DISCONNECTED, reason:201
no AP found

Thanks for looking into this.

0xDBFB7 commented 6 years ago

Adding gc.collect() seems to completely solve the issue now.

MrSurly commented 6 years ago

Adding gc.collect() seems to completely solve the issue now.

That's odd. I've seen that before, but only with the Pycom fork.

Edit: Maybe because there aren't enough µPy objects to trigger the GC, but they are consuming considerable FreeRTOS heap memory. Which is odd, because calling close() should perform the same cleanup that gc.collect() would trigger via the finaliser.

dpgeorge commented 6 years ago

I think I found the issue: there's only enough RAM (on the FreeRTOS heap) to create 2 SSL connections. If you try to create a third while the others are active then it will hit an assertion because mbedtls can't allocate any more memory.

And this case is being hit in the above test by the following sequence of events:

  1. server dies during a handshake, leading to a memory leak on esp32
  2. server restarts, connections can be made
  3. server dies again during a handshake, leading to another memory leak
  4. server restarts
  5. next connection fails due to out-of-memory

The fix (applied temporarily in this repo in 54c6ebcfd72cae2b3000987f99651e9c8015bf18, @BasedOnTechnology please can you test this fix?) is to clean up the mbedtls state when there is an error during the creation of the ssl object.

0xDBFB7 commented 6 years ago

@dpgeorge Thanks for the fix, I haven't encountered any issues with that commit. Tentatively, it looks like it's solved the problem.

dpgeorge commented 5 years ago

The fix mentioned just above was merged upstream some time ago.