micropython / micropython-lib

Core Python libraries ported to MicroPython

Memory Leak in uasyncio #353

Closed t35tB0t closed 2 months ago

t35tB0t commented 5 years ago

The yield of IORead(self.polls) isn't aware that socket connections have been reset and will keep trying forever. This results in orphaned co-routines stacking up with each new reset connection (likely on self.objmap). This usually happens on the readline call in the handler (there will never be a line of incoming data to read, so the code is stuck yielding over and over on that statement). Eventually execution halts the system with an OOM allocation error.

Also, the start_server method in uasyncio does not have exception handling. As-is, a single connection which is suddenly reset in the middle of an HTTP session will throw an exception and terminate the app.

Adding exception handling to the code allows new connections after a failed one. Hitting the server with about 250 connections that are reset after 10 ms exposes the IORead(s) memory leak issue.

Attached code was run on F429 and F767 boards with DFU from MicroPython v1.11-292-g5641aa55d on 2019-09-14: AsyncLeak.zip

Python aggressor script socket_test: socket_test.zip

Run under Python 3.4.3 (Anaconda 64-bit, win32):

import socket_test as st
st.cGet('10.0.0.5', 80, 0.010, 250)

(micropython gives a memory allocation error after about 240 iterations)

I'll guess that this might be stabilized with proper error handling but as far as I can tell, the modification would need to be in the asyncio library itself.

peterhinch commented 5 years ago

I don't think the StreamReader can be expected to detect a dropped connection. It is routine for a nonblocking socket to have no data to return and uasyncio handles this correctly. However if new connections keep being instantiated with none being closed the system will run out of RAM, if only because socket instances and corresponding objmap entries grow without limit.

As I understand it the only way to determine whether the link is down (rather than the client being temporarily inactive) is via a timeout. This can be implemented in the client_coro: if no data has been received for a given period the connection is assumed to have been dropped and the application code should react accordingly.
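
For illustration, a minimal sketch of that timeout idea inside the client_coro. It uses the current asyncio API (wait_for is not in uasyncio V2, which would need an equivalent helper), and IDLE_TIMEOUT is a placeholder value:

import asyncio  # current MicroPython asyncio; uasyncio V2 has no wait_for

IDLE_TIMEOUT = 30  # seconds of silence before the peer is presumed gone

async def client_coro(reader, writer):
    try:
        while True:
            try:
                line = await asyncio.wait_for(reader.readline(), IDLE_TIMEOUT)
            except asyncio.TimeoutError:
                break            # nothing received for IDLE_TIMEOUT: treat the link as dropped
            if not line:
                break            # peer closed the socket cleanly
            # ... process the received line here ...
    finally:
        writer.close()           # always release the socket so the poll map cannot grow
        await writer.wait_closed()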

If this analysis is correct the bug is in Picoweb.

t35tB0t commented 5 years ago

Thank you Peter.  I understand that the -lib gets less attention than core, so your attention here is greatly appreciated.  At this point I'm wondering if this may be a core module issue in uselect.c rather than in uasyncio?

Since the first posting, I've been digging deeper into the issue.  Agreed, the client app (e.g. Picoweb) should be handling socket exceptions, but it needs those exceptions bubbled up from the ioctl calls in C-land.  We are being blocked on the uasyncio.StreamReader.readline() call deep down.   The way yielding works, the client has absolutely no control here once it's made a yielded call that never returns.   Sure, a client can check to see if a socket connection has been reset before it makes the yield call, but this just creates a race condition and in no way solves the issue, because a socket reset may occur after the client checks.  IMHO, we must have exceptions passed up through the call stack so that the yield can be wrapped with a try/except handler at the client layer.  Then the client can be responsible for cleaning things up.  Timeouts are inefficient and only a secondary mechanism at best (and I couldn't get them to change the behavior here).
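
To make that concrete, this is the shape of handler one would like to be able to write once errors propagate (hypothetical today, since the readline simply never returns on a reset peer; aclose() is the uasyncio V2 StreamWriter API):

async def handler(reader, writer):
    try:
        line = await reader.readline()   # should raise once the lower layers report the reset
        # ... process the request line ...
    except OSError as e:
        print('peer error:', e)          # e.g. a connection-reset errno, once it bubbles up
    finally:
        await writer.aclose()            # let the application own the cleanup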

Looking at my test case more closely, I see that the code ends up spinning on the uselect.poll.ipoll() waiting for some I/O to happen.  It's spinning on socket connections that have been reset.  IMHO, that is not good.  With a simple instrumented example, I can see that each and every reset connection adds a zombie object to the uasyncio objmap.  They will never get purged.  And the client code that launched the yield request will never have control returned to its layer, so it can't do anything about it.
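
One way to instrument this is a monkey-patch along the following lines (a rough sketch assuming the uasyncio V2 internals, i.e. a PollEventLoop that keeps registered sockets in self.objmap; editing wait() in uasyncio.py directly achieves the same thing):

import uasyncio as asyncio

_orig_wait = asyncio.PollEventLoop.wait

def _traced_wait(self, delay):
    # print how many sockets are registered with the poller before each wait;
    # zombie entries from reset connections show up as monotonic growth
    print('objmap entries:', len(self.objmap))
    return _orig_wait(self, delay)

asyncio.PollEventLoop.wait = _traced_wait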

Looking at the uselect.c code, in poll_map_poll() we see the ioctl call and the following mp_raise_OSError(errcode) lines, but I haven't dug deeper yet to understand why we're not getting that exception raised up through the mPy call stack, or why the ioctl isn't detecting a reset connection as a socket error.

If there is something I'm doing wrong in setting things up here, it would be helpful to know.  The out-of-the-box examples are simply "broken" in that they are highly vulnerable to real-world network misbehaviors.  The first indication that we had a serious issue was when we held down the F5 refresh key in Windows 7 Internet Explorer (I know - use Firefox or anything else...).  IE's behavior on the wire quickly crashed a simple Picoweb example.  A simple Python socket client that resets the connection after a ~40 ms delay makes this a very repeatable event - locking up the code on the StreamReader.readline() call.

peterhinch commented 5 years ago

I should stress that I'm not expert on the subject of socket programming. The question at the core of this issue is this:

Regardless of language or system implementation, is it possible to distinguish between a socket whose peer has disconnected and one whose peer is connected but idle?

This is a question which has troubled me for some time and I don't know the definitive answer. I strongly suspect that the answer is that you cannot differentiate. I've yet to see any documentation or code which suggests that there is any way to do this, and everybody seems to use timeouts. If you know a better way I'd love to know how this can be achieved as I share your dislike of timeouts.

Another thing to bear in mind if considering changes to the StreamReader is that the "socket" can be any object with an ioctl. Some object types may report "no data available" for very long periods indeed.

t35tB0t commented 5 years ago

Yes, it is possible to determine the actual state of sockets, but it is awkward.  LWIP very much knows what the state of a socket is. If you do a print(socket) you'll see the socket state as part of a long status string.  AFAIK, "3" means open, and negative values such as "-14" and "-17" indicate things such as connection reset or closed.  Unfortunately, the socket state doesn't seem to be directly exposed any other way, and parsing the verbose return is inefficient.  Most users will check a socket connection by reading zero bytes and trapping the returned exception.  That doesn't seem very clean IMHO and it certainly doesn't really tell us the actual state of a socket.  As such, it would be very helpful to have socket state exposed more directly so user code can take more intelligent actions.
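
For example, a fragile workaround would be to scrape the state out of the repr (this sketch assumes the lwip port's print format includes a "state=N" field, which is an implementation detail and may change between ports and versions):

def lwip_socket_state(sock):
    # e.g. repr(sock) may look like "<socket state=3 ...>" on the lwip port
    for field in repr(sock).strip('<>').split():
        if field.startswith('state='):
            return int(field.split('=')[1])
    return None  # state not exposed by this port's repr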

However, the issue here with co-routines in microPython is a failure to throw exceptions on socket resets or other errors.  This makes the very powerful uasyncio sub-system useless with networking; yet networking amongst slow devices really needs co-routines.  Again, it's possible that I'm setting something up wrong, but I'm pretty much running examples from those who have written and published this before me, and I've yet to see an async socket example I can't trivially crash from a remote client.

At this point, the code is definitely blocking at the StreamReader's call to uselect's ipoll.  So, unless I'm missing something here, the C code is looping on the mapped objects, doing ioctls, and isn't reacting to reset connections.  If so, then the first change will need to be in or around uselect.c.  Unfortunately, I'm struggling without a debugger to wade through that code.

t35tB0t commented 5 years ago

At this point in debugging, it is clear that an ioctl call to a socket in state = -14 (RESET) does not return an error.  It returns "no data available".  And to me, this is quite wrong.  A reset socket connection is done - gone.  So, I would expect any further attempt to read from that socket to throw an exception.  Yet, the ioctl call in uselect.c happily keeps querying this socket in state = -14, returning nothing that breaks the deadlock.

I'm sorry to say, but this really does appear to be a bug inside mPy's c-code, deeper down than uselect.c.  I can't think of an acceptable work-around at the Python level.

peterhinch commented 5 years ago

Thanks for the information. Alas I don't have the know-how to help with your issue.

I understand that the -lib gets less attention than core

Indeed. One avenue might be to try the pycopy fork and its associated library. Assuming the problem occurs there you might raise an issue against that. Paul Sokolovsky is the original author of most of the MicroPython library and he may be able to provide a prompt response.

peterhinch commented 5 years ago

The following probably has no bearing on your issue but does relate to memory leaks. With Kevin Köck I spent some time studying nonblocking sockets for GitHub projects. Our empirical experience of detecting client outages seems not to match some of your statements and I'd like to resolve this.

Test scripts are below: they are derived from an application but simplified to remove uasyncio and select from the handling of client sockets. Reading of nonblocking sockets is also simplified. Outcomes are the same in CPython 3.6.8 and Micropython, except where stated. These are as follows:

  1. Issuing print(socket_instance) produces nothing useful in MicroPython. In CPython its output is identical whether the peer is open or closed.
  2. The only exception ever thrown by the server-side client socket is 11 (indicating that it would block). The server issues client_socket.read(n). This throws 11 if the peer socket is open but no data is available and returns '' if the peer is closed: this is the obvious way to detect client termination assuming the client closes the socket even under error conditions.
  3. A badly written client which fails to close sockets will therefore cause trouble not only to itself but also to the server because the server will never "know" when to close the client socket instance. Both server and client will consume RAM ad nauseam as new sockets are instantiated.
    1. If the server fails and closes the socket the client sees a 104 exception (Connection reset by peer). I find this asymmetry surprising.

We encountered situations such as connectivity outages where 3. occurs even with a properly written client. We therefore concluded that to handle this the client should issue keepalive packets and the server should implement a timeout. This applies in the opposite direction too: the client needs to detect outages to avoid memory leaks. I would love to be proved wrong on this. This article describes the efforts we made on these issues.

Server script (both scripts were run on the same PC):

import sys
upython = sys.implementation.name == 'micropython'
if upython:
    import usocket as socket
    import uselect as select
    import utime as time
else:
    import socket
    import select
    import time

PORT = 8123

addr = socket.getaddrinfo('0.0.0.0', PORT, 0, socket.SOCK_STREAM)[0][-1]
s_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s_sock.bind(addr)
s_sock.listen(5)
print('Awaiting connection on', PORT)
poller = select.poll()
poller.register(s_sock, select.POLLIN)
while True:
    res = poller.poll(1)
    if res:
        print('Got connection')
        sock, _ = s_sock.accept()  # get client socket
        sock.setblocking(False)
        while True:
            try:
                d = sock.recv(20)
            except Exception as e:
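                # errno 11 is EAGAIN: the nonblocking socket simply has no data yet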
                if e.args[0] == 11:
                    print('Wait for data.', sock)
                else:
                    print(e)  # Never happens
                d = None
                time.sleep(0.25)
            if d is not None:
                if d == b'':  # Client has closed socket
                    print('Disconnect: closing socket.', sock)
                    sock.close()
                    break
                else:
                    print(d, end='')
        print('Awaiting connection on', PORT)

    time.sleep(0.2)

Client script:

import sys
upython = sys.implementation.name == 'micropython'
if upython:
    import usocket as socket
    import utime as time
else:
    import socket
    import time

PORT = 8123
SERVER = '0.0.0.0'

sock = socket.socket()
try:
    serv = socket.getaddrinfo(SERVER, PORT)[0][-1]
    sock.connect(serv)
except OSError:
    print('Connect fail.')
else:
    print('Connected')
    sock.setblocking(False)
    try:
        for n in range(10):
            s = ''.join((str(n), '\n'))
            sock.send(s.encode())
            print('Sent: ', s, end='')
            time.sleep(1)
    except Exception as e:
        print(e)  # Get 104 (Connection reset by peer) if server closes socket
    finally:
        sock.close()
        print('Closed socket')

t35tB0t commented 5 years ago

Thank you for the details. It will take a bit of time to parse here. There is no doubt that asyncio has error handling issues. And there is no doubt the modlwip.c ioctl is not doing what is expected from select.poll().

I modified uasyncio and this fixed the memory leaks where a TCP client was nastily resetting the connection after issuing an HTTP request. Picoweb needed some tweaks to its exception handling. Unfortunately, uasyncio needs a lot more work. By tuning my adversarial client, we can craft packet sequences that trigger exceptions in various sections of uasyncio.

Pycopy appears to have the same issues.

At this point, the only way I can make my application solid is to fix uasyncio and picoweb. In order to make it robust, we need to be able to ask LWIP about socket state. Many folks are using roundabout ways of inferring state, but LWIP holds this state itself, so why not just get it directly?

Your suggestions regarding keep-alives and timeouts are noted and I will incorporate those where possible. But these are additional layers of protection, not a core solution. This is especially true when we have no control over the behavior of the other side of the connection.

After I've completed testing with adversarial clients and heavy loads, I will submit the exception-handling-enhanced uasyncio, picoweb and modlwip.c code here, with a test case that demonstrates the effect.

peterhinch commented 5 years ago

A note of caution about submitting PR's fixing behaviour in the presence of deliberately adversarial clients. It may be that the maintainers decide the changes are too costly in terms of RAM usage. The uasyncio and picoweb libraries are designed to run on ESP8266 targets which are very RAM-constrained.

If changes are substantial, before committing effort to PR's it might be worth raising an RFC. I have found it hard to get even bugfix uasyncio PR's accepted, to the point where I maintain my own fork. This is primarily aimed at bare metal applications requiring high speed I/O rather than the kind of web-centric changes you are proposing.

Paul Sokolovsky is the uasyncio author. His library is more actively maintained than the official one which is at V2.0. Paul and I have (by different means) fixed the most egregious bug relating to task cancellation which lives on in V2.0.

But please keep us posted with your results.

t35tB0t commented 5 years ago

The fact is that we can probably readily crash any micropython device running stock micropython asyncio from any of the known repositories simply by sending correctly crafted packets.

I will do what is necessary to fix these and will happily share the polished mods in any way that pleases the community.   I'm not super familiar with the PR vs RFC process and I appreciate your suggestions regarding developers and related parties.  From my perspective, the whole lot of you are awesome.  What you all have collectively accomplished with micropython and its libraries is an amazing and a significant contribution to IoT.  Making this robust and fault tolerant is going to be necessary if this is going to be truly useful beyond the hobby world.  So, I'll pop-in when I think I've found something that helps with that.  There are some other suggestions here - like breaking up the picoweb run function so that it can be included as just-another-task in a more generalized task loop.  But, I'll focus on robustness first.

After doing my research, coding, and testing, I am 100% certain that the code I'm touching is proof of critical bugs in picoweb, uasyncio, and modlwip.c.  These are readily fixed in a way that results in less code overall and less memory usage.  And I've identified some code-refactoring optimizations that will improve speed a bit.  I've already recompiled micropython with a fixed modlwip.c and also fixed most of the exception handling issues in uasyncio and picoweb.   The system is now super stable and far more difficult to crash.  SD card files are being served by a picoweb app in excess of 20 Mbits/sec while multiple tasks are being executed.  Later we'll work to improve the transfer speeds closer to the 85 Mbits/sec raw ethernet speed achievable on an F767.

But first, we must make the device un-crashable from external I/O.

Below is a client code snippet for those who need more direct proof.  Run it against your favorite micropython server and see how long it survives.  With zero mods, I was able to crash the main micropython from a dumb internet explorer browser and artful pounding on the keyboard.  With some hardening, the script below can run for thousands of iterations and only rarely hits the remaining soft spots (which cause memory leaks).  Once I've weeded out those issues, this kind of client activity won't be able to crash a micropython device.

Finally, I'm pondering your question regarding "silent connections" vs "abandoned connections".  We did see some substantial mpy stack improvements this year with regards to abandoned connections.  I was previously able to spam a unit with lots of abandoned connections and quickly crash it.  But your challenging scenario remains a valid concern.  The TCP timeouts should be adjustable and managed within the socket connection sub-system so the user code remains clean.  Shorter timeouts and fewer allowed connections will help to manage memory usage bloat.  But, agreed, this kind of environmental use-case tuning is not ideal.

I haven't explored using keep-alives, but we must assume we don't have any control over the other end of the connection and it might not support them.  So, your current approach of keep-alive probing with timeouts as a backup seems reasonable.  If implemented in the application layer, it does cloud the coding a bit.  Handled in the TCP layer, the SO_KEEPALIVE sockopt needs to be settable from an mpy app.  Nevertheless, it's worth pursuing.

Finally, you mentioned that print(socket_instance) doesn't show anything useful - yet here, we're seeing a very useful socket state value printed out.  And if we could access this socket state directly, we certainly could write more intelligent state-aware code.  IMHO, this could actually result in overall savings.

#!/usr/bin/env python
# socket_test.py
# Nasty HTTP Client
#
# Run with this:   python -c "from socket_test import *;cRandom('10.0.0.5', 80, 0, 100, 1000)"

import socket
import struct
import time
import random

# Nasty bugger of an HTTP client which drops TCP connections some random
# time after sending a friendly HTTP GET request.
# Suggested random intervals are 0 to 100 (ms) with 1000 iterations.
# Turning debugging on slows things down; a suggestion is to add the following
# into uasyncio.wait() to show the memory leak:
#     print('Number of yielded coroutines: %d' % (len(self.objmap)))
# The picoweb app referenced in the HTTP GET simply toggles an LED.

def cRandom(host, port, min, max, iterations):
    from random import randint
    for i in range(iterations):
        try:
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0)
            s.connect((host, port))
            Lgr_onoff = 1
            Lgr_linger = 0
            s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                         struct.pack('ii', Lgr_onoff, Lgr_linger))
            s.send(b'GET /toggle HTTP/1.0\r\n\r\n')
        except Exception as e:
            print('Pass:  ', i, e)
        finally:
            time.sleep(randint(min, max) / 1000)
            s.close()  # this will actually issue a socket RESET since we've set the linger option

# Note: a WireShark trace makes things pretty clear

t35tB0t commented 5 years ago

UPDATE:  Success!

micropython co-routines performing i/o with sockets are now more robust in that they can handle socket error conditions without hanging or leaking memory.

As published in both micropython and pycopy, remote connection resets and other errors will hang the yielding co-routine forever.  Eventually, memory will fill up and the system will become unusable or crash.  This is because socket errors are NOT returned unsolicited AND because uasyncio does not have exception handling.  Adding PicoWeb on top of these two just made matters worse.  The vulnerability is severe and I am suggesting that nobody use sockets with uasyncio until these fixes are implemented.  If anyone is interested, just let me know and I can push the three fixes here or elsewhere as desired.

The issues are:

modlwip.c      Not POSIX compliant - i.e. it requires the event flag bits HUP and ERR to be set in the poll request in order to return these socket error conditions.  Modifying modlwip.c makes this perform as expected. A work-around if you can't re-compile micropython is to modify uasyncio.py and set the flags there.

uasyncio.py     Missing exception handling in the start_server loop.  A socket which receives a remote reset between the IORead(s) yield and the s.accept() will throw an unhandled exception which terminates the server.  Yes, the caller can trap this exception and re-start the server, but that requires re-initializing the entire server and is inefficient - the proper fix is to add exception handling in the start_server() method (a rough sketch of the idea follows the picoweb item below).

picoweb.py     Defective exception handling.  Requires modifications in the handle() method.  The handler is there but it isn't closing the read and write sockets properly.  Failing to clean up leaks memory - badly.
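
For illustration only (not the actual diff being prepared), a hedged sketch of what exception handling around accept() in a V2-style start_server loop could look like; IORead/StreamReader/StreamWriter follow the uasyncio V2 layout and the constructor arguments are simplified:

import usocket as _socket
from uasyncio import IORead, StreamReader, StreamWriter  # uasyncio V2 names

def start_server(client_coro, host, port, backlog=10):
    ai = _socket.getaddrinfo(host, port)
    s = _socket.socket()
    s.setblocking(False)
    s.setsockopt(_socket.SOL_SOCKET, _socket.SO_REUSEADDR, 1)
    s.bind(ai[0][-1])
    s.listen(backlog)
    while True:
        yield IORead(s)
        try:
            s2, client_addr = s.accept()
        except OSError:
            # peer reset between the poll wake-up and accept():
            # skip it and keep the server coroutine alive
            continue
        s2.setblocking(False)
        yield client_coro(StreamReader(s2), StreamWriter(s2, {"peername": client_addr}))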

kevinkk525 commented 5 years ago

That sounds like good news! My guess is, it would be best to fix modlwip.c but if a PR for that doesn't work, maybe uasyncio is the next best option.

t35tB0t commented 5 years ago

Caution:  Fixing modlwip.c alone will break uasyncio and downstream apps, like picoweb.  We did that mod first and experienced the dramatic results.  Adding the related fixes in the other modules lets our little STM32 ride through millions of nasty socket connections while still smoothly serving multiple friendly clients.  (just sayin...)

However, IMHO, asyncio socket user apps are already all broken due to their vulnerability to simply leaning on IE's F5 browser refresh key, FWIW.  Also, reading around, I'm surprised at the number of awkward work-arounds folks seem to have unwittingly implemented to deal with not knowing the state of a socket and/or connection.  Peter Hinch has noted the challenge of dealing with silently dropped connections (without piling things up in memory).  For these small devices, shorter timeouts and keep-alive mechanisms, also noted by Peter, are likely the best avenue.  However, user-space keep-alives shouldn't be necessary for TCP connections.  We should be able to activate TCP KEEP_ALIVEs in the LWIP stack.  I haven't dug deeper into that one yet, but it is nominally just a socket option one can set and should "just work" as long as the remote device supports it.  The fallback until we have that, of course, is in the realm of Peter Hinch's solutions.  Perhaps an mPy-layer keep-alive mechanism could be added somewhere more re-usable, like in uasyncio, for example?  A place where the uasyncio call-back engine somewhat knows co-routine states and has access to the read and write socket objects?  Hmm...  it's a possibility.  The motivation is to make the user-level coding more compact and readable while making the socket connections (including over wifi) more robust and reliable.
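
For reference, this is all it takes on CPython (sketch only; whether the MicroPython lwip port honours SO_KEEPALIVE is port/version dependent, and as noted later in this thread it did not appear to work at the time):

import socket

def enable_keepalive(sock):
    # ask the TCP stack itself to probe idle peers, instead of a user-space keepalive
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)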

So, regarding the current system-crashing issue...  it may have been the modlwip author's intention to make this more flexible by giving uasyncio the option to add the POLL_HUP and POLL_ERR flags up there.  But it never happened.  Still, reading the expected behavior in Linux and other POSIX systems, I believe that modlwip must return these errors unsolicited.  Then uasyncio, apps, and frameworks like picoweb must handle the resulting exceptions.  Hence my consternation with the fragility of the current state of micropython.  IMHO sockets, uasyncio and even picoweb are very critical to the usefulness of micropython.  Oddly, the circuitpython folks seem to think otherwise, even though beginners don't have to use co-routines and advanced IoT programmers would be limiting the usefulness of their device if they don't.  uasyncio is very Pythonic and probably one of the best features and the closest thing to a micro-kernel that micropython has.  It is an especially good match for the hardware capabilities of the ESP and STM32 devices.

We are focused on STM32 with 100BT ethernet at the moment, so the testing and use-cases are focused on those devices.  This is primarily because of the price/power performance of the STM32F767ZI device combined with the fact that it is strongly connected with the PyBoard as a reference design for micropython.  The wifi capabilities and low price-point of the ESP devices are also attractive; however, batteries are a bane and wifi environments can be unreliable.
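
To make the flags point concrete, here is a sketch of the Python-level workaround (assuming the uasyncio V2 PollEventLoop layout; the real fix belongs in modlwip.c as argued above):

import uselect as select

ERROR_EVENTS = select.POLLHUP | select.POLLERR

# In uasyncio's PollEventLoop.add_reader()/add_writer(), instead of
#     self.poller.register(sock, select.POLLIN)
# register the error conditions explicitly:
#     self.poller.register(sock, select.POLLIN | ERROR_EVENTS)
# and have wait() deliver an exception to the waiting coroutine when the
# returned event mask contains ERROR_EVENTS, so handlers can close the socket.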

kevinkk525 commented 5 years ago

Oh you need the modlwip fix and uasyncio fix, got it. I guess that would be beneficial.

During the development of micropython-iot with Peter I did try the TCP KEEP_ALIVE mechanism which works great on CPython but did not work on micropython sadly. If that worked, it would indeed make socket implementations easier and remove the need for a user keepalive.

Thank you for digging into this and trying to fix the bugs.

jimmo commented 5 years ago

@t35tB0t Thank you for this super detailed report!

If you'd be willing to post the diffs that you made to modlwip.c and uasyncio.py here I'd be more than happy to do the work to get this merged and released (and manage the communication about the compatibility). Otherwise I'll start with your test case and see what I can do myself.

t35tB0t commented 5 years ago

@jimmo Thanks for pitching in. Below is a link to the modlwip.c changes. I'll need a day here to get time to clean up and post diffs for uasyncio.py and picoweb.py.

see: https://github.com/micropython/micropython/issues/5172

jonnor commented 2 months ago

I believe this issue has been fixed - details in https://github.com/micropython/micropython/issues/5172#issuecomment-548194591

peterhinch commented 2 months ago

Indeed. Further, it relates to the long-obsolete uasyncio V2.