miguelgrinberg / python-engineio

Python Engine.IO server and client
MIT License

Memory leak: engineio.Socket object never freed in error conditions #328

Closed pierrehebert closed 1 year ago

pierrehebert commented 1 year ago

Hello,

Bug description I'm facing memory leak issues when using python-engineio inside a Flask app, with Flask-SocketIO and either gevent or eventlet. This is easily reproducible using the sample below.

I believe this is closely related to #323 , but there is more to it, which is why I'm opening a new issue. There are multiple problems in one, and probably not all of them are in python-engineio. I'll honestly say that my understanding is not perfect, so I apologize if my assumptions or conclusions are wrong.

Also I'm sorry for this long bug report...

Setup Using Arch Linux with python 3.11.3 and this requirements.txt file:

Flask
Flask-SocketIO
gevent
gevent-websocket
eventlet
gunicorn

For reference, pip list gives:

Package          Version
---------------- -------
bidict           0.22.1
blinker          1.6.2
click            8.1.7
dnspython        2.4.2
eventlet         0.33.3
Flask            2.3.3
Flask-SocketIO   5.3.5
gevent           23.7.0
gevent-websocket 0.10.1
greenlet         2.0.2
gunicorn         21.2.0
itsdangerous     2.1.2
Jinja2           3.1.2
MarkupSafe       2.1.3
packaging        23.1
pip              22.3.1
python-engineio  4.6.1
python-socketio  5.8.0
setuptools       65.5.0
six              1.16.0
Werkzeug         2.3.7
zope.event       5.0
zope.interface   6.0

The python code below creates an app that can use either gevent or eventlet, either in standalone mode or using gunicorn. This app creates a single web page displaying a live count of connected clients.

#!/usr/bin/env python3
import sys

from flask import Flask, request
from flask_socketio import SocketIO

connections = set()

def build_app(async_mode, standalone=False):
    if async_mode == 'gevent':
        import gevent
    else:
        import eventlet

    app = Flask(__name__)
    socketio = SocketIO(app, async_mode=async_mode, ping_timeout=5)

    @socketio.on('connect')
    def socketio_handle_connect():
        sid = request.sid
        connections.add(sid)
        print('connect, connections count:', len(connections))
        socketio.emit('connections', list(connections))

    @socketio.on('disconnect')
    def socketio_handle_disconnect():
        connections.remove(request.sid)
        print('disconnect, connections count:', len(connections))

        # sometimes triggers an unrecoverable recursion error, so defer the call using spawn
        # socketio.emit('connections', list(connections))

        if async_mode == 'gevent':
            gevent.spawn(socketio.emit, 'connections', list(connections))
        else:
            eventlet.spawn(socketio.emit, 'connections', list(connections))

    @app.route('/')
    def index():
        return '''
<html>
    <body>
        <p style="font-size: 4em;">Status: <span id="status"></span></p>
        <p style="font-size: 4em;">Connection count: <span id="count"></span></p>
        <script src="https://cdnjs.cloudflare.com/ajax/libs/socket.io/4.7.2/socket.io.min.js" integrity="sha512-Xm9qbB6Pu06k3PUwPj785dyTl6oHxgsv9nHp7ej7nCpAqGZT3OZpsELuCYX05DdonFpTlBpXMOxjavIAIUwr0w==" crossorigin="anonymous" referrerpolicy="no-referrer"></script>
        <script>
            var socket = io();
            socket.on('connect', function() {
                document.getElementById('status').innerHTML = 'connected';
            });
            socket.on('disconnect', function() {
                document.getElementById('status').innerHTML = 'disconnected';
            });

            socket.on('connections', function(connections) {
                document.getElementById('count').innerHTML = connections.length;
            });
        </script>
    </body>
</html>        
'''

    return (app, socketio) if standalone else app

def main(async_mode):
    app, socketio = build_app(async_mode=async_mode, standalone=True)
    socketio.run(app, host='127.0.0.1', port=5001)

if __name__ == '__main__':
    main(sys.argv[1])

Using a venv and the aforementioned requirements.txt, the app can be launched these ways:

./main.py eventlet
./main.py gevent
./venv-3.11.3/bin/gunicorn -b 127.0.0.1:5001 --timeout 0 -k geventwebsocket.gunicorn.workers.GeventWebSocketWorker "main:build_app('gevent')"
./venv-3.11.3/bin/gunicorn -b 127.0.0.1:5001 --timeout 0 -k eventlet "main:build_app('eventlet')"

Steps to trigger the leak Open a browser with a few tabs on http://127.0.0.1:5001/. This works correctly and the displayed counter reflects the exact count of connections. Now duplicate the tabs to get between 30 and 50 connections. Select all tabs and reload them all at once with Ctrl-R multiple times to stress the server. The connection count will grow above the number of tabs, but it eventually decreases and reaches the right count again. Watching the process RSS, however, shows constantly growing memory use, and the process is ultimately killed by the OOM killer. The underlying socket file descriptors have been closed and the server has been given enough time to handle timeouts, but the RAM is never released.
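To watch the growth without an external tool, the RSS figure can be read directly from /proc. This is a Linux-only sketch (the helper name is mine, not part of the project); point it at the server's pid while reloading the tabs:

```python
import os

def rss_kib(pid):
    """Read a process's resident set size (VmRSS, in KiB) from /proc (Linux)."""
    with open(f'/proc/{pid}/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])  # /proc reports the value in kB
    raise RuntimeError('VmRSS not found')

# Example: read our own RSS; substitute the server's pid when monitoring it.
print(rss_kib(os.getpid()), 'KiB')
```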

Another, slower way to reproduce the issue is to simulate a network disconnection using iptables:

# block the traffic
iptables -A OUTPUT -p tcp --dport 5001 -j DROP
# release the traffic after a while
iptables -D OUTPUT -p tcp --dport 5001 -j DROP

Initially I caught the issue because of random network disconnections, but the 50-tab trick is a better and faster way to reproduce the bug.

Please note that if the server is not stressed a bit, the leak doesn't seem to appear frequently. So the issue presumably appears in a context of errors, such as a timeout or an aborted connection.

Preliminary analysis Using tracemalloc and objgraph shows that the number of instances of engineio.Socket objects grows. This can be monitored using this patch:

diff --git a/venv-3.11.3/lib/python3.11/site-packages/engineio/server.py b/venv-3.11.3/lib/python3.11/site-packages/engineio/server.py
--- a/venv-3.11.3/lib/python3.11/site-packages/engineio/server.py   (revision d8405a735a6fe2508275790543c98238036cc4b6)
+++ b/venv-3.11.3/lib/python3.11/site-packages/engineio/server.py   (date 1693468463117)
@@ -775,6 +775,7 @@

             try:
                 # iterate over the current clients
+                print('sockets count', len(self.sockets))
                 for s in self.sockets.copy().values():
                     if not s.closing and not s.closed:
                         s.check_ping_timeout()
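For reference, the kind of per-type count that objgraph reports can be approximated with the stdlib alone via the garbage collector. Leaky below is a stand-in for engineio.Socket; the helper name is illustrative:

```python
import gc

def count_instances(type_name):
    """Count live objects whose type name matches, similar to objgraph.count()."""
    return sum(1 for obj in gc.get_objects()
               if type(obj).__name__ == type_name)

class Leaky:  # stand-in for engineio.Socket
    pass

kept = [Leaky() for _ in range(3)]  # simulate sockets held alive somewhere
print(count_instances('Leaky'))
```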

Because this object has a packet queue, and the messages sent in the sample are rather large and broadcast to all clients, the memory leak rapidly becomes visible.

After removing the @socketio.on('connect') and @socketio.on('disconnect') handlers, the leak is still present, although a lot less noticeable of course.

A simple (but obviously unsatisfying) workaround is to call del self.sockets[sid] somewhere in _service_task, where check_ping_timeout is called.

This improves the situation a lot, but alas, there is another leak. I'm not sure this one is linked to python-engineio, but hopefully someone can point me in the right direction if it is not. The issue is related to Python _DummyThread objects, which are used to represent threads created outside of Python's control. In the context of the above sample, such instances are created by writer_task.join() at engineio/socket.py:260 (in particular, but not only there). These _DummyThread instances should not be created, but something goes wrong somewhere in Python's threading.current_thread() and get_ident(). I can see this with eventlet but not with gevent. Maybe this has something to do with eventlet, but that's not obvious.
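The mechanism itself can be seen with plain CPython, no eventlet required: when current_thread() is called from a thread the threading module has no record of (here one started via the low-level _thread API), it synthesizes and registers a _DummyThread placeholder. A small demonstration:

```python
import _thread
import threading
import time

seen = {}

def raw_thread():
    # This thread was not created through the threading module, so
    # current_thread() has no entry for it and returns a _DummyThread.
    seen['type'] = type(threading.current_thread()).__name__

_thread.start_new_thread(raw_thread, ())

# Wait for the raw thread to run (simple polling; no join available).
deadline = time.time() + 5
while 'type' not in seen and time.time() < deadline:
    time.sleep(0.01)

print(seen['type'])  # _DummyThread
```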

I could also prevent this leak with a dirty workaround (skipping calls to threading.current_thread()), but there is still yet another leak (with both eventlet and gevent). I haven't started investigating this one yet.

Any help would be appreciated! And sorry again for the long post...

miguelgrinberg commented 1 year ago

You seem to be correct, it looks like some socket objects are not removed when there are a lot of concurrent reconnections. I've been testing with this patch, which appears to remove all lingering sockets:

diff --git a/src/engineio/server.py b/src/engineio/server.py
index 0fb1496..97d79e5 100644
--- a/src/engineio/server.py
+++ b/src/engineio/server.py
@@ -778,6 +778,11 @@ class Server(object):
                 for s in self.sockets.copy().values():
                     if not s.closing and not s.closed:
                         s.check_ping_timeout()
+                    elif s.closed:
+                        try:
+                            del self.sockets[s.sid]
+                        except KeyError:
+                            pass
                     if self.service_task_event.wait(timeout=sleep_interval):
                         raise KeyboardInterrupt()
             except (SystemExit, KeyboardInterrupt):

I need to test this more. I'm also not sure why some disconnections are caught by the timeout rather than by the disconnect that the browser sends (though this may be because the browser doesn't do this properly when it has to refresh too many pages at the same time). Let me know if this helps.

miguelgrinberg commented 1 year ago

@pierrehebert I committed a fix to the main branch. Can I ask you to retest on your side and report back? Thanks!

pierrehebert commented 1 year ago

Thank you for your fast reply and patch. Yes, this fixes one leak, but unfortunately there is another one (please see the last part of the bug report). Were you able to reproduce it?

Also, shouldn't there be a signal indicating that the socket is no longer valid? This might be a read/write exception or something else, but once the socket has been closed, whatever the reason, the OS should report it reliably to the app. When closing 50 tabs, all sockets are gone with no delay, as can be seen in /proc/<pid of the server>/fd for instance. I haven't looked at the code enough yet, but I'll see if I can find a better way to catch this event and provide a patch that doesn't rely on timeouts.

miguelgrinberg commented 1 year ago

Also, there should be a signal indicating that the socket is no more valid, isn't it ?

Low-level sockets do not have a notification system. If one side closes the socket, nothing happens on the other side, which will only find out about the closure the next time it tries to read or write on it.
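This is easy to demonstrate with a connected socket pair: closing one end raises nothing on the other, and the peer only observes the closure as an end-of-file on its next read:

```python
import socket

a, b = socket.socketpair()
a.close()            # the peer (b) receives no notification here
data = b.recv(1024)  # only now does b learn of the closure: recv returns b''
print(repr(data))    # b''
b.close()
```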

The JavaScript client adds a beforeunload event handler to the page, which is supposed to issue a proper close of the Socket.IO connection. This involves sending a CLOSE packet before the low-level socket is closed. The explicit close is what makes the server immediately realize the connection has been closed. This works well when a single tab is closed, but appears to be unreliable under the conditions you are testing with lots of tabs, probably due to the sticky activation requirement. So the server receives only some of these explicit closures. The remaining ones are detected through the ping/pong mechanism, which can take up to 25 seconds with default settings.

With respect to the _DummyThread issue. Who is calling current_thread()? If you are using eventlet, then you should be calling eventlet's own version of this function, you cannot mix threading functions from eventlet with those from Python as they are incompatible.

pierrehebert commented 1 year ago

current_thread() is called from writer_task.join() at engineio/socket.py:260. The Python modules are monkey-patched as needed, and there should be no mixing of the eventlet (or gevent) and Python versions. If there were such a mix, chances are this wouldn't work at all. I suspect either a bug in eventlet or improper use by the app, but this seems improbable, as it appears to be a fairly regular use of the feature, and I can also reproduce it with much older versions. Also, I'd like to insist on the fact that even with a workaround for this issue, there is still another leak somewhere :-(

miguelgrinberg commented 1 year ago

@pierrehebert monkey patching is not required and should not make any difference because the writer task is started using the eventlet/gevent implementations of the threading module, not the Python one. If eventlet and/or gevent call Python's current_thread() instead of their own, then I suppose it is their bug, but I would find it very strange that they do that.

miguelgrinberg commented 1 year ago

@pierrehebert I did some debugging on the _DummyThread problem. It definitely appears to be an issue with the eventlet implementation of the Thread class. The following patch creates a simplistic version of Thread implemented on top of the spawn() function, and this appears to avoid the creation of dummy threads:

diff --git a/src/engineio/async_drivers/eventlet.py b/src/engineio/async_drivers/eventlet.py
index f74f5af..86fa076 100644
--- a/src/engineio/async_drivers/eventlet.py
+++ b/src/engineio/async_drivers/eventlet.py
@@ -1,11 +1,23 @@
 from __future__ import absolute_import

-from eventlet.green.threading import Thread, Event
-from eventlet import queue
-from eventlet import sleep
+from eventlet.green.threading import Event
+from eventlet import queue, sleep, spawn
 from eventlet.websocket import WebSocketWSGI as _WebSocketWSGI

+class EventletThread:
+    def __init__(self, target, args=None, kwargs=None):
+        self.target = target
+        self.args = args or ()
+        self.kwargs = kwargs or {}
+
+    def start(self):
+        self.g = spawn(self.target, *self.args, **self.kwargs)
+
+    def join(self):
+        return self.g.wait()
+
+
 class WebSocketWSGI(_WebSocketWSGI):
     def __init__(self, handler, server):
         try:
@@ -26,7 +38,7 @@ class WebSocketWSGI(_WebSocketWSGI):

 _async = {
-    'thread': Thread,
+    'thread': EventletThread,
     'queue': queue.Queue,
     'queue_empty': queue.Empty,
     'event': Event,

Can I ask you to give this patch a try on your side and report back if you see improvements?