twisted / treq

Python requests like API built on top of Twisted's HTTP client.
Other
587 stars 140 forks source link

memory leak with each SSL web request #380

Open akrherz opened 9 months ago

akrherz commented 9 months ago

I'm attempting to diagnose a memory leak with each treq.get call on a HTTPS website leaking small amounts of memory. This is my simple reproducer.

import treq
from twisted.internet.defer import inlineCallbacks
from twisted.internet import reactor

@inlineCallbacks
def main():
    for i in range(1000):
        # author owns this website, so hitting it is fine :)
        _req = yield treq.get('https://mesonet.agron.iastate.edu/robots.txt')
    reactor.stop()

reactor.callLater(0, main)
reactor.run()

running this with memray seems to indicate a SSL verify_paths leak in OpenSSL/SSL.py set_result = _lib.SSL_CTX_set_default_verify_paths(self._context)

    def set_default_verify_paths(self):
        """
        Specify that the platform provided CA certificates are to be used for
        verification purposes. This method has some caveats related to the
        binary wheels that cryptography (pyOpenSSL's primary dependency) ships:

        *   macOS will only load certificates using this method if the user has
            the ``openssl@1.1`` `Homebrew <https://brew.sh>`_ formula installed
            in the default location.
        *   Windows will not work.
        *   manylinux cryptography wheels will work on most common Linux
            distributions in pyOpenSSL 17.1.0 and above.  pyOpenSSL detects the
            manylinux wheel and attempts to load roots via a fallback path.

        :return: None
        """
        # SSL_CTX_set_default_verify_paths will attempt to load certs from
        # both a cafile and capath that are set at compile time. However,
        # it will first check environment variables and, if present, load
        # those paths instead
        set_result = _lib.SSL_CTX_set_default_verify_paths(self._context)

Is there a means to workaround this within treq? Perhaps create some context that can be reused throughout the lifetime of the running app?

glyph commented 9 months ago

@akrherz thanks for debugging this down to the root!

The problem with recycling the context is that it contains all the information associated with the connection like, for example, the hostname that we are connecting to. So sharing them between hosts, at least, is not practically possible.

However, with some re-plumbing within Twisted, we could store additional connection-specific data such as the hostname in connection.set_app_data rather than just storing the TLSMemoryBIOProtocol in there.

In any case, this issue would probably reproduce just fine with just twisted's Agent, so I think this issue should be reopened there instead. We can leave this open and linked for tracking though, since it definitely affects treq's behavior.

akrherz commented 9 months ago

Thanks @glyph for chiming in and all your work on twisted over the years! I wonder if python/cpython#84904 is related? Will boggle this some more. I also have openssl 3.2.0 in play here on conda-forge, so lots of moving parts ;)

glyph commented 9 months ago

Thanks @glyph for chiming in and all your work on twisted over the years!

You're welcome! Kind of you to say so.

I wonder if python/cpython#84904 is related?

Seems likely that the actual OpenSSL function is the one that's leaking, in which case it would affect both in the same way?

Will boggle this some more. I also have openssl 3.2.0 in play here on conda-forge, so lots of moving parts ;)

OK, let me know what you discover.