ross / requests-futures

Asynchronous Python HTTP Requests for Humans using Futures

ThreadPoolExecutor resource cleanup? #20

Closed boboli closed 9 years ago

boboli commented 9 years ago

When using FuturesSession for a long-running web scraper script, I've noticed a memory leak due to the fact that I wasn't cleaning up the ThreadPoolExecutors that were created by the many FuturesSession(max_workers=blah) calls I was making.

I fixed the issue by writing a contextmanager that cleaned up my executor when exiting:

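from contextlib import contextmanager

from requests_futures.sessions import FuturesSession

@contextmanager
def clean_futures_session_when_done(session):
    try:
        yield session
    finally:
        # Release the worker threads once the block exits.
        if session.executor:
            session.executor.shutdown()

with clean_futures_session_when_done(FuturesSession(max_workers=2)) as session:
    do_stuff()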

This feels a bit slimy since I'm using the internal(?) self.executor reference. I also realize that the shutdown() will block until all Futures are done, but I feel this is acceptable for many use cases.
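A minimal sketch of the blocking behavior mentioned above, using only the standard library. The `slow_square` function is a hypothetical stand-in for a slow request; nothing here touches FuturesSession itself.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    # Hypothetical stand-in for a slow HTTP request.
    time.sleep(0.05)
    return x * x


executor = ThreadPoolExecutor(max_workers=2)
futures = [executor.submit(slow_square, n) for n in range(4)]

# shutdown() defaults to wait=True, so it returns only after every
# pending future has finished and the worker threads have exited.
executor.shutdown()

all_done = all(f.done() for f in futures)
results = [f.result() for f in futures]
```

Because the call waits for queued work, results are not lost by shutting down early; the cost is that the caller blocks for as long as the slowest outstanding task.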

An alternative I've considered is having FuturesSession implement the context manager protocol with __enter__() and __exit__() so we can directly use it in a with statement. This would be similar to how open() works:

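from requests_futures.sessions import FuturesSession

class FuturesSessionWithCleanup(FuturesSession):
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs on normal exit and on exceptions alike.
        self.executor.shutdown()

with FuturesSessionWithCleanup(max_workers=2) as session:
    do_stuff()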
# block until all Futures are cleaned

Does this sound reasonable?

ross commented 9 years ago

interesting. that's not a use-case i've come across myself, so i hadn't thought about addressing it.

is there a specific reason you're creating a bunch of sessions rather than a single long-lived FuturesSession shared over the life of the script? unless each one needs to be its own distinct session (cookies etc.), you would probably be better off creating a single FuturesSession with a larger max_workers and just letting it live for the life of the script.

i'm not completely opposed to FuturesSession implementing the context manager protocol, just want to make sure that it's needed first.

boboli commented 9 years ago

I think cleaning up the thread pool resources is akin to calling .close() on file objects when we're done with them. And open() follows the context manager pattern to give you a convenient wrapper that automatically calls the .close(), so that's where I got the idea from.

I agree that for my script it's better to use just a single FuturesSession, but I feel it's good practice to clean up resources regardless.

ross commented 9 years ago

feel free to pr that change to FuturesSession and ideally provide an example in the README. i assume the example should catch and use the session.

with FuturesSession(max_workers=2) as session:
    session...

ideally there'd be some sort of unit testing of the functionality. perhaps there's a way to tell if the executor has been shut down correctly.

perpetual-hydrofoil commented 9 years ago

+1. Seems useful when chunking large numbers of requests (I'm doing 5000 per FuturesSession)
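A rough sketch of that chunking pattern, assuming one pool per chunk. It uses the standard library's ThreadPoolExecutor (which already supports `with`) in place of FuturesSession, and `chunks` and `fake_fetch` are hypothetical helpers made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def chunks(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fake_fetch(n):
    # Hypothetical placeholder for an HTTP request.
    return n * 2


work = list(range(10))
results = []
for chunk in chunks(work, 4):
    # Leaving the with block calls shutdown(), releasing this chunk's threads
    # before the next pool is created.
    with ThreadPoolExecutor(max_workers=2) as executor:
        results.extend(executor.map(fake_fetch, chunk))
```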

ross commented 9 years ago

happy to accept patches w/tests. otherwise i'll try and get to it in an upcoming weekend.

boboli commented 9 years ago

Heh I was dragging my feet on the PR because of the difficulty of writing a proper unit test. I've investigated the concurrent.futures module, and there are only 2 ways I can think of to determine if the executor has been shut down:

  1. Inspect executor._shutdown which is a private field on ThreadPoolExecutor (https://hg.python.org/cpython/file/3.2/Lib/concurrent/futures/thread.py#l125). Feels really icky to rely on private API.
  2. Rely on the documented fact that a RuntimeError will be raised if we try to use the ThreadPoolExecutor again (https://docs.python.org/3.2/library/concurrent.futures.html#concurrent.futures.Executor.shutdown): "Calls to Executor.submit() and Executor.map() made after shutdown will raise RuntimeError."

Option 2 sounds slightly more proper but still icky in that it's not directly asserting what we intended, but a side effect.
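Option 2 could look roughly like the sketch below. The `is_shut_down` helper is a hypothetical name, not part of any library, and the probe has a side effect: on a live executor it submits a no-op task.

```python
from concurrent.futures import ThreadPoolExecutor


def is_shut_down(executor):
    """Return True if submitting new work raises RuntimeError."""
    try:
        future = executor.submit(lambda: None)
    except RuntimeError:
        # Documented behavior for submit() after shutdown().
        return True
    future.result()  # let the no-op probe finish
    return False


executor = ThreadPoolExecutor(max_workers=1)
before = is_shut_down(executor)
executor.shutdown()
after = is_shut_down(executor)
```

In a unit test this would reduce to asserting that `submit()` raises RuntimeError once the `with` block has exited.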

Lemme know which option sounds better and I can try to do a PR with it.

ross commented 9 years ago

another option might be to monkey patch executor.shutdown in the unit test and replace it with something that sets a flag and calls the original.

or slightly cleaner, inherit from FuturesSession, override __exit__, and set a flag there that can be checked.
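Both ideas might be sketched as follows. `StandInSession` is a hypothetical stand-in that only owns an executor and implements the context manager protocol; it is an assumption for illustration, not the real FuturesSession.

```python
from concurrent.futures import ThreadPoolExecutor


class StandInSession:
    """Hypothetical stand-in for a session that owns a ThreadPoolExecutor."""

    def __init__(self, max_workers=2):
        self.executor = ThreadPoolExecutor(max_workers=max_workers)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.executor.shutdown()


# Approach 1: wrap executor.shutdown on the instance, record the call,
# then delegate to the original bound method.
session = StandInSession()
original_shutdown = session.executor.shutdown
calls = []


def recording_shutdown(*args, **kwargs):
    calls.append((args, kwargs))
    return original_shutdown(*args, **kwargs)


session.executor.shutdown = recording_shutdown
with session as s:
    value = s.executor.submit(pow, 2, 5).result()


# Approach 2: subclass and override __exit__ to set a flag.
class FlaggedSession(StandInSession):
    exited = False

    def __exit__(self, exc_type, exc_value, traceback):
        self.exited = True
        return super().__exit__(exc_type, exc_value, traceback)


with FlaggedSession() as flagged:
    pass
```

The first approach checks that the executor's `shutdown` was actually invoked; the second only checks that `__exit__` ran, which is the weaker but simpler assertion.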

definitely a tough thing to test, that it shut down as designed. i guess the most important part to test is that the object functions correctly in the with context. that it calls __exit__ is nice to test, but not critical.