python / cpython

The Python programming language
https://www.python.org
Other
63.16k stars 30.24k forks source link

Add `Executor.filter` to concurrent.futures #68383

Open 06547701-06a2-4e4a-b672-16a33475101e opened 9 years ago

06547701-06a2-4e4a-b672-16a33475101e commented 9 years ago
BPO 24195
Nosy @rhettinger, @pfmoore, @brianquinlan, @ncoghlan, @pitrou, @cool-RR
Files
  • 1.patch
  • issue24195.stoneleaf.01.patch
  • 2.patch
  • filter_example.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['type-feature', 'docs'] title = 'Add `Executor.filter` to concurrent.futures' updated_at = user = 'https://github.com/cool-RR' ``` bugs.python.org fields: ```python activity = actor = 'bquinlan' assignee = 'docs@python' closed = False closed_date = None closer = None components = ['Documentation'] creation = creator = 'cool-RR' dependencies = [] files = ['39373', '39377', '39381', '39417'] hgrepos = [] issue_num = 24195 keywords = ['patch'] message_count = 32.0 messages = ['243217', '243218', '243221', '243240', '243242', '243254', '243256', '243260', '243272', '243274', '243276', '243281', '243296', '243298', '243388', '243390', '243436', '243463', '243493', '243871', '243886', '243888', '243890', '243891', '244914', '245004', '245016', '245017', '245025', '245026', '246029', '341635'] nosy_count = 9.0 nosy_names = ['rhettinger', 'paul.moore', 'bquinlan', 'ncoghlan', 'pitrou', 'jnoller', 'cool-RR', 'docs@python', 'sbt'] pr_nums = [] priority = 'normal' resolution = None stage = None status = 'open' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue24195' versions = ['Python 3.6'] ```

    06547701-06a2-4e4a-b672-16a33475101e commented 9 years ago

    Executor.filter is to filter what Executor.map is to map.

    See Python-ideas thread:

    https://groups.google.com/forum/#!topic/python-ideas/EBOC5YCWPyo

    Patch attached. I don't know how to run the Python test suite (I'm guessing it involves building Python and I don't know how to do that.) Please tell me whether the tests I wrote pass.

    I'm guessing that other than that, the main thing missing in my patch is documentation. If there's agreement that this feature is desirable and the implementation is good, I'll be happy to write the docs. Let me know.

    06547701-06a2-4e4a-b672-16a33475101e commented 9 years ago

    Also, I notice Python 3.5 feature freeze is a bit over a week away, and I hope we can get that in so it could go in Python 3.5. (Assuming it goes in at all.)

    pfmoore commented 9 years ago

    You should add docs for the new method, as well.

    ethanfurman commented 9 years ago

    Updated the tests (had to use real defs, not lambdas, and the expected results for filter_exception weren't right).

    Tests pass.

    Get some docs written! :)

    (More reviews would also be good. ;)

    brianquinlan commented 9 years ago

    This feature seems unnecessary to me but couldn't the implementation be simplified to work in terms of map? i.e. (pseudocode):

    def filter(self, fn, iterable, timeout=None):
      l = list(iterable)
      return (item for (item, keep) in zip(l, self.map(fn, l, timeout)) if keep)
    06547701-06a2-4e4a-b672-16a33475101e commented 9 years ago

    Patch with documentation attached. (I don't know how to concatenate patches, so 2.patch contains only the documentation, while 1.patch has the implementation and the tests (but Ethan's patch is better.))

    Brian, regarding your simpler implementation based on Executor.map: It looks good to me, but I'm not sure if it passes the tests. If it does then I don't mind if that's the implementation that would be used.

    pfmoore commented 9 years ago

    Just as a note - to test a pure Pthon patch like this, you can apply the patch to your installed Python 3.4 installation, and just run the test using that. There should be no need to build your own Python.

    python C:\Apps\Python34\Lib\test\test_concurrent_futures.py test_cancel (main.FutureTests) ... ok test_cancelled (main.FutureTests) ... ok test_done (main.FutureTests) ... ok test_done_callback_already_cancelled (main.FutureTests) ... ok test_done_callback_already_failed (main.FutureTests) ... ok test_done_callback_already_successful (main.FutureTests) ... ok [... etc]

    pitrou commented 9 years ago

    This feature looks unnecessary to me as well. Adding features has a non-zero cost in maintenance.

    ethanfurman commented 9 years ago

    Short History: \=============

    (Ram Rachum) What do you think about adding a method: Executor.filter? I was using something like this:

        my_things = [thing for thing in things if some_condition(thing)]

    But the problem was that some_condition took a long time to run waiting on I/O, which is a great candidate for parallelizing with ThreadPoolExecutor. I made it work using Executor.map and some improvizing, but it would be nicer if I could do:

        with concurrent.futures.ThreadPoolExecutor(100) as executor:
            my_things = executor.filter(some_condition, things)

    And have the condition run in parallel on all the threads.

    (Nick Coughlan) I think this is sufficiently tricky to get right that it's worth adding filter() as a parallel to the existing map() API.

    pitrou commented 9 years ago

    (Nick Coughlan)

    Nick Coghlan isn't currently on this issue. Unless you were talking about another, separate Nick Coughlan that I don't know of? :)

    pitrou commented 9 years ago

    But to answer your argument:

    I think this is sufficiently tricky to get right that it's worth adding filter() as a parallel to the existing map() API.

    "Tricky to get right" is not a good criterion. The question is whether it's useful or not. Only the OP has had that need AFAIK.

    ethanfurman commented 9 years ago

    That comment was from an email by Nick, not to Nick. But now I've added him. ;)

    ncoghlan commented 9 years ago

    filter() usage has always been less common than map() usage, and we see a similar pattern in comprehension usage as well (i.e. [f(x) for x in y] is a far more common construct than [x for x in p(y)]). That "less common" status doesn't keep us from providing filter() as builtin, or syntactic support for filtering in the comprehension syntax.

    As a result, the main question I'd like to see a clear and authoritative answer to is "Given 'seq2 = filter(p, seq)' or 'seq2 = [x for seq if p(x)]', what's the concurrent.futures based parallel execution syntax in cases where the filtering key is expensive to calculate?"

    I'd be quite OK with Brian's 2-line implementation going into the concurrent.futures documentation as a filtering recipe, similar to the way Raymond uses the recipes in the itertools documentation to help minimise complexity growth in the core API.

    I *don't* mind if Brian's judgement is that it doesn't rise to the level of being worth including as a core feature in its own right, as I agree that the typical case of filtering functions is that they're fast, and when they're not, it's often a sign that data model denormalisation may be desirable in order to cache the relevant derived property.

    ncoghlan commented 9 years ago

    Folks could probably guess what I meant, but both filter comprehensions in my comment should have read: "[x for x in seq if p(x)]"

    06547701-06a2-4e4a-b672-16a33475101e commented 9 years ago

    Okay, let me understand what the opinions of people are about this feature.

    Ram Rachum: +1 Antoine Pitrou: -1 Nick Coghlan: Is that a +1 or a +0? Ethan and Paul, you've participated but I didn't get your opinion on whether you want to see this feature go in.

    If there are more -1s than +1s we can close this ticket.

    pfmoore commented 9 years ago

    I'm not a heavy user of concurrent.futures, so I don't have a strong opinion. I've never felt a need for this function, though, so I guess I'm -0. A recipe in the docs would be good, though.

    ncoghlan commented 9 years ago

    I'm now +1 on a recipe in the docs, -0 on a concrete method on Executor that delegates to the map() method.

    06547701-06a2-4e4a-b672-16a33475101e commented 9 years ago

    Looks like this a recipe in the docs is where this ticket is headed.

    I took my original example for Executor.filter and changed it using Brian's suggestion so now it uses Executor.map. Please check it out. If someone other than me feels comfortable in putting it into the documentation page, I'll be happy if you could do that.

    Possible issues:

    This example uses requests, which isn't in the standard library.

    I would say that this for line:

        for person in (person for person, filter_result in
                       zip(people, executor.map(has_wikipedia_page, people))
                       if filter_result):

    Is a bit ugly, but since the consensus here is that this is better than implementing Executor.filter, so be it.

    ethanfurman commented 9 years ago

    I'd rather see an Executor.filter() myself, but it's Brian Quinlan's call.

    brianquinlan commented 9 years ago

    Ethan: I'm -0 so I'd happily go with the consensus. Does anyone have a strong opinion?

    Ram: What did you mean by "Possible issues"? Did you mean that to be an example using the executor.map() technique?

    ncoghlan commented 9 years ago

    I vary between +0 and -0 for the addition of the concrete method.

    When I'm at +0, the main rationale is that we *don't* have the "Where do we stop?" risk here that itertools faces, as we're just replicating the synchronous builtins.

    When I'm at -0, the main rationale is that a recipe works with *any* version of Python that provides concurrent.futures (including any version of the PyPI backport), and is hence useful immediately, while a method would only work the version where we added it.

    ethanfurman commented 9 years ago

    It looks like we are pretty much neutral between the +1's and -1's.

    Antoine seems to be opposed on general principles against bloat, while I am for it on general principles of completeness.

    The recipe could still go in the docs for people to use on previous versions of Python.

    If I was a module maintainer, and the other maintainers were not opposed, I would add it.

    (So, Brian, care to have a co-maintainer? ;)

    rhettinger commented 9 years ago

    I would like to tip the balance in favor of the -1. Antoine is right about this one.

    06547701-06a2-4e4a-b672-16a33475101e commented 9 years ago

    Raymond: Thank you. So the discussion is back on adding a recipe to the docs.

    Brian: When I said "possible issues", I followed that with a couple of issues with the example I uploaded (filter_example.py).

    06547701-06a2-4e4a-b672-16a33475101e commented 9 years ago

    A problem I just realized with Brian's 2-line implementation of filter: It doesn't work for iterables which aren't sequences, since it attempts to exhaust the iterable twice. So if you have a non-sequence you'll have to make it into a sequence first.

    brianquinlan commented 9 years ago

    Actually it immediately converts the iterable into a list. Recall:

    def filter(self, fn, iterable, timeout=None):
      l = list(iterable)  # iterable => list
      return (item for (item, keep) in zip(l, self.map(fn, l, timeout)) if keep)
    ethanfurman commented 9 years ago

    The recipe will work in simple situations.

    In more complex situations:

    the recipe will either have a severe impact on performance or fail entirely.

    06547701-06a2-4e4a-b672-16a33475101e commented 9 years ago

    Ethan: That's what I also thought, but then I figured that you're creating N futures anyway with an item attached to each, so it's on the same scale as creating a list, no?

    ethanfurman commented 9 years ago

    The recipe creates a list before it ever starts processing, while Executor.filter() starts processing with the first item.

    06547701-06a2-4e4a-b672-16a33475101e commented 9 years ago

    Right, so it might be garbage-collecting the items before the futures are done being created, so the whole list wouldn't need to be saved in memory. I understand now.

    ethanfurman commented 9 years ago

    Brian, given my comments in msg245016 are you willing to add the Executor.filter() function?

    brianquinlan commented 5 years ago

    Hey Ethan, I'm really sorry about dropping the ball on this. I've been burnt out on Python stuff for the last couple of years.

    When we left this, it looked like the -1s were in the majority and no one new has jumped on to support filter.

    If you wanted to add this, I wouldn't object. But I've been inactive so long that I don't think that I should make the decision.