Open 06547701-06a2-4e4a-b672-16a33475101e opened 9 years ago
Executor.filter
is to filter
what Executor.map
is to map
.
See Python-ideas thread:
https://groups.google.com/forum/#!topic/python-ideas/EBOC5YCWPyo
Patch attached. I don't know how to run the Python test suite (I'm guessing it involves building Python and I don't know how to do that.) Please tell me whether the tests I wrote pass.
I'm guessing that other than that, the main thing missing in my patch is documentation. If there's agreement that this feature is desirable and the implementation is good, I'll be happy to write the docs. Let me know.
Also, I notice Python 3.5 feature freeze is a bit over a week away, and I hope we can get that in so it could go in Python 3.5. (Assuming it goes in at all.)
You should add docs for the new method, as well.
Updated the tests (had to use real defs, not lambdas, and the expected results for filter_exception weren't right).
Tests pass.
Get some docs written! :)
(More reviews would also be good. ;)
This feature seems unnecessary to me but couldn't the implementation be simplified to work in terms of map? i.e. (pseudocode):
def filter(self, fn, iterable, timeout=None):
l = list(iterable)
return (item for (item, keep) in zip(l, self.map(fn, l, timeout)) if keep)
Patch with documentation attached. (I don't know how to concatenate patches, so 2.patch contains only the documentation, while 1.patch has the implementation and the tests (but Ethan's patch is better.))
Brian, regarding your simpler implementation based on Executor.map
: It looks good to me, but I'm not sure if it passes the tests. If it does then I don't mind if that's the implementation that would be used.
Just as a note - to test a pure Pthon patch like this, you can apply the patch to your installed Python 3.4 installation, and just run the test using that. There should be no need to build your own Python.
python C:\Apps\Python34\Lib\test\test_concurrent_futures.py test_cancel (main.FutureTests) ... ok test_cancelled (main.FutureTests) ... ok test_done (main.FutureTests) ... ok test_done_callback_already_cancelled (main.FutureTests) ... ok test_done_callback_already_failed (main.FutureTests) ... ok test_done_callback_already_successful (main.FutureTests) ... ok [... etc]
This feature looks unnecessary to me as well. Adding features has a non-zero cost in maintenance.
Short History: \=============
(Ram Rachum)
What do you think about adding a method: Executor.filter
?
I was using something like this:
my_things = [thing for thing in things if some_condition(thing)]
But the problem was that some_condition
took a long time to run waiting on I/O, which is a great candidate for parallelizing with ThreadPoolExecutor. I made it work using Executor.map
and some improvizing, but it would be nicer if I could do:
with concurrent.futures.ThreadPoolExecutor(100) as executor:
my_things = executor.filter(some_condition, things)
And have the condition run in parallel on all the threads.
(Nick Coughlan) I think this is sufficiently tricky to get right that it's worth adding filter() as a parallel to the existing map() API.
(Nick Coughlan)
Nick Coghlan isn't currently on this issue. Unless you were talking about another, separate Nick Coughlan that I don't know of? :)
But to answer your argument:
I think this is sufficiently tricky to get right that it's worth adding filter() as a parallel to the existing map() API.
"Tricky to get right" is not a good criterion. The question is whether it's useful or not. Only the OP has had that need AFAIK.
That comment was from an email by Nick, not to Nick. But now I've added him. ;)
filter() usage has always been less common than map() usage, and we see a similar pattern in comprehension usage as well (i.e. [f(x) for x in y] is a far more common construct than [x for x in p(y)]). That "less common" status doesn't keep us from providing filter() as builtin, or syntactic support for filtering in the comprehension syntax.
As a result, the main question I'd like to see a clear and authoritative answer to is "Given 'seq2 = filter(p, seq)' or 'seq2 = [x for seq if p(x)]', what's the concurrent.futures based parallel execution syntax in cases where the filtering key is expensive to calculate?"
I'd be quite OK with Brian's 2-line implementation going into the concurrent.futures documentation as a filtering recipe, similar to the way Raymond uses the recipes in the itertools documentation to help minimise complexity growth in the core API.
I *don't* mind if Brian's judgement is that it doesn't rise to the level of being worth including as a core feature in its own right, as I agree that the typical case of filtering functions is that they're fast, and when they're not, it's often a sign that data model denormalisation may be desirable in order to cache the relevant derived property.
Folks could probably guess what I meant, but both filter comprehensions in my comment should have read: "[x for x in seq if p(x)]"
Okay, let me understand what the opinions of people are about this feature.
Ram Rachum: +1 Antoine Pitrou: -1 Nick Coghlan: Is that a +1 or a +0? Ethan and Paul, you've participated but I didn't get your opinion on whether you want to see this feature go in.
If there are more -1s than +1s we can close this ticket.
I'm not a heavy user of concurrent.futures, so I don't have a strong opinion. I've never felt a need for this function, though, so I guess I'm -0. A recipe in the docs would be good, though.
I'm now +1 on a recipe in the docs, -0 on a concrete method on Executor that delegates to the map() method.
Looks like this a recipe in the docs is where this ticket is headed.
I took my original example for Executor.filter
and changed it using Brian's suggestion so now it uses Executor.map
. Please check it out. If someone other than me feels comfortable in putting it into the documentation page, I'll be happy if you could do that.
Possible issues:
This example uses requests
, which isn't in the standard library.
I would say that this for
line:
for person in (person for person, filter_result in
zip(people, executor.map(has_wikipedia_page, people))
if filter_result):
Is a bit ugly, but since the consensus here is that this is better than implementing Executor.filter
, so be it.
I'd rather see an Executor.filter()
myself, but it's Brian Quinlan's call.
Ethan: I'm -0 so I'd happily go with the consensus. Does anyone have a strong opinion?
Ram: What did you mean by "Possible issues"? Did you mean that to be an example using the executor.map() technique?
I vary between +0 and -0 for the addition of the concrete method.
When I'm at +0, the main rationale is that we *don't* have the "Where do we stop?" risk here that itertools faces, as we're just replicating the synchronous builtins.
When I'm at -0, the main rationale is that a recipe works with *any* version of Python that provides concurrent.futures (including any version of the PyPI backport), and is hence useful immediately, while a method would only work the version where we added it.
It looks like we are pretty much neutral between the +1's and -1's.
Antoine seems to be opposed on general principles against bloat, while I am for it on general principles of completeness.
The recipe could still go in the docs for people to use on previous versions of Python.
If I was a module maintainer, and the other maintainers were not opposed, I would add it.
(So, Brian, care to have a co-maintainer? ;)
I would like to tip the balance in favor of the -1. Antoine is right about this one.
Raymond: Thank you. So the discussion is back on adding a recipe to the docs.
Brian: When I said "possible issues", I followed that with a couple of issues with the example I uploaded (filter_example.py).
A problem I just realized with Brian's 2-line implementation of filter
: It doesn't work for iterables which aren't sequences, since it attempts to exhaust the iterable twice. So if you have a non-sequence you'll have to make it into a sequence first.
Actually it immediately converts the iterable into a list. Recall:
def filter(self, fn, iterable, timeout=None):
l = list(iterable) # iterable => list
return (item for (item, keep) in zip(l, self.map(fn, l, timeout)) if keep)
The recipe will work in simple situations.
In more complex situations:
the recipe will either have a severe impact on performance or fail entirely.
Ethan: That's what I also thought, but then I figured that you're creating N futures anyway with an item attached to each, so it's on the same scale as creating a list, no?
The recipe creates a list before it ever starts processing, while Executor.filter() starts processing with the first item.
Right, so it might be garbage-collecting the items before the futures are done being created, so the whole list wouldn't need to be saved in memory. I understand now.
Brian, given my comments in msg245016 are you willing to add the Executor.filter() function?
Hey Ethan, I'm really sorry about dropping the ball on this. I've been burnt out on Python stuff for the last couple of years.
When we left this, it looked like the -1s were in the majority and no one new has jumped on to support filter
.
If you wanted to add this, I wouldn't object. But I've been inactive so long that I don't think that I should make the decision.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = None closed_at = None created_at =
labels = ['type-feature', 'docs']
title = 'Add `Executor.filter` to concurrent.futures'
updated_at =
user = 'https://github.com/cool-RR'
```
bugs.python.org fields:
```python
activity =
actor = 'bquinlan'
assignee = 'docs@python'
closed = False
closed_date = None
closer = None
components = ['Documentation']
creation =
creator = 'cool-RR'
dependencies = []
files = ['39373', '39377', '39381', '39417']
hgrepos = []
issue_num = 24195
keywords = ['patch']
message_count = 32.0
messages = ['243217', '243218', '243221', '243240', '243242', '243254', '243256', '243260', '243272', '243274', '243276', '243281', '243296', '243298', '243388', '243390', '243436', '243463', '243493', '243871', '243886', '243888', '243890', '243891', '244914', '245004', '245016', '245017', '245025', '245026', '246029', '341635']
nosy_count = 9.0
nosy_names = ['rhettinger', 'paul.moore', 'bquinlan', 'ncoghlan', 'pitrou', 'jnoller', 'cool-RR', 'docs@python', 'sbt']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue24195'
versions = ['Python 3.6']
```