pytoolz / toolz

A functional standard library for Python.
http://toolz.readthedocs.org/

eager map & filter? #483

Open bijoythomas opened 4 years ago

bijoythomas commented 4 years ago

Hello, I'm new to toolz and am trying out the functions in the curried namespace. The code below

from toolz.curried import *
is_even = lambda n: n % 2 == 0
inc = lambda n: n + 1
compose(
    map(inc),
    filter(is_even)
)([1,2,3,4])

returns a map object instead of a list (which I was expecting). However,

compose(
    groupby(lambda n: "A" if n < 2 else "B"),
    map(lambda n: n + 1),
    filter(lambda n: n % 2 == 0)
)([1,2,3,4])

returns a dict with list values, as expected, instead of a dict with sub-iterators (like itertools.groupby).

Is there a reason for keeping the curried map & filter lazy like the native Python3 functions?

groutr commented 4 years ago

toolz.groupby and itertools.groupby are not equivalent functions. itertools.groupby creates a new group every time the key function changes value. This effectively requires the input iterator to be sorted by the key function. toolz.groupby makes no such assumption. This is the reason why itertools.groupby is lazy and toolz.groupby is not.
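The difference can be sketched with the standard library alone. Note that group_all below is a hypothetical stand-in for the dict-building behaviour described here, not the actual toolz implementation:

```python
from collections import defaultdict
from itertools import groupby as it_groupby


def group_all(key, seq):
    """Hypothetical stand-in: collect every item under its key in one eager pass."""
    groups = defaultdict(list)
    for item in seq:
        groups[key(item)].append(item)
    return dict(groups)


is_even = lambda n: n % 2 == 0
data = [1, 2, 3, 4]

# itertools.groupby starts a new group whenever the key changes,
# so unsorted input produces the same key more than once.
runs = [(k, list(g)) for k, g in it_groupby(data, key=is_even)]

# The dict-based version needs no sorting and holds plain lists.
grouped = group_all(is_even, data)
```

Because every item has to be seen before any group is complete, the dict-based approach cannot hand back results lazily.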

map and filter have always been lazy in toolz. When toolz supported Python 2, map was an alias for itertools.imap and filter was an alias for itertools.ifilter. In Python 3, they are simply their respective builtin functions.
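A minimal sketch of that laziness, using only builtins. functools.partial is used here as an assumed approximation of the curried forms; it is not how toolz.curried is implemented:

```python
from functools import partial

inc = lambda n: n + 1
is_even = lambda n: n % 2 == 0

# partial(map, inc) roughly plays the role of toolz.curried.map(inc) here.
map_inc = partial(map, inc)
keep_even = partial(filter, is_even)

# Filter first, then map: the same order compose() applies them in.
lazy = map_inc(keep_even([1, 2, 3, 4]))
is_lazy = isinstance(lazy, map)  # a map object; nothing is computed yet

result = list(lazy)    # forces evaluation
leftover = list(lazy)  # the iterator is now exhausted
```

Wrapping the pipeline in list is enough to see the values, but the lazy object can only be walked once.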

eriknw commented 4 years ago

Good questions @bijoythomas and thanks for the quick, informative reply @groutr. I always like to hear experiences of new users. Since the questions have been answered, can we close this issue?

Btw, we have considered having a non-lazy namespace so one could do things like toolz.eager.map(func, data). I'm open to this idea. When teaching, learning, or exploring, it can be helpful to effortlessly see the data instead of a lazy object. One challenge is how to have a curried, eager namespace? Would it be toolz.eager.curried, toolz.curried.eager, both, or something else?
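To make the idea concrete, here is a rough sketch of what eager variants could look like. Everything named below (eager, eager_map, eager_filter) is hypothetical; no such namespace exists in toolz, and functools.partial only stands in for currying:

```python
from functools import partial


def eager(func):
    """Hypothetical helper: wrap a lazy function so it returns a list."""
    def wrapper(*args, **kwargs):
        return list(func(*args, **kwargs))
    return wrapper


# Hypothetical eager counterparts of the builtins.
eager_map = eager(map)
eager_filter = eager(filter)

squares = eager_map(lambda n: n * n, [1, 2, 3])
evens = eager_filter(lambda n: n % 2 == 0, [1, 2, 3, 4])

# A "curried, eager" flavour, approximated with partial application.
inc_all = partial(eager_map, lambda n: n + 1)
incremented = inc_all([1, 2, 3])
```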

startakovsky commented 3 years ago

Agree. I had these above questions myself. Good to know.

One thing I'd say is that since this library no longer differentiates map and filter from the builtins, the docs should not show them being imported from the itertoolz library or any other library. Seeing from toolz import map created some confusion while reading the documentation.

ruancomelli commented 3 years ago

@startakovsky there is a difference between the built-in map and toolz.curried.map since the second one is, of course, curried.

@eriknw I would suggest keeping everything lazy and just adding a consumer function that enforces eager evaluation. For instance, to eagerly evaluate a map object, you can just build a list out of it: map(f, it) is eagerly consumed by list(map(f, it)). This is usually what I do if I wish to retain the computed values. If the values are not important and can be safely discarded, I usually resort to more_itertools.consume, which is a lot faster and doesn't store anything. I think that this is also related to #445.

This way, there would be no need for an eager namespace. Everything would be lazy by default, and if you want eager evaluation you either build a list (if you wish to keep the values) or consume your iterable.
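Both options can be sketched with the standard library. The consume function below follows the well-known itertools recipe (a zero-length deque) and is written here as an assumption-level substitute for more_itertools.consume:

```python
from collections import deque


def consume(iterable):
    """Exhaust an iterable and discard its values (itertools-recipe style)."""
    deque(iterable, maxlen=0)


# Keep the values: build a list.
kept = list(map(lambda n: n * 2, [1, 2, 3]))

# Discard the values: only the side effects matter.
seen = []
lazy = map(seen.append, [1, 2, 3])
before = list(seen)  # still empty, the map has not run
consume(lazy)
after = list(seen)   # side effects have now happened
```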

mentalisttraceur commented 1 year ago

@ruancomelli note that toolz.last is already basically consume. (The only difference is that it returns the last value, whereas I'd expect consume to return nothing. Internally, the current implementation builds on tail and thus stores a deque, but that O(1) overhead is modest and presents a lower bar of difficulty for being automatically optimized away by the Python implementation compared to the O(n) of list.)
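A sketch of that idea, using a one-element deque. The name last_item is hypothetical and this is not the toolz source, only an illustration of exhausting an iterable while holding O(1) state:

```python
from collections import deque


def last_item(iterable):
    """Hypothetical stand-in: exhaust the iterable, keep only the final value."""
    window = deque(iterable, maxlen=1)  # never holds more than one item
    if not window:
        raise IndexError("empty iterable")
    return window[0]


seen = []

def record(n):
    """Side-effecting function so we can observe that the map really ran."""
    seen.append(n)
    return n * 10

final = last_item(map(record, [1, 2, 3]))
```

The whole map is driven to completion, yet only the final result is ever stored.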

mentalisttraceur commented 1 year ago

@eriknw I think the answer to curried.eager vs eager.curried is that curried is a higher-level/more-general operation (curry(f) makes sense for any f, even if f is eager.f, but eagerness is much more specific to just iteration functions), so it should go first: toolz.curried.eager.

"Both" might also make sense as a user/developer improvement. On the other hand, having only one is more Pythonic ("there should only be one way to do it"), it kinda teaches the outer-name-scope-should-be-more-general pattern by example, and there's no breaking change in starting with just one and switching to both later if it proves to be a usability problem.