pytoolz / toolz

A functional standard library for Python.
http://toolz.readthedocs.org/
Other
4.7k stars 263 forks source link

`get` on iterators #97

Open eriknw opened 11 years ago

eriknw commented 11 years ago

It was brought up in #93 that get didn't work on iterators, but it could be made to do so. This would act as a generalized nth that could accept a list of indices to fetch. I thought this was interesting, so I developed the following:

def iterget(ind, seq):
    indices = sorted(zip(ind, itertools.count()))
    results = []
    val = next(seq)
    prev_index = 0
    for index, count in indices:
        if index != prev_index:
            n = index - prev_index - 1
            val = next(itertools.islice(seq, n, n + 1))
            prev_index = index
        results.append((count, val))
    return tuple(item for _, item in sorted(results))

and some output:

>>> iterget([1, 3, 5, 2, 3], iter(range(100, 200)))
(101, 103, 105, 102, 103)

>>> iterget(range(10) + range(9, -1, -1), iter(range(10)))
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)

>>> iterget([100000000, 1, 100000000, 5], iter(xrange(1000000000)))
(100000000, 1, 100000000, 5)

>>> iterget([100000000000000, 1, 5], iter(xrange(1000000000000000)))
# will need to wait a long, long time, but at least the memory footprint is low!

I don't know if this is the best or cleanest way to do it, and it currently doesn't accept default arguments, but I thought this was neat enough to share.

So, should we allow get to work on iterators? Should there instead be a separate function for getting multiple items from an iterator?

eriknw commented 11 years ago

Accepts a default value as per get, but it will still fail if indices are not non-negative integers:

no_default = '__no__default__'
def iterget(ind, seq, default=no_default):
    indices = sorted(zip(ind, itertools.count()))
    results = [default] * len(indices)
    try:
        val = next(seq)
        prev_index = 0
        for index, count in indices:
            if index != prev_index:
                n = index - prev_index - 1
                val = next(itertools.islice(seq, n, n + 1))
                prev_index = index
            results[count] = val
    except StopIteration:
        if default == no_default:
            raise IndexError('index out of range')
    return tuple(results)

>>> iterget([1, 10, 3, 11, 2], iter(range(10)), default=None)
(1, None, 3, None, 2)
jcrichton commented 11 years ago

Hi Erik,

Nice code -- it's quite a tough problem to solve without resorting to tee.

The functions need to be amended slightly to work with iterables, which by definition have no next or __next__ method.

For example:

>>> iterget([1, 3, 5, 2, 3], iter(range(100, 200)))
(101, 103, 105, 102, 103) # works as expected

>>> iterget([1, 3, 5, 2, 3], range(100, 200))
Traceback (most recent call last):
  File "<pyshell#16>", line 1, in <module>
    iterget([1, 3, 5, 2, 3], range(100, 200))
  File "<pyshell#14>", line 4, in iterget
    val = next(seq)
TypeError: 'range' object is not an iterator

If we add one extra line at the start of the function we can fix the problem.


def iterget(ind, seq):
    seq = iter(seq) # only change required
    indices = sorted(zip(ind, itertools.count()))
    results = []
    val = next(seq)
    prev_index = 0
    for index, count in indices:
        if index != prev_index:
            n = index - prev_index - 1
            val = next(itertools.islice(seq, n, n + 1))
            prev_index = index
        results.append((count, val))
    return tuple(item for _, item in sorted(results))

>>> iterget([1, 3, 5, 2, 3], iter(range(100, 200)))
(101, 103, 105, 102, 103)

>>> iterget([1, 3, 5, 2, 3], range(100, 200))
(101, 103, 105, 102, 103)
mrocklin commented 11 years ago

My view is that get depends on indexing while take, drop, nth depend on iteration. I'm not too concerned about get failing on lazy iterators. If we do allow this then the documentations should probably be updated.

eriknw commented 11 years ago

Yeah, I was actually thinking about allowing nth to accept a list of items to return, so it would be analogous to get but for iterators and iterables in general. Is this appropriate for nth, or should a new function be defined?

mrocklin commented 11 years ago

I would prefer extending nth before making a new function. Do you have a particular use case?

eriknw commented 11 years ago

Do you have a particular use case?

Not at present, but I'll try to come up with one, because I can see using this at some point. I thought this was a neat convenience, and intuitive especially when juxtaposed with get.

eriknw commented 11 years ago

If it is decided that being able to pass a list of indices to nth is a good idea (because it is convenient), I made the following branch:

https://github.com/eriknw/toolz/tree/nth_with_list and the diff https://github.com/eriknw/toolz/compare/nth_with_list

mrocklin commented 10 years ago

I found a use case for this. In the future I may need to efficiently select certain rows out of a lazily read file. I might be doing this a lot (and in strange ways) so I want it to be simple (hence in toolz).

This might be an interesting case to test how cytoolz responds to changes in toolz. I'm not in any rush to do this. I figure I'll wait until after @eriknw releases cytoolz to PyPI.