pytoolz / toolz

A functional standard library for Python.
http://toolz.readthedocs.org/
Other
4.71k stars 263 forks source link

Add an unzip function? #239

Open davidshepherd7 opened 9 years ago

davidshepherd7 commented 9 years ago

It might useful to have an unzip function that is the inverse of zip. The trivial implementation is just

def unzip(seq):
    return zip(*seq)

It's a little nicer to have a function for this if your code is likely to be read by people who are new(ish) to python and could be confused by the * operator, but it's a pain to have to define it in every project.

(This was also mentioned here but it didn't get any discussion.)

eriknw commented 9 years ago

I agree that x, y = unzip(seq) is more understandable than x, y = zip(*seq). Part of why we like functional programming is that we can give suitable names to operations, so unzip scores a point there.

One of the concerns with unzip is that, unlike most functions in itertoolz, it is not lazy. As a transpose operation, it needs to read the entire sequence into memory. Using izip will not help.

I'm currently undecided whether this should be added to toolz. I think it would be perfect for an "extra toolz" package.

mrocklin commented 9 years ago

I'm -0.5 on adding unzip. I think that the benefits of comprehension added by unzip are outweighed by the costs of adding a new function to toolz.

eriknw commented 9 years ago

Aye, me too. I do occasionally (but not frequently) use zip(*items), though, so my code would arguably be more readable if I used unzip instead.

I think "Tips and Tricks" or the sandbox would be appropriate. The sandbox has the advantage that it's executable code that can be tested, used, and more easily moved into another project should a suitable project arise.

davidshepherd7 commented 9 years ago

One of the concerns with unzip is that, unlike most functions in itertoolz, it is not lazy. As a transpose operation, it needs to read the entire sequence into memory. Using izip will not help.

What do you mean exactly? Isn't zip a lazy transpose operation?

I think the difficulty is in being lazy over two infinite "axes", e.g. zip can't handle an infinite number of finite lists but it can handle a finite number of infinite lists.

Below is an unzip implementation I've been playing with that can handle a finite list of infinite lists (like zip). It might be possible to handle an infinite list of finite lists but I can't figure out how to use tee in such a way.

#! /usr/bin/env python3

import cytoolz as tlz
import itertools as it

def unzip(l):
    iters = it.tee(l) # TODO: This should be replaced by a lazy version
                      # which can yield as many iterators as are needed.
    return it.starmap(tlz.pluck, enumerate(iters))

# # Trivial implementation hangs on test_2 and test_3:
# def unzip(l):
#     return zip(*l)

# Unzips a simple list correctly
def test_1():
    assert list(map(list, unzip([('a', 1), ('b', 2), ('c', 3)]))) == [['a', 'b', 'c'], [1, 2, 3]]

# Can handle a finite list of infinite iterators
def test_2():
    out = unzip(zip(it.count(1), it.repeat([0])))
    a = next(out)
    assert list(tlz.take(10, a)) == list(tlz.take(10, it.count(1)))

# Can handle an infinite list of finite itertors
def test_3():
    out = unzip(it.repeat([1, 2, 3]))
    a = next(out)
    assert list(tlz.take(10, a)) == list(it.repeat(1, 10))

    b = next(out)
    assert list(tlz.take(10, b)) == list(it.repeat(2, 10))

    # # TODO: Can only do fixed (currently with n = 2) number of output iterators, so
    # # this fails
    # c = next(out)
    # assert list(tlz.take(10, c)) == list(it.repeat(3, 10))

if __name__ == "__main__":
    test_1()
    test_2()
    test_3()

There's major a caveat to this implementation: it uses tee, so all the iterators must be either consumed at once or destroyed somehow, otherwise the memory usage will grow. I'm not sure if this is avoidable, any ideas?

I'm -0.5 on adding unzip. I think that the benefits of comprehension added by unzip are outweighed by the costs of adding a new function to toolz.

If we could come up with a good lazy implementation (or at least lazy in one "axis") would you consider it worth including?

mrocklin commented 9 years ago

If we could come up with a good lazy implementation (or at least lazy in one "axis") would you consider it worth including?

We would also need to demonstrate that this is something that people need. In general I would suggest putting such a thing in toolz.sandbox for a while and see if it gets use.

eriknw commented 9 years ago

Just to mirror mrocklin, I'd like to see some examples that are considered idiomatic use of unzip. I'm not thrilled about the use of tee (no other function is toolz uses it), but there's no way around it for having support for streaming.

After playing with this a little, I think the advanced version of unzip should return a tuple of iterators, not an iterator of iterators.

In [152]: from toolz.curried import map, take, drop

In [153]: 

In [153]: # 10x10

In [154]: pipe(unzip(partition(10, range(100))), map(list), list)
Out[154]: 
[[0, 10, 20, 30, 40, 50, 60, 70, 80, 90],
 [1, 11, 21, 31, 41, 51, 61, 71, 81, 91],
 [2, 12, 22, 32, 42, 52, 62, 72, 82, 92],
 [3, 13, 23, 33, 43, 53, 63, 73, 83, 93],
 [4, 14, 24, 34, 44, 54, 64, 74, 84, 94],
 [5, 15, 25, 35, 45, 55, 65, 75, 85, 95],
 [6, 16, 26, 36, 46, 56, 66, 76, 86, 96],
 [7, 17, 27, 37, 47, 57, 67, 77, 87, 97],
 [8, 18, 28, 38, 48, 58, 68, 78, 88, 98],
 [9, 19, 29, 39, 49, 59, 69, 79, 89, 99]]

In [155]: 

In [155]: # 20x20, get 5x5 block diagonals

In [156]: # use of tee results in memory use that doesn't scale

In [157]: A = unzip(partition(20, range(400)))

In [158]: pipe(A, take(5), map(take(5)), map(list), list)
Out[158]: 
[[0, 20, 40, 60, 80],
 [1, 21, 41, 61, 81],
 [2, 22, 42, 62, 82],
 [3, 23, 43, 63, 83],
 [4, 24, 44, 64, 84]]

In [159]: pipe(A, take(5), map(drop(5)), map(take(5)), map(list), list)
Out[159]: 
[[105, 125, 145, 165, 185],
 [106, 126, 146, 166, 186],
 [107, 127, 147, 167, 187],
 [108, 128, 148, 168, 188],
 [109, 129, 149, 169, 189]]

In [160]: pipe(A, take(5), map(drop(10)), map(take(5)), map(list), list)
Out[160]: 
[[210, 230, 250, 270, 290],
 [211, 231, 251, 271, 291],
 [212, 232, 252, 272, 292],
 [213, 233, 253, 273, 293],
 [214, 234, 254, 274, 294]]

In [161]: pipe(A, take(5), map(drop(15)), map(take(5)), map(list), list)
Out[161]: 
[[315, 335, 355, 375, 395],
 [316, 336, 356, 376, 396],
 [317, 337, 357, 377, 397],
 [318, 338, 358, 378, 398],
 [319, 339, 359, 379, 399]]

In [162]: 

In [162]: # more sane way that operates on chunks and limits the impact of tee

In [163]: A = list(unzip(partition(10, xrange(1000000000000000000))))

In [164]: pipe(A, map(take(5)), map(list), list)
Out[164]: 
[[0, 10, 20, 30, 40],
 [1, 11, 21, 31, 41],
 [2, 12, 22, 32, 42],
 [3, 13, 23, 33, 43],
 [4, 14, 24, 34, 44],
 [5, 15, 25, 35, 45],
 [6, 16, 26, 36, 46],
 [7, 17, 27, 37, 47],
 [8, 18, 28, 38, 48],
 [9, 19, 29, 39, 49]]

In [165]: pipe(A, map(take(5)), map(list), list)
Out[165]: 
[[50, 60, 70, 80, 90],
 [51, 61, 71, 81, 91],
 [52, 62, 72, 82, 92],
 [53, 63, 73, 83, 93],
 [54, 64, 74, 84, 94],
 [55, 65, 75, 85, 95],
 [56, 66, 76, 86, 96],
 [57, 67, 77, 87, 97],
 [58, 68, 78, 88, 98],
 [59, 69, 79, 89, 99]]

In [166]: pipe(A, map(take(5)), map(list), list)
Out[166]: 
[[100, 110, 120, 130, 140],
 [101, 111, 121, 131, 141],
 [102, 112, 122, 132, 142],
 [103, 113, 123, 133, 143],
 [104, 114, 124, 134, 144],
 [105, 115, 125, 135, 145],
 [106, 116, 126, 136, 146],
 [107, 117, 127, 137, 147],
 [108, 118, 128, 138, 148],
 [109, 119, 129, 139, 149]]
davidshepherd7 commented 9 years ago

This is my main use case at the moment:

files, classifications, signals = unzip(parse(line) for line in sys.stdin)

where parse(line) returns a tuple (file, classification, signal).

I'm also not too happy with tee, but as far as I can tell there's no way around it.

Yeah I think returning a tuple of iterators is enough. I found a way to do an iterator of iterators (it turns out that tee objects are fine to be copied with copy.copy) but I can't think of any use cases at the moment.

Related to the use of tee: is there any way to encourage garbage collection of spare tee'd iterators so that it's safe to do something like

_, classifications, signals = unzip(parse(line) for line in sys.stdin)

then consume only classifications and signals without wasting memory on _? I suppose you could do something like

classifications, signals = toolz.get([1, 2], unzip(parse(line) for line in sys.stdin))

but it's a bit less clean.

llllllllll commented 9 years ago

We would also need to demonstrate that this is something that people need.

I realize that I am late to this; however, I often use the unzip pattern. The absolute biggest mistake I see is that a, b = zip(*seq) is incorrect if seq is empty. If seq is empty, then zip returns a single empty iterator and the unpack assign results in a value error. I hit this in all kinds of project (just today in blaze somewhere). Because the more correct version is a, b = tuple(zip(*seq)) or ((), ()) I would argue that a, b = unzip(seq) would be far more readable.

+1 from a person that wouldn't mind seeing this move into the core.

yoavram commented 7 years ago

+1. I just came by this issue after implementing a crude version of unzip myself. The use case is very similar to what @davidshepherd7 outlines above:

def parse_line(line):
    m = pattern.match(line)
    if m:
        return m.groups() # gives a 2-tuple
lines = (parse_line(line) for line in lines)
lines = filter(lambda x: x is not None, lines)
ts, ps = toolz.sandbox.unzip(lines)

Quick question: if I want to get concrete lists (for example, to plot them), is it efficient to follow this with

ts, ps = list(ts), list(ps)

?

lumbric commented 5 years ago

I think the docstring of toolz.sandbox.core.unzip(seq) contains two minor mistakes.

Unlike the naive implementation def unzip(seq): zip(*seq) this implementation can handle a finite sequence of infinite sequences.

If seq is a finite sequence of infinite sequences, zip(*seq) works perfectly fine in Python 3.x+. In Python 2.x it won't work. Example with Python 3.6:

>>> from itertools import count, islice
>>> zipped_inf_range = zip(count(), count(), count())  # finite sequence of infinite sequences
>>> list(islice(zipped_inf_range, 5))
[(0, 0, 0), (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4)]
>>> next(zipped_inf_range)
(5, 5, 5)

If seq is in an infinite sequence of finite sequences, zip(*seq) will not terminate, e.g.:

>>> from itertools import count, islice
>>> zipped_inf_range = zip(count(), count(), count())  # finite sequence of infinite sequences
>>> zip(*zipped_inf_range)   # this hangs because *zipped_inf_range is an infinite list of arguments

The top level sequence cannot be infinite.

Why not? seems to work perfectly fine:

>>> from itertools import count, islice
>>> from toolz.sandbox.core import unzip
>>> range1, range2, range3 = unzip(zipped_inf_range)
>>> list(islice(range1, 5)), list(islice(range2, 5)), list(islice(range3, 5))
([0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4])

Let me know if this makes sense, I could prepare a pull request if you want to adapt the docstring.

lumbric commented 5 years ago

I've tried to fix the docstring in #432. Please let me know if this makes sense.