pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.6k stars 17.9k forks source link

Incorrect sequence handling with MultiIndex.from_tuples #14794

Open groutr opened 7 years ago

groutr commented 7 years ago
T = tuple(tuple(range(k)) for k in range(1, 8))
#Note: min(map(len, T)) == 1 and max(map(len, T)) == 7
bad = pd.MultiIndex.from_tuples(T)
bad.nlevels  # == 1 ???

good = pd.MultiIndex.from_tuples(list(T))
good.nlevels # == 7

Problem description

The docstring leaves some ambiguity as to whether a tuple or list can be passed in. It does say a list of tuples at the top, but in the argument specification, it hints that tuples can be a list/sequence of tuple-likes. A tuple of tuples is a sequence of tuple-likes. I find it strange that my tuple of tuples was truncated to a single level MultiIndex, while my list of tuples was not.
Changing https://github.com/groutr/pandas/blob/master/pandas/indexes/multi.py#L983 to call itertools.zip_longest would fix the truncation issue.

I also don't understand why sometimes the arrays are transposed and every other case we don't transpose. https://github.com/groutr/pandas/blob/master/pandas/indexes/multi.py#L980-L983

Expected Output

I would expect lists and tuples and other sequences of tuple-likes (sets, for example) to be treated the same. Do lists have a special meaning here?

This is in the latest master (588e29d23fa289916a8b057d093af62f35c6dcf6) and pandas 0.19.1. I'd be happy to submit a PR if this is unintended behavior

jreback commented 7 years ago
In [4]: list(zip(*T))
Out[4]: [(0, 0, 0, 0, 0, 0, 0)]

is what we do with a list-of-tuples (though), first making a list of this before hand doesn't change the result.

I am not sure of what zip does with unbalanced tuples.

So in this case we should be raising (.from_arrays will raise with unbalanced input arrays, which was a recent change).

So if you can figure out a nice performant way to zip these (w/o losing things), all ears.

groutr commented 7 years ago

Zip terminates at the shortest iterable. As I mentioned in my first post, itertools.zip_longest (from the standard library) is the ways to solve that (https://docs.python.org/3/library/itertools.html#itertools.zip_longest). What my questions were trying to get at was if there were any underlying reasons for the behavior.

I see that list(zip(*tuples)) seems to do a transpose.

In []: list(T55)
Out[]:
[(0, 1, 2, 3, 4),
 (0, 1, 2, 3, 4),
 (0, 1, 2, 3, 4),
 (0, 1, 2, 3, 4),
 (0, 1, 2, 3, 4)]
In []: list(zip(*T55))
Out[]:
[(0, 0, 0, 0, 0),
 (1, 1, 1, 1, 1),
 (2, 2, 2, 2, 2),
 (3, 3, 3, 3, 3),
 (4, 4, 4, 4, 4)]

I'd propose the following the if statement.

        elif isinstance(tuples, Sequence):     # Sequence is an abstract class from collections.abc
            arrays = list(lib.to_object_array_tuples(list(tuples)).T)
        else:
            arrays = list(zip_longest(*tuples))

But, if the elif is checking that tuples is a Sequence, then I think the else clause is largely unnecessary, because if tuples is a sequence, then lib.to_object_array_tuples should be able to handle it. In fact, the call to to_object_array_tuples maybe wholly unneeded because we can pass a list of tuples directly to MultiIndex.from_arrays (zip_longest will ensure that each inner tuple is the same length)

Maybe the else clause should raise an error if tuples isn't a sequence. This avoids the need to zip anything, unless I'm misunderstanding how lib.to_object_array_tuples can fail.

        elif isinstance(tuples, Sequence):
            arrays = list(zip_longest(*tuples))
        else:
            raise ValueError("tuples must be a sequence")
        return MultiIndex.from_arrays(arrays, ...)

Upon further investigation, to_object_array_tuples appears to mostly reimplement zip_longest. It seems for small iterables, zip_longest is can be much faster. Though for longer iterables, to_object_array_tuples starts to get an edge (timings done with python 3.5). Should we optimize index creation between large and small indexes?

# d55 = list of tuples like T (from OP) k=55
zip_longest = 19.9us
to_object_array_tuples: 15.5us

#where k = 20
zip_longest = 3.74us
to_object_array_tuples = 4.84us

# k =10
zip_longest = 1.46us
to_object_array_tuples = 3.35us

If you prefer to discuss code details in a PR, I can open one tonight.

jreback commented 7 years ago

not sure of the original rationale for having all of the clauses in the if

so feel free to see what happens if you add your examples as a test (and make changes as u suggest)