Open groutr opened 7 years ago
In [4]: list(zip(*T))
Out[4]: [(0, 0, 0, 0, 0, 0, 0)]
is what we do with a list-of-tuples (though), first making a list of this before hand doesn't change the result.
I am not sure of what zip does with unbalanced tuples.
So in this case we should be raising (.from_arrays
will raise with unbalanced input arrays, which was a recent change).
So if you can figure out a nice performant way to zip these (w/o losing things), all ears.
Zip terminates at the shortest iterable. As I mentioned in my first post, itertools.zip_longest
(from the standard library) is the ways to solve that (https://docs.python.org/3/library/itertools.html#itertools.zip_longest). What my questions were trying to get at was if there were any underlying reasons for the behavior.
I see that list(zip(*tuples)) seems to do a transpose.
In []: list(T55)
Out[]:
[(0, 1, 2, 3, 4),
(0, 1, 2, 3, 4),
(0, 1, 2, 3, 4),
(0, 1, 2, 3, 4),
(0, 1, 2, 3, 4)]
In []: list(zip(*T55))
Out[]:
[(0, 0, 0, 0, 0),
(1, 1, 1, 1, 1),
(2, 2, 2, 2, 2),
(3, 3, 3, 3, 3),
(4, 4, 4, 4, 4)]
I'd propose the following the if statement.
elif isinstance(tuples, Sequence): # Sequence is an abstract class from collections.abc
arrays = list(lib.to_object_array_tuples(list(tuples)).T)
else:
arrays = list(zip_longest(*tuples))
But, if the elif is checking that tuples is a Sequence, then I think the else clause is largely unnecessary, because if tuples is a sequence, then lib.to_object_array_tuples should be able to handle it. In fact, the call to to_object_array_tuples
maybe wholly unneeded because we can pass a list of tuples directly to MultiIndex.from_arrays (zip_longest will ensure that each inner tuple is the same length)
Maybe the else clause should raise an error if tuples isn't a sequence. This avoids the need to zip anything, unless I'm misunderstanding how lib.to_object_array_tuples can fail.
elif isinstance(tuples, Sequence):
arrays = list(zip_longest(*tuples))
else:
raise ValueError("tuples must be a sequence")
return MultiIndex.from_arrays(arrays, ...)
Upon further investigation, to_object_array_tuples appears to mostly reimplement zip_longest. It seems for small iterables, zip_longest is can be much faster. Though for longer iterables, to_object_array_tuples starts to get an edge (timings done with python 3.5). Should we optimize index creation between large and small indexes?
# d55 = list of tuples like T (from OP) k=55
zip_longest = 19.9us
to_object_array_tuples: 15.5us
#where k = 20
zip_longest = 3.74us
to_object_array_tuples = 4.84us
# k =10
zip_longest = 1.46us
to_object_array_tuples = 3.35us
If you prefer to discuss code details in a PR, I can open one tonight.
not sure of the original rationale for having all of the clauses in the if
so feel free to see what happens if you add your examples as a test (and make changes as u suggest)
Problem description
The docstring leaves some ambiguity as to whether a tuple or list can be passed in. It does say a list of tuples at the top, but in the argument specification, it hints that tuples can be a list/sequence of tuple-likes. A tuple of tuples is a sequence of tuple-likes. I find it strange that my tuple of tuples was truncated to a single level MultiIndex, while my list of tuples was not.
Changing https://github.com/groutr/pandas/blob/master/pandas/indexes/multi.py#L983 to call
itertools.zip_longest
would fix the truncation issue.I also don't understand why sometimes the arrays are transposed and every other case we don't transpose. https://github.com/groutr/pandas/blob/master/pandas/indexes/multi.py#L980-L983
Expected Output
I would expect lists and tuples and other sequences of tuple-likes (sets, for example) to be treated the same. Do lists have a special meaning here?
This is in the latest master (588e29d23fa289916a8b057d093af62f35c6dcf6) and pandas 0.19.1. I'd be happy to submit a PR if this is unintended behavior