Open toobaz opened 6 years ago
In [2]: pd.Index([1, 1.], dtype=object).is_unique
Out[2]: False
... which is correct (1 == 1.), and means that the error message about unicity is correct. So the wrong thing is that the other call is automatically casting to int.
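The equality this comment relies on can be checked in plain Python, without pandas. The following is a minimal sketch; the variable names are illustrative and do not come from the thread.

```python
# An int and the float with the same value compare equal and hash
# identically, so any hash-based uniqueness check sees a single value.
values = [1, 1.0]

compare_equal = values[0] == values[1]
same_hash = hash(values[0]) == hash(values[1])

# Deduplicating by equality collapses the pair to one element.
distinct_by_equality = len(set(values))

# Keying on (type, value) keeps them apart (illustrative only).
distinct_by_type = len({(type(v), v) for v in values})

print(compare_equal, same_hash, distinct_by_equality, distinct_by_type)
```

Whether a container treats 1 and 1.0 as one entry or two therefore depends on whether it deduplicates by equality alone or also takes the type into account.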
and means that the error message about unicity is correct
so does this mean that the second example is actually giving the expected output?
and that to initialize the MultiIndex you should add verify_integrity=False:
pd.MultiIndex(levels=[pd.Index([1, 1.], dtype=object), range(3)],
labels=[[0,1], [1,1]], verify_integrity=False)
which gives a MultiIndex containing both 1 and 1.0 in its first level.
MultiIndex(levels=[[1, 1.0], [0, 1, 2]], labels=[[0, 1], [1, 1]])
if we, for the moment, ignore the float from the problem description and 'manually' create a MultiIndex using pd.Index([1, 1], dtype=object) for the first level:
>>> pd.MultiIndex(levels=[pd.Index([1, 1], dtype=object), range(3)],
... labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
we get:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "...\pandas\core\indexes\multi.py", line 242, in __new__
result._verify_integrity()
File "...\pandas\core\indexes\multi.py", line 285, in _verify_integrity
level=i))
ValueError: Level values must be unique: [1, 1] on level 0
which makes sense, so we add verify_integrity=False to get the expected output:
>>> pd.MultiIndex(levels=[pd.Index([1, 1], dtype=object), range(3)],
... labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]], verify_integrity=False)
MultiIndex(levels=[[1, 1], [0, 1, 2]],
labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
>>>
if we now try to re-create this using MultiIndex.from_product(), we get:
>>> pd.MultiIndex.from_product([pd.Index([1, 1], dtype=object), range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
which is not the same as the output from either of the two previous cases!
since the from_product() method does not have a verify_integrity parameter, the expected output would be the ValueError: Level values must be unique: [1, 1] on level 0 since this is the default for pd.MultiIndex
so going back to the original issue, it appears it is not related to the input containing a float and that the expected output from the first example in the issue description should actually be:
ValueError: Level values must be unique: [1, 1.] on level 0
i think this then raises the question: Should MultiIndex.from_product() have a verify_integrity parameter?
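To make the question concrete, here is a rough sketch of the kind of check such a parameter might switch on or off before the product is built. The helper `check_levels_unique` is hypothetical and not part of pandas; only the message format is taken from the traceback quoted in this thread.

```python
def check_levels_unique(iterables):
    """Raise ValueError if any iterable contains values that compare equal.

    Hypothetical helper for illustration, not a pandas function.
    """
    for i, values in enumerate(iterables):
        values = list(values)
        seen = []
        for v in values:
            # Equality-based membership test, so 1, 1.0 and True collide.
            if v in seen:
                raise ValueError(
                    "Level values must be unique: {} on level {}".format(values, i)
                )
            seen.append(v)


try:
    check_levels_unique([[1, 1.0], range(3)])
except ValueError as exc:
    message = str(exc)
    print(message)
```

With a check like this in front, inputs whose values compare equal would raise instead of being silently collapsed.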
since the from_product() method does not have a verify_integrity parameter, the expected output would be the ValueError: Level values must be unique: [1, 1] on level 0 since this is the default for pd.MultiIndex
I disagree: the from_product docs (and intuition) just refer to the "cartesian product of iterables", not in any way to the underlying levels.
Vice-versa, when you do pd.MultiIndex(levels=...) you are clearly passing levels, so it is OK to check unicity and raise.
But indeed the problem is more subtle than I thought: ideally, we would want pd.Index([1, 1.], dtype=object).is_unique to return True, but it's maybe too late to change. So assuming that does return False, and that MultiIndex levels must be unique, we can't have both an int and its float representation in a same MultiIndex level.
The only doubt then is whether we should favour the float, rather than int, representation, given that for instance pd.Index([1, 1.]) gives a Float64Index.
... and that MultiIndex levels must be unique, we can't have both an int and its float representation in a same MultiIndex level.
...when using from_product()
The only doubt then is whether we should favour the float, rather than int, representation, given that for instance pd.Index([1, 1.]) gives a Float64Index.
swapping the order of the float and int gives a float for the first level:
>>> pd.MultiIndex.from_product([pd.Index([1, 1.], dtype=object), range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
>>> pd.MultiIndex.from_product([pd.Index([1., 1], dtype=object), range(3)])
MultiIndex(levels=[[1.0], [0, 1, 2]],
labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
so it appears not to be a casting issue as the issue title suggests?
... and that MultiIndex levels must be unique ...
Indeed, according to the documentation for both pandas.MultiIndex and pandas.MultiIndex.from_product
and yet in the non-float example, i passed a non-unique iterable as the first iterable and got a result instead of a value error:
>>> pd.Index([1, 1], dtype=object).is_unique
False
>>> pd.MultiIndex.from_product([pd.Index([1, 1], dtype=object), range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
I think it is also worth noting that:
>>>
>>> import pandas as pd
>>>
>>> pd.MultiIndex.from_product([[1, 1], range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
>>> pd.MultiIndex.from_product([[1, True], range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
>>> pd.MultiIndex.from_product([[1.0, True], range(3)])
MultiIndex(levels=[[1.0], [0, 1, 2]],
labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
>>> pd.MultiIndex.from_product([[True, 1], range(3)])
MultiIndex(levels=[[True], [0, 1, 2]],
labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
are probably not giving the expected output either.
which could result in:
>>>
>>> a = 19998989890
>>> b = 19998989889 +1
>>> a is b
False
>>> a == b
True
>>> pd.MultiIndex.from_product([[a,b], range(3)])
MultiIndex(levels=[[19998989890], [0, 1, 2]],
labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
>>>
>>> from pandas.core.arrays.categorical import _factorize_from_iterables
>>>
>>> labels, levels =_factorize_from_iterables([[1, True], range(3)])
>>> labels
[array([0, 0], dtype=int8), array([0, 1, 2], dtype=int8)]
>>> levels
[Int64Index([1], dtype='int64'), Int64Index([0, 1, 2], dtype='int64')]
>>>
>>> from pandas.core.reshape.util import cartesian_product
>>>
>>> labels = cartesian_product(labels)
>>> labels
[array([0, 0, 0, 0, 0, 0], dtype=int8), array([0, 1, 2, 0, 1, 2], dtype=int8)]
>>>
>>> pd.MultiIndex(levels, labels)
MultiIndex(levels=[[1], [0, 1, 2]],
labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
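The two steps walked through above (factorize each iterable, then take the cartesian product of the resulting codes) can be imitated in plain Python. This is a stand-in sketch under stated assumptions, not the pandas implementation: `factorize` and `from_product_sketch` are illustrative names, the first occurrence of each value is kept, and the unique values are not sorted.

```python
from itertools import product


def factorize(values):
    """Return (codes, uniques), treating values that compare equal as one."""
    positions = {}  # value -> code; dict lookup uses == and hash
    codes = []
    for v in values:
        codes.append(positions.setdefault(v, len(positions)))
    return codes, list(positions)


def from_product_sketch(iterables):
    """Factorize each iterable, then take the cartesian product of codes."""
    factorized = [factorize(list(it)) for it in iterables]
    levels = [uniques for _, uniques in factorized]
    rows = list(product(*(codes for codes, _ in factorized)))
    # Transpose the rows into one list of codes per level.
    codes = [list(col) for col in zip(*rows)]
    return levels, codes


levels, codes = from_product_sketch([[1, True], range(3)])
print(levels)
print(codes)
```

Because the dictionary lookup treats 1 and True as the same key, the first level ends up with a single entry, which is the same collapse seen in the session above.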
it appears that from_product() would need to use a different implementation of _factorize_from_iterables
>>> import pandas as pd
>>> from pandas.core.arrays.categorical import _factorize_from_iterables
use 3 in the first iterable so that the objects do not compare equal
>>> labels, levels =_factorize_from_iterables([[3, True], range(3)])
>>> labels
[array([1, 0], dtype=int8), array([0, 1, 2], dtype=int8)]
>>> levels
[Index([True, 3], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]
change the 3 back to a 1 so that the first iterable has different objects which compare equal
>>> levels = [pd.Index([1, True], dtype='object'), pd.Int64Index([0, 1, 2], dtype='int64')]
>>> levels
[Index([1, True], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]
>>> from pandas.core.reshape.util import cartesian_product
>>> labels = cartesian_product(labels)
>>> labels
[array([1, 1, 1, 0, 0, 0], dtype=int8), array([0, 1, 2, 0, 1, 2], dtype=int8)]
>>> pd.MultiIndex(levels, labels, verify_integrity=False)
MultiIndex(levels=[[1, True], [0, 1, 2]],
labels=[[1, 1, 1, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
which is the expected output?
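One way a different factorization could keep such values apart is to key on the type as well as the value. The sketch below is purely illustrative: `factorize_by_type` is a hypothetical name, pandas has no such function, and unlike the session above it does not sort the unique values.

```python
def factorize_by_type(values):
    """Return (codes, uniques), keeping equal values of different types apart.

    Hypothetical variant for illustration only.
    """
    positions = {}  # (type, value) -> code
    uniques = []
    codes = []
    for v in values:
        key = (type(v), v)
        if key not in positions:
            positions[key] = len(uniques)
            uniques.append(v)
        codes.append(positions[key])
    return codes, uniques


codes, uniques = factorize_by_type([1, True, 1.0, 1])
print(codes, uniques)
```

Levels built this way still contain values that compare equal, so an equality-based uniqueness check would reject them unless it is disabled.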
changing _factorize_from_iterables alone would give a value error unless MultiIndex is called with verify_integrity=False
>>> pd.MultiIndex(levels, labels)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\simon\OneDrive\code\pandas-simonjayhawkins\pandas\core\indexes\multi.py", line 242, in __new__
result._verify_integrity()
File "C:\Users\simon\OneDrive\code\pandas-simonjayhawkins\pandas\core\indexes\multi.py", line 285, in _verify_integrity
level=i))
ValueError: Level values must be unique: [1, True] on level 0
MultiIndex.from_product() casts float to int when corresponding int is also present
it depends on the ordering:
>>>
>>> import pandas as pd
>>> from pandas.core.arrays.categorical import _factorize_from_iterables
>>> labels, levels =_factorize_from_iterables([pd.Index([1, 1., 2., 2], dtype=object), range(3)])
>>> levels
[Index([1, 2.0], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]
and the index type is unchanged:
>>> labels, levels =_factorize_from_iterables([pd.Index([1, 1.], dtype=object), range(3)])
>>> levels
[Index([1], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]
>>> labels, levels =_factorize_from_iterables([pd.Index([1., 1], dtype=object), range(3)])
>>> levels
[Index([1.0], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]
>>>
>>>
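The order dependence for object-dtype input is what one would expect from first-occurrence-wins deduplication. A plain-Python analogy (not a description of pandas internals) uses a dict, whose keys are matched by equality and hash and which keeps the first key inserted:

```python
# The first key inserted is the one that is kept, so which
# representation survives depends on the input order.
int_first = list(dict.fromkeys([1, 1.0, 2.0, 2]))
float_first = list(dict.fromkeys([1.0, 1, 2, 2.0]))

print([type(v).__name__ for v in int_first])
print([type(v).__name__ for v in float_first])
```

For the input [1, 1., 2., 2] the survivors are an int 1 and a float 2.0, the same mix that appears in the first object-dtype result above.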
if a list is passed as the first iterable instead of an index object, then the int is cast to a float, not the float cast to an int:
>>>
>>> labels, levels =_factorize_from_iterables([[1., 1], range(3)])
>>> levels
[Float64Index([1.0], dtype='float64'), Int64Index([0, 1, 2], dtype='int64')]
>>> labels, levels =_factorize_from_iterables([[1, 1.], range(3)])
>>> levels
[Float64Index([1.0], dtype='float64'), Int64Index([0, 1, 2], dtype='int64')]
>>>
>>>
>>> pd.MultiIndex.from_product([[1, 1., 2., 2], range(3)])
MultiIndex(levels=[[1.0, 2.0], [0, 1, 2]],
labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>>
if a list is passed as the first iterable instead of an index object, then the int is cast to a float, not the float cast to an int:
unless the list also contains booleans and then it depends on the ordering again:
>>>
>>>
>>> pd.MultiIndex.from_product([pd.Index([1, 1., 2., 2, True, False], dtype=object), range(3)])
MultiIndex(levels=[[False, 1, 2.0], [0, 1, 2]],
labels=[[1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 0, 0, 0], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> pd.MultiIndex.from_product([[1, 1., 2., 2, True, False], range(3)])
MultiIndex(levels=[[False, 1, 2.0], [0, 1, 2]],
labels=[[1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 0, 0, 0], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>>
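The levels and labels in this last output are consistent with "keep the first occurrence, then sort", although that is an assumption about the mechanism rather than something confirmed in this thread. A plain-Python sketch of that reading:

```python
values = [1, 1.0, 2.0, 2, True, False]

# First occurrence wins among values that compare equal.
first_seen = list(dict.fromkeys(values))

# int, float and bool are mutually comparable, so sorting works.
ordered = sorted(first_seen)

# Code of each input value = position of its equal value in `ordered`.
codes = [ordered.index(v) for v in values]

print(ordered)
print(codes)
```

Each code here corresponds to one run of three identical labels in the first level of the output above, since every value is paired with the three entries of range(3).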
I believe you should use pd.MultiIndex.from_product, since creating a pd.MultiIndex directly with object-dtype levels treats seemingly duplicate values (e.g., 1 and 1.0) as errors, whereas from_product works.
Code Sample, a copy-pastable example if possible
Out[2]: array([(1, 0), (1, 1), (1, 2), (1, 0), (1, 1), (1, 2)], dtype=object)
In [3]: pd.MultiIndex(levels=[pd.Index([1, 1.], dtype=object), range(3)], labels=[[0,1], [1,1]]
...: )
Problem
If a flat (object) Index allows us to distinguish 1 and 1., MultiIndex should do the same. From https://github.com/pandas-dev/pandas/issues/18913#issuecomment-353748642
Expected Output
Both the first two examples should return a MultiIndex containing both 1 and 1.0 in its first level.
Output of pd.show_versions()