pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

MultiIndex.from_product() casts float to int when corresponding int is also present #19432

Open toobaz opened 6 years ago

toobaz commented 6 years ago

Code Sample, a copy-pastable example if possible

In [2]: pd.MultiIndex.from_product([pd.Index([1, 1.], dtype=object), range(3)]).values
Out[2]: array([(1, 0), (1, 1), (1, 2), (1, 0), (1, 1), (1, 2)], dtype=object)

In [3]: pd.MultiIndex(levels=[pd.Index([1, 1.], dtype=object), range(3)], labels=[[0,1], [1,1]]
   ...: )
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-a2ad432a7b0f> in <module>()
----> 1 pd.MultiIndex(levels=[pd.Index([1, 1.], dtype=object), range(3)], labels=[[0,1], [1,1]])

/home/nobackup/repo/pandas/pandas/core/indexes/multi.py in __new__(cls, levels, labels, sortorder, names, copy, verify_integrity, _set_identity, name, **kwargs)
    238 
    239         if verify_integrity:
--> 240             result._verify_integrity()
    241         if _set_identity:
    242             result._reset_identity()

/home/nobackup/repo/pandas/pandas/core/indexes/multi.py in _verify_integrity(self, labels, levels)
    281                                  "level {level}".format(
    282                                      values=[value for value in level],
--> 283                                      level=i))
    284 
    285     @property

ValueError: Level values must be unique: [1, 1.0] on level 0

In [4]: pd.Index([1, 1.], dtype=object)
Out[4]: Index([1, 1.0], dtype='object')

Problem

If a flat (object) Index allows us to distinguish 1 and 1., MultiIndex should do the same. From https://github.com/pandas-dev/pandas/issues/18913#issuecomment-353748642

Expected Output

Both the first two examples should return a MultiIndex containing both 1 and 1.0 in its first level.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: 8cbee356da1161c56c64f6f89cb5548bcadc3e44
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-5-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.23.0.dev0+182.g8cbee356d.dirty
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: 0.9.6
lxml: 4.1.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1

toobaz commented 6 years ago
In [2]: pd.Index([1, 1.], dtype=object).is_unique
Out[2]: False

... which is correct (1 == 1.), and means that the error message about unicity is correct. So the actual problem is that the other call (from_product) silently casts to int.
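
The equality mentioned here is plain Python behaviour rather than anything specific to pandas. A minimal sketch, using only built-ins, of why a check based on hashing and equality cannot tell 1 from 1.0 (the variable names are illustrative only):

```python
# Plain-Python illustration (no pandas): 1 and 1.0 are objects of different
# types, but they compare equal and hash identically.
a, b = 1, 1.0

same_value = a == b              # True
same_hash = hash(a) == hash(b)   # True
same_type = type(a) is type(b)   # False

# A container keyed on hash/equality keeps one representative: the first seen.
int_first = list(dict.fromkeys([1, 1.0]))    # [1]
float_first = list(dict.fromkeys([1.0, 1]))  # [1.0]

print(same_value, same_hash, same_type)
print(int_first, type(int_first[0]).__name__)
print(float_first, type(float_first[0]).__name__)
```

Which of the two representations survives depends only on which one comes first.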

simonjayhawkins commented 6 years ago

and means that the error message about unicity is correct

so does this mean that the second example is actually giving the expected output?

and that to initialize the MultiIndex you should add verify_integrity=False:

pd.MultiIndex(levels=[pd.Index([1, 1.], dtype=object), range(3)],
              labels=[[0,1], [1,1]], verify_integrity=False)

which gives a MultiIndex containing both 1 and 1.0 in its first level.

MultiIndex(levels=[[1, 1.0], [0, 1, 2]], labels=[[0, 1], [1, 1]])

simonjayhawkins commented 6 years ago

if we, for the moment, ignore the float from the problem description and 'manually' create a MultiIndex using pd.Index([1, 1], dtype=object) for the first level:

>>> pd.MultiIndex(levels=[pd.Index([1, 1], dtype=object), range(3)],
...               labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])

we get:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "...\pandas\core\indexes\multi.py", line 242, in __new__
    result._verify_integrity()
  File "...\pandas\core\indexes\multi.py", line 285, in _verify_integrity
    level=i))
ValueError: Level values must be unique: [1, 1] on level 0

which makes sense, so we add verify_integrity=False to get the expected output:

>>> pd.MultiIndex(levels=[pd.Index([1, 1], dtype=object), range(3)],
...               labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]], verify_integrity=False)
MultiIndex(levels=[[1, 1], [0, 1, 2]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
>>>

if we now try to re-create this using MultiIndex.from_product(), we get:


>>> pd.MultiIndex.from_product([pd.Index([1, 1], dtype=object), range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>

which is not the same as the output from either of the two previous cases!

since the from_product() method does not have a verify_integrity parameter, the expected output would be the ValueError: Level values must be unique: [1, 1] on level 0 since this is the default for pd.MultiIndex

so going back to the original issue, it appears it is not related to the input containing a float and that the expected output from the first example in the issue description should actually be:

ValueError: Level values must be unique: [1, 1.0] on level 0

i think this then raises the question: Should MultiIndex.from_product() have a verify_integrity parameter?

toobaz commented 6 years ago

since the from_product() method does not have a verify_integrity parameter, the expected output would be the ValueError: Level values must be unique: [1, 1] on level 0 since this is the default for pd.MultiIndex

I disagree: the from_product docs (and intuition) just refer to the "cartesian product of iterables", not in any way to the underlying levels.

Vice versa, when you do pd.MultiIndex(levels=...) you are clearly passing levels, so it is OK to check uniqueness and raise.

But indeed the problem is more subtle than I thought: ideally, we would want pd.Index([1, 1.], dtype=object).is_unique to return True, but it's maybe too late to change. So assuming that it does return False, and that MultiIndex levels must be unique, we can't have both an int and its float representation in the same MultiIndex level.

The only doubt then is whether we should favour the float, rather than int, representation, given that for instance pd.Index([1, 1.]) gives a Float64Index.

simonjayhawkins commented 6 years ago

... and that MultiIndex levels must be unique, we can't have both an int and its float representation in the same MultiIndex level.

...when using from_product()

The only doubt then is whether we should favour the float, rather than int, representation, given that for instance pd.Index([1, 1.]) gives a Float64Index.

swapping the order of the float and int gives a float for the first level:

>>> pd.MultiIndex.from_product([pd.Index([1, 1.], dtype=object), range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
>>> pd.MultiIndex.from_product([pd.Index([1., 1], dtype=object), range(3)])
MultiIndex(levels=[[1.0], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>

so it appears not to be a casting issue as the issue title suggests?

simonjayhawkins commented 6 years ago

... and that MultiIndex levels must be unique ...

Indeed, according to the documentation for both pandas.MultiIndex and pandas.MultiIndex.from_product

and yet in the non-float example, i passed a non-unique iterable as the first iterable and got a result instead of a value error:

>>> pd.Index([1, 1], dtype=object).is_unique
False
>>> pd.MultiIndex.from_product([pd.Index([1, 1], dtype=object), range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
simonjayhawkins commented 6 years ago

I think it is also worth noting that:

>>>
>>> import pandas as pd
>>>
>>> pd.MultiIndex.from_product([[1, 1], range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
>>> pd.MultiIndex.from_product([[1, True], range(3)])
MultiIndex(levels=[[1], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
>>> pd.MultiIndex.from_product([[1.0, True], range(3)])
MultiIndex(levels=[[1.0], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
>>> pd.MultiIndex.from_product([[True, 1], range(3)])
MultiIndex(levels=[[True], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>

are probably not giving the expected output either.
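
The four results above are consistent with a factorization that keeps the first object seen among values that compare equal. The sketch below is a pure-Python stand-in, not pandas' actual implementation: `factorize_first_seen` is a hypothetical helper, and it skips the sorting and dtype inference that pandas also applies.

```python
def factorize_first_seen(values):
    """Return (codes, uniques), keeping the first object among equal values.

    Hypothetical helper for illustration only; not pandas' implementation.
    """
    positions = {}  # value -> index into uniques (keyed on hash/equality)
    uniques = []
    codes = []
    for value in values:
        if value not in positions:
            positions[value] = len(uniques)
            uniques.append(value)
        codes.append(positions[value])
    return codes, uniques


# The first-level inputs used in the examples above.
for first_level in ([1, 1], [1, True], [1.0, True], [True, 1]):
    codes, uniques = factorize_first_seen(first_level)
    print(first_level, "->", uniques, codes)
```

Because 1, 1.0 and True all compare equal and share a hash, each input collapses to a single level value, and its type is that of whichever element came first.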

simonjayhawkins commented 6 years ago

which could result in:

>>>
>>> a = 19998989890
>>> b = 19998989889 +1
>>> a is b
False
>>> a == b
True
>>> pd.MultiIndex.from_product([[a,b], range(3)])
MultiIndex(levels=[[19998989890], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>
simonjayhawkins commented 6 years ago

https://github.com/pandas-dev/pandas/blob/e2e1a1051576a48f210ce17272fc24b90ebcf24a/pandas/core/indexes/multi.py#L1357-L1367

>>>
>>> from pandas.core.arrays.categorical import _factorize_from_iterables
>>>
>>> labels, levels =_factorize_from_iterables([[1, True], range(3)])
>>> labels
[array([0, 0], dtype=int8), array([0, 1, 2], dtype=int8)]
>>> levels
[Int64Index([1], dtype='int64'), Int64Index([0, 1, 2], dtype='int64')]
>>>
>>> from pandas.core.reshape.util import cartesian_product
>>>
>>> labels = cartesian_product(labels)
>>> labels
[array([0, 0, 0, 0, 0, 0], dtype=int8), array([0, 1, 2, 0, 1, 2], dtype=int8)]
>>>
>>> pd.MultiIndex(levels, labels)
MultiIndex(levels=[[1], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0], [0, 1, 2, 0, 1, 2]])
>>>

it appears that from_product() would need to use a different implementation of _factorize_from_iterables
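
The steps traced above (factorize each iterable, take the Cartesian product of the resulting integer codes, then build the index from levels and codes) can be mimicked without pandas. This is a rough sketch under that assumption; `factorize` and `product_codes` are illustrative names, not pandas functions.

```python
from itertools import product


def factorize(values):
    """Map values to integer codes; equal values share one code (first seen wins)."""
    uniques = list(dict.fromkeys(values))
    codes = [uniques.index(v) for v in values]
    return codes, uniques


def product_codes(iterables):
    """Factorize every iterable, then expand the codes as a Cartesian product."""
    factorized = [factorize(list(it)) for it in iterables]
    levels = [uniques for _, uniques in factorized]
    rows = product(*(codes for codes, _ in factorized))
    # Transpose the rows so there is one code list per level.
    codes = [list(column) for column in zip(*rows)]
    return levels, codes


levels, codes = product_codes([[1, True], range(3)])
print(levels)  # the duplicate is already gone before the product is taken
print(codes)
```

In this sketch the collapse of 1 and True happens in the factorization step, so the product step never sees two distinct first-level values.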

simonjayhawkins commented 6 years ago
>>> import pandas as pd
>>> from pandas.core.arrays.categorical import _factorize_from_iterables

use 3 in the first iterable so that the objects do not compare equal

>>> labels, levels =_factorize_from_iterables([[3, True], range(3)])
>>> labels
[array([1, 0], dtype=int8), array([0, 1, 2], dtype=int8)]
>>> levels
[Index([True, 3], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]

change the 3 back to a 1 so that the first iterable has different objects which compare equal

>>> levels = [pd.Index([1, True], dtype='object'), pd.Int64Index([0, 1, 2], dtype='int64')]
>>> levels
[Index([1, True], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]
>>> from pandas.core.reshape.util import cartesian_product
>>> labels = cartesian_product(labels)
>>> labels
[array([1, 1, 1, 0, 0, 0], dtype=int8), array([0, 1, 2, 0, 1, 2], dtype=int8)]
>>> pd.MultiIndex(levels, labels, verify_integrity=False)
MultiIndex(levels=[[1, True], [0, 1, 2]],
           labels=[[1, 1, 1, 0, 0, 0], [0, 1, 2, 0, 1, 2]])

which is the expected output?

changing _factorize_from_iterables alone would give a value error unless MultiIndex is called with verify_integrity=False

>>> pd.MultiIndex(levels, labels)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\simon\OneDrive\code\pandas-simonjayhawkins\pandas\core\indexes\multi.py", line 242, in __new__
    result._verify_integrity()
  File "C:\Users\simon\OneDrive\code\pandas-simonjayhawkins\pandas\core\indexes\multi.py", line 285, in _verify_integrity
    level=i))
ValueError: Level values must be unique: [1, True] on level 0
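
To make the trade-off concrete: one hypothetical way to keep 1 and True apart is to key the factorization on type as well as value, but an equality-based uniqueness check would then reject the resulting level, which mirrors the ValueError above. Both helpers below are illustrative stand-ins, not pandas code.

```python
def factorize_by_type(values):
    """Factorize keyed on (type, value), so 1, 1.0 and True stay distinct."""
    positions = {}
    uniques = []
    codes = []
    for value in values:
        key = (type(value), value)
        if key not in positions:
            positions[key] = len(uniques)
            uniques.append(value)
        codes.append(positions[key])
    return codes, uniques


def is_unique_by_equality(level):
    """Stand-in for an equality-based uniqueness check on a level."""
    return len(set(level)) == len(level)


codes, level = factorize_by_type([1, True])
print(level, codes)
print(is_unique_by_equality(level))
```

Here the level keeps both objects, yet the equality-based check reports it as non-unique, so a type-aware factorization alone would not be enough.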
simonjayhawkins commented 6 years ago

MultiIndex.from_product() casts float to int when corresponding int is also present

it depends on the ordering:

>>>
>>> import pandas as pd
>>> from pandas.core.arrays.categorical import _factorize_from_iterables
>>> labels, levels =_factorize_from_iterables([pd.Index([1, 1., 2., 2], dtype=object), range(3)])
>>> levels
[Index([1, 2.0], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]

and the index type is unchanged:

>>> labels, levels =_factorize_from_iterables([pd.Index([1, 1.], dtype=object), range(3)])
>>> levels
[Index([1], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]
>>> labels, levels =_factorize_from_iterables([pd.Index([1., 1], dtype=object), range(3)])
>>> levels
[Index([1.0], dtype='object'), Int64Index([0, 1, 2], dtype='int64')]
>>>
>>>

if a list is passed as the first iterable instead of an index object, then the int is cast to a float, not the float cast to an int:

>>>
>>> labels, levels =_factorize_from_iterables([[1., 1], range(3)])
>>> levels
[Float64Index([1.0], dtype='float64'), Int64Index([0, 1, 2], dtype='int64')]
>>> labels, levels =_factorize_from_iterables([[1, 1.], range(3)])
>>> levels
[Float64Index([1.0], dtype='float64'), Int64Index([0, 1, 2], dtype='int64')]
>>>
>>>
>>> pd.MultiIndex.from_product([[1, 1., 2., 2], range(3)])
MultiIndex(levels=[[1.0, 2.0], [0, 1, 2]],
           labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>>
simonjayhawkins commented 6 years ago

if a list is passed as the first iterable instead of an index object, then the int is cast to a float, not the float cast to an int:

unless the list also contains booleans and then it depends on the ordering again:

>>>
>>>
>>> pd.MultiIndex.from_product([pd.Index([1, 1., 2., 2, True, False], dtype=object), range(3)])
MultiIndex(levels=[[False, 1, 2.0], [0, 1, 2]],
           labels=[[1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 0, 0, 0], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> pd.MultiIndex.from_product([[1, 1., 2., 2, True, False], range(3)])
MultiIndex(levels=[[False, 1, 2.0], [0, 1, 2]],
           labels=[[1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 0, 0, 0], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>>
Paradox456 commented 2 months ago

I believe you should use pd.MultiIndex.from_product, since creating a pd.MultiIndex directly with object-dtype levels treats seemingly duplicate values (e.g., 1 and 1.0) as errors, whereas from_product works.

