pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: pandas.Index takes multidimensional array as input #20285

Open csfarkas opened 6 years ago

csfarkas commented 6 years ago

Code Sample, a copy-pastable example if possible

import pandas as pd

idx = pd.Index(data=[[1, 2], [1, 2], [2, 3]])

# Some cases where this causes an error:
idx.get_duplicates()
idx.drop_duplicates()

Problem description

According to the documentation, pandas.Index takes 1-dimensional array-like data as input, which the example clearly violates.

Expected Output

Option 1: pandas.Index should raise an error in this case.

Option 2: the documentation of pandas.Index should be updated. In that case, the methods of the Index class should be reviewed, since neither get_duplicates nor drop_duplicates is prepared for this kind of input.
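To make Option 1 concrete, here is a minimal sketch of a dimensionality check. The helper name `ensure_1d` and the error message are hypothetical and are not part of pandas; the sketch only assumes that the input can be converted with `numpy.asarray`.

```python
import numpy as np


def ensure_1d(data):
    """Hypothetical helper: reject input with more than one dimension."""
    arr = np.asarray(data)
    if arr.ndim > 1:
        # Nested, equal-length lists become a 2-D array and are rejected here.
        raise ValueError(
            f"Index data must be 1-dimensional, got ndim={arr.ndim}"
        )
    return arr


flat = ensure_1d([1, 2, 3])  # accepted: a plain 1-D sequence

try:
    ensure_1d([[1, 2], [1, 2], [2, 3]])  # the input from this report
    nested_rejected = False
except ValueError:
    nested_rejected = True
```

A check like this would run before the Index is built, so the nested input from the code sample would fail early instead of producing an object-dtype Index.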

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: c818a22417c4d43f55b934caa8ba011ba814b9d5
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-36-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.0.dev0+484.gc818a22
pytest: 3.4.2
pip: 9.0.1
setuptools: 38.5.1
Cython: 0.27.3
numpy: 1.14.1
scipy: 1.0.0
pyarrow: 0.8.0
xarray: 0.10.1
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.2.0
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.4
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: 0.1.3
fastparquet: 0.1.4
pandas_gbq: None
pandas_datareader: None

toobaz commented 5 years ago

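Definitely option 1, as we also want to avoid the following, where an element of the Index is modified in place while the Index object itself stays the same: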

In [2]: idx = pd.Index(data=[[1, 2], [1, 2], [2, 3]])

In [3]: id(idx)
Out[3]: 139839285680560

In [4]: idx[0][0] = 1000

In [5]: idx
Out[5]: Index([[1000, 2], [1, 2], [2, 3]], dtype='object')

In [6]: id(idx)
Out[6]: 139839285680560
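
The session above comes down to list elements staying mutable. A minimal pure-Python sketch of the difference between list and tuple elements follows; the variable names are illustrative only, and the link to the failures in get_duplicates and drop_duplicates is an assumption (lists are unhashable, which hash-based duplicate detection typically requires).

```python
rows_as_lists = [[1, 2], [1, 2], [2, 3]]
rows_as_tuples = [tuple(row) for row in rows_as_lists]

# A list element can be changed in place, as in the session above.
rows_as_lists[0][0] = 1000

# A tuple element cannot be changed in place.
try:
    rows_as_tuples[0][0] = 1000
    tuple_mutation_blocked = False
except TypeError:
    tuple_mutation_blocked = True

# Lists are unhashable, so they cannot be placed in a set or used as dict keys.
try:
    hash(rows_as_lists[0])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# Tuples are hashable, so duplicates can be found with a set.
unique_rows = set(rows_as_tuples)
```
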
toobaz commented 5 years ago

... unless we consider option 3: create a MultiIndex, as in

In [2]: pd.Index(data=[(1, 2), (1, 2), (2, 3)])
Out[2]:
MultiIndex([(1, 2),
            (1, 2),
            (2, 3)],
           )

In any case, xref: #17246
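
As a rough sketch of what option 3 could look like from the user's side, the nested lists can be converted to tuples and passed to `pd.MultiIndex.from_tuples`. This illustrates the idea only and is not a proposed change to the `pd.Index` constructor; the variable names are illustrative.

```python
import pandas as pd

rows = [[1, 2], [1, 2], [2, 3]]

# Convert each inner list to a tuple so that every row becomes one
# hashable MultiIndex entry with two levels.
mi = pd.MultiIndex.from_tuples([tuple(row) for row in rows])

# Duplicate handling works on the MultiIndex.
duplicate_mask = mi.duplicated()
deduplicated = mi.drop_duplicates()
```

With this conversion, the duplicate-related methods mentioned in the report have well-defined behavior, because each entry is an immutable tuple rather than a list.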