pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.8k stars 17.98k forks source link

read_excel provides uninformative error message when reading from sheet with duplicate multiindex columns #19395

Open spillz opened 6 years ago

spillz commented 6 years ago

Code Sample, a copy-pastable example if possible

d=pandas.DataFrame({('A','A'): [0,1,2,3,4,5]})
d[('A2','A2')] = 1
d.columns = [['A','A'],['A','A']]
print(d)
d.to_csv('TEST.csv')
print(pandas.read_csv('TEST.csv',header=[0,1], index_col=0))
d.to_excel('TEST.xlsx')
print(pandas.read_excel('TEST.xlsx',header=[0,1], index_col=0))

Problem description

The above snippet generates the following output:

   A   
   A  A
0  0  1
1  1  1
2  2  1
3  3  1
4  4  1
5  5  1
   (A, A)  ('A', 'A').1
0       0             1
1       1             1
2       2             1
3       3             1
4       4             1
5       5             1
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-110-4745297eb588> in <module>()
      6 print(pandas.read_csv('TEST.csv',header=[0,1], index_col=0))
      7 d.to_excel('TEST.xlsx')
----> 8 print(pandas.read_excel('TEST.xlsx',header=[0,1], index_col=0))

c:\programdata\anaconda3\lib\site-packages\pandas\io\excel.py in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, dtype, true_values, false_values, engine, squeeze, **kwds)
    207         skip_footer=skip_footer, converters=converters, dtype=dtype,
    208         true_values=true_values, false_values=false_values, squeeze=squeeze,
--> 209         **kwds)
    210 
    211 

c:\programdata\anaconda3\lib\site-packages\pandas\io\excel.py in _parse_excel(self, sheetname, header, skiprows, names, skip_footer, index_col, has_index_names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, true_values, false_values, verbose, dtype, squeeze, **kwds)
    515                 if not squeeze or isinstance(output[asheetname], DataFrame):
    516                     output[asheetname].columns = output[
--> 517                         asheetname].columns.set_names(header_names)
    518             except EmptyDataError:
    519                 # No Data, return an empty DataFrame

c:\programdata\anaconda3\lib\site-packages\pandas\core\indexes\base.py in set_names(self, names, level, inplace)
   1119         else:
   1120             idx = self._shallow_copy()
-> 1121         idx._set_names(names, level=level)
   1122         if not inplace:
   1123             return idx

c:\programdata\anaconda3\lib\site-packages\pandas\core\indexes\base.py in _set_names(self, values, level)
   1059         if len(values) != 1:
   1060             raise ValueError('Length of new names must be 1, got %d' %
-> 1061                              len(values))
   1062         self.name = values[0]
   1063 

ValueError: Length of new names must be 1, got 2

Note that read_csv will read the csv (but mangles the column index). read_excel fails without clearly indicating the nature of the error.

Expected Output

read_excel should approximately match what read_csv does. If not, it would be a lot easier to diagnose the error if the error message indicated that the problem was a duplicate column and, ideally, which columns is the cause.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.1.final.0 python-bits: 64 OS: Windows OS-release: 2012ServerR2 machine: AMD64 processor: Intel64 Family 6 Model 58 Stepping 0, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None pandas: 0.20.1 pytest: 3.0.7 pip: 9.0.1 setuptools: 27.2.0 Cython: 0.25.2 numpy: 1.12.1 scipy: 0.19.0 xarray: 0.9.6.dev IPython: 5.3.0 sphinx: 1.5.6 patsy: 0.4.1 dateutil: 2.6.0 pytz: 2017.2 blosc: None bottleneck: 1.2.1 tables: 3.2.2 numexpr: 2.6.2 feather: None matplotlib: 2.0.2 openpyxl: 2.4.7 xlrd: 1.0.0 xlwt: 1.2.0 xlsxwriter: 0.9.6 lxml: 3.7.3 bs4: 4.6.0 html5lib: 0.999 sqlalchemy: 1.1.9 pymysql: None psycopg2: None jinja2: 2.9.6 s3fs: None pandas_gbq: None pandas_datareader: None
chris-b1 commented 6 years ago

Sure, I think this could be made to work similar to CSV (though in general life is easier without duplicate columns). PR to fix would be welcome!

spillz commented 6 years ago

If I was in a position to submit PRs, I would. But as I am not, I thought a bug report would be better then nothing.

I am fine with the duplicate cols being treated as an error but keep in mind that means you can no longer open arbitrarily named datasets. Also, the read_csv behavior really isn't ideal either. The multiindex becomes a regular index and there are no warnings when duplicates are found and cols renamed.

Anyway the main reason I reported this as a bug is that it took me half an hour to figure out that the error was being caused by duplicate columns. The message is very obscure!

On Jan 25, 2018 1:21 PM, "chris-b1" notifications@github.com wrote:

Sure, I think this could be made to work similar to CSV (though in general life is easier without duplicate columns). PR to fix would be welcome!

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/19395#issuecomment-360554365, or mute the thread https://github.com/notifications/unsubscribe-auth/AAFd5c5WfoQyY2_9Dcagms0pWjDIfT-cks5tOMY2gaJpZM4RtPKI .

chris-b1 commented 6 years ago

Appreciate the report either way, would also take a PR for an improved error msg which would be easier

jrhaneydata commented 5 years ago

I just upgrade from Pandas 0.22 to 0.24, and my code is breaking. What's happening in my code appears related to this issue.

I have a worksheet in Excel with three tables in it. Those three tables have the same columns.

In version 0.22.0 of pandas, I could read all three of those tables into python with three read_excel statements, using the usecols argument to specify which part of the spreadsheet I wanted to read in.

In version 0.24.1 of pandas, the presence of duplicate column names in a part of the spreadsheet that I'm not reading into Python forces the column names in the imported dataframe to have ".1" suffixes appended to them.

I can adjust my code to rename the columns after import, of course, but that's ugly. Was this an intentional change to the read_excel function that hasn't been documented? Or is this a bug?