pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.09k stars 17.74k forks source link

Some problems with list in read_excel() #15133

Open artsiomkaltovich opened 7 years ago

artsiomkaltovich commented 7 years ago

Code Sample, a copy-pastable example if possible

train = read_excel(
    'Wilson_Schistosome Esophagus_differentially expressed transcripts.xlsx',
    sheetname=3, header=[0, 1], skip_footer=11)

OK

train = read_excel(
    'Wilson_Schistosome Esophagus_differentially expressed transcripts.xlsx',
    sheetname=3, header=[0, 1], skip_footer=11,  names=[0] + list(range(2, 12)))

ValueError: Length of new names must be 1, got 2

train = read_excel('Wilson_Schistosome Esophagus_differentially expressed transcripts.xlsx',
                   sheetname=2, header=[0, 1],  skip_footer=6)

ValueError: Length of new names must be 1, got 2

train = read_excel('Wilson_Schistosome Esophagus_differentially expressed transcripts.xlsx',
                   sheetname=[1, 3], header=[0, 1],  skip_footer=3)

ValueError: Length of new names must be 1, got 2

Problem description

I am trying to read excel file ( https://github.com/ArtyomKaltovich/ikantam/blob/master/lab1/Wilson_Schistosome%20Esophagus_differentially%20expressed%20transcripts.xlsx ). But sometimes pandas does not accept list as parameters for read_excel.

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.5.2.final.0 python-bits: 64 OS: Linux OS-release: 4.4.0-21-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 pandas: 0.19.2 nose: None pip: 9.0.1 setuptools: 32.3.1 Cython: None numpy: 1.11.3 scipy: 0.18.1 statsmodels: None xarray: None IPython: None sphinx: None patsy: None dateutil: 2.6.0 pytz: 2016.10 blosc: None bottleneck: None tables: None numexpr: None matplotlib: 1.5.3 openpyxl: None xlrd: 1.0.0 xlwt: None xlsxwriter: None lxml: None bs4: None html5lib: 0.999999999 httplib2: 0.9.2 apiclient: None sqlalchemy: None pymysql: None psycopg2: None jinja2: None boto: None pandas_datareader: None
TomAugspurger commented 7 years ago

Sorry, I don't entirely understand the issue. Could you include the tracebacks so we can see what's happening? Can you narrow down what the problem is?

artsiomkaltovich commented 7 years ago

train = read_excel( ... 'Wilson_Schistosome Esophagus_differentially expressed transcripts.xlsx', ... sheetname=3, header=[0, 1], skip_footer=11) OK

train = read_excel( 'Wilson_Schistosome Esophagus_differentially expressed transcripts.xlsx', sheetname=3, header=[0, 1], skip_footer=11, names=[0] + list(range(2, 12))) Traceback (most recent call last): File "", line 3, in File "/home/my/.local/lib/python3.5/site-packages/pandas/io/excel.py", line 200, in read_excel **kwds) File "/home/my/.local/lib/python3.5/site-packages/pandas/io/excel.py", line 511, in _parse_excel asheetname].columns.set_names(header_names) File "/home/my/.local/lib/python3.5/site-packages/pandas/indexes/base.py", line 1010, in set_names idx._set_names(names, level=level) File "/home/my/.local/lib/python3.5/site-packages/pandas/indexes/base.py", line 950, in _set_names len(values)) ValueError: Length of new names must be 1, got 2

train = read_excel('Wilson_Schistosome Esophagus_differentially expressed transcripts.xlsx', ... sheetname=2, header=[0, 1], skip_footer=6) Traceback (most recent call last): File "", line 2, in File "/home/my/.local/lib/python3.5/site-packages/pandas/io/excel.py", line 200, in read_excel **kwds) File "/home/my/.local/lib/python3.5/site-packages/pandas/io/excel.py", line 511, in _parse_excel asheetname].columns.set_names(header_names) File "/home/my/.local/lib/python3.5/site-packages/pandas/indexes/base.py", line 1010, in set_names idx._set_names(names, level=level) File "/home/my/.local/lib/python3.5/site-packages/pandas/indexes/base.py", line 950, in _set_names len(values)) ValueError: Length of new names must be 1, got 2

artsiomkaltovich commented 7 years ago

As documentation says I can use list as argument in read_excel(). Am I right? Or are there some restriction?

artsiomkaltovich commented 7 years ago

Hello. Do tracebacks help understand problem? As i understand sometimes MultiIndex does not create, but why?

TomAugspurger commented 7 years ago

IIRC, to pass sheetname=[list of sheets] requires that each sheet have the same structure, since the header, skipfooter, etc. are applied to each sheet. Is that true of your data?

An easy way to test is to see if

train = read_excel(
    'Wilson_Schistosome Esophagus_differentially expressed transcripts.xlsx',
    sheetname=[3], header=[0, 1], skip_footer=11)

works, since sheetname is a list with one element.

dorvak commented 7 years ago

I got this error as well, this seems to be due to some changes in version 0.19., because the code works in version 0.18.0

jeanbaptisteb commented 6 years ago

I got this error too, no matter how I set the header and sheetname parameters. I also tried on a one-sheet file, the same error is raised.

billtubbs commented 5 years ago

I have this error too. Here is a demonstration of the error with a simple spreadsheet. The error only occurs when there is an empty column to the left of the first index col.

Test 1 (fails):

File 'excel_test1.xlsx' looks like this.

input_data_filename = 'excel_test1.xlsx'
df = pd.read_excel(
    os.path.join(data_dir, input_data_filename), 
    header=[3, 4],
    index_col=[1, 2]
)

Traceback:

-----------------------------------------------------------------------
ValueError                            Traceback (most recent call last)
<ipython-input-56-a7da75b33a63> in <module>()
      3     os.path.join(data_dir, input_data_filename),
      4     header=[3, 4],
----> 5     index_col=[1, 2]
      6 )

/Users/billtubbs/anaconda/lib/python2.7/site-packages/pandas/util/_decorators.pyc in wrapper(*args, **kwargs)
    176                 else:
    177                     kwargs[new_arg_name] = new_arg_value
--> 178             return func(*args, **kwargs)
    179         return wrapper
    180     return _deprecate_kwarg

/Users/billtubbs/anaconda/lib/python2.7/site-packages/pandas/util/_decorators.pyc in wrapper(*args, **kwargs)
    176                 else:
    177                     kwargs[new_arg_name] = new_arg_value
--> 178             return func(*args, **kwargs)
    179         return wrapper
    180     return _deprecate_kwarg

/Users/billtubbs/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in read_excel(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, **kwds)
    327         skipfooter=skipfooter,
    328         convert_float=convert_float,
--> 329         **kwds)
    330 
    331 

/Users/billtubbs/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in parse(self, sheet_name, header, names, index_col, usecols, squeeze, converters, true_values, false_values, skiprows, nrows, na_values, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, **kwds)
    454                                  skipfooter=skipfooter,
    455                                  convert_float=convert_float,
--> 456                                  **kwds)
    457 
    458     def _should_parse(self, i, usecols):

/Users/billtubbs/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in _parse_excel(self, sheet_name, header, names, index_col, usecols, squeeze, dtype, true_values, false_values, skiprows, nrows, na_values, verbose, parse_dates, date_parser, thousands, comment, skipfooter, convert_float, **kwds)
    690                 if not squeeze or isinstance(output[asheetname], DataFrame):
    691                     output[asheetname].columns = output[
--> 692                         asheetname].columns.set_names(header_names)
    693             except EmptyDataError:
    694                 # No Data, return an empty DataFrame

/Users/billtubbs/anaconda/lib/python2.7/site-packages/pandas/core/indexes/base.pyc in set_names(self, names, level, inplace)
   1419         else:
   1420             idx = self._shallow_copy()
-> 1421         idx._set_names(names, level=level)
   1422         if not inplace:
   1423             return idx

/Users/billtubbs/anaconda/lib/python2.7/site-packages/pandas/core/indexes/base.pyc in _set_names(self, values, level)
   1350         if len(values) != 1:
   1351             raise ValueError('Length of new names must be 1, got %d' %
-> 1352                              len(values))
   1353 
   1354         # GH 20527

ValueError: Length of new names must be 1, got 2

Test 2 (works):

File 'excel_test2.xlsx' looks like this.

input_data_filename = 'excel_test2.xlsx'
df = pd.read_excel(
    os.path.join(data_dir, input_data_filename), 
    header=[3, 4],
    index_col=[0, 1]
)

Output of test 2 is here.

Would be happy to see a solution to this or advice on how to fix!

mroeschke commented 3 years ago

Unfortunately it appears that this issue no longer has a excel file available for testing this bug. (Would be great if a reproducible example could be generated using pandas itself!). Going to close for now, but happy to reopen once we have a reproducible example

billtubbs commented 3 years ago

I uploaded a couple of Excel files called excel_test1.xls and excel_test2.xls (I used .xls files this time instead of .xlsx because xlrd has removed support for .xlsx files) to my GitHub here* and the test scripts (test-read-excel-1.py and test-read-excel-2.py).

I can still reproduce this error with

* Note: the folder in the GitHub repo linked above should not be named 'pandas' rename it locally to something else.

skannan-maf commented 1 year ago

I have as similar issue in 1.1.5 pandas version. But 1.5.3 works fine. I dont pass "names" parameter though

eyaler commented 5 months ago

i think the issue happened in the case the first row is missing entries for some columns. in that case using the default index_col=None would get confused. you can overcome this by setting index_col=0 and later retrieving the index as your first column