pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.82k stars 17.99k forks source link

read_excel throws ValueError: cannot specify usecols when specifying a multi-index header #25449

Open matthiastns opened 5 years ago

matthiastns commented 5 years ago
df=pd.read_excel(xl_file, sheet_name=xl_sheet,header=[18,19], usecols='B,D:J,L')

Problem description

df=pd.read_excel(xl_file, sheet_name=xl_sheet,header=[18,19], usecols='B,D:J,L')

throws the following error after conda update:

ValueError: cannot specify usecols when specifying a multi-index header

The script run perfectly fine before.

Expected Output

Dataframe with multi-index columns, as before.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.7.1.final.0 python-bits: 64 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None pandas: 0.24.1 pytest: 3.3.2 pip: 19.0.3 setuptools: 40.8.0 Cython: 0.27.3 numpy: 1.15.4 scipy: None pyarrow: None xarray: None IPython: 7.3.0 sphinx: 1.6.6 patsy: 0.5.0 dateutil: 2.8.0 pytz: 2018.9 blosc: None bottleneck: 1.2.1 tables: None numexpr: None feather: None matplotlib: 2.1.2 openpyxl: 2.4.10 xlrd: 1.1.0 xlwt: 1.3.0 xlsxwriter: 1.0.2 lxml.etree: None bs4: 4.6.0 html5lib: 1.0.1 sqlalchemy: 1.2.1 pymysql: None psycopg2: None jinja2: 2.10 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None gcsfs: None
WillAyd commented 5 years ago

Which version was this working on previously? Also can you share the relevant data from the file you are reading?

matthiastns commented 5 years ago

Hi William.

Thanks for your quick feedback.

From conda list --revisions my previous version was 0.23.1 (conda-forge)

Attached you find one example of the files I read. They are several of them, all equal in structure. It works with none of them since the update.

This worked as expected before the update. I received a dataframe with multi-index columns as expected.

Matthias

Von: William Ayd [mailto:notifications@github.com] Gesendet: Dienstag, 26. Februar 2019 17:16 An: pandas-dev/pandas pandas@noreply.github.com Cc: Hufnagel, Matthias (TSHAM) matthias.hufnagel@kantar.com; Author author@noreply.github.com Betreff: Re: [pandas-dev/pandas] read_excel throws ValueError: cannot specify usecols when specifying a multi-index header (#25449)

Which version was this working on previously? Also can you share the relevant data from the file you are reading?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_pandas-2Ddev_pandas_issues_25449-23issuecomment-2D467503309&d=DwMCaQ&c=zdK58V2JKULZdB8nuBRpog&r=viu6wFhKUAAQyfCRfccCyIRRc8exHnCfucKtObjpDexOlfGP1yLhpyPweI2b5iGT&m=RZUnM6_s4ZHW66-t95Ga16y9iX9591ZVsvQhnJfceFE&s=-Jipmzfk8h5yeoPoyqacrxkZAQnYHcqgq91GtQyH88Y&e=, or mute the threadhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AMDlZGRWsflorXB07PihBxo-2Dbnu-2DjLwfks5vRV2-2DgaJpZM4bSbs2&d=DwMCaQ&c=zdK58V2JKULZdB8nuBRpog&r=viu6wFhKUAAQyfCRfccCyIRRc8exHnCfucKtObjpDexOlfGP1yLhpyPweI2b5iGT&m=RZUnM6_s4ZHW66-t95Ga16y9iX9591ZVsvQhnJfceFE&s=n2UslOArWjA8sk2Qvaks4VqHWLalOeo_jHY0KTTCPdI&e=.

Kantar Disclaimerhttp://www.kantar.com/disclaimer.html

matthiastns commented 5 years ago

Hi William.

As a follow up to your question about my installed version, I learned how to get back to a previous version of my conda environment. Thanks for that in the first place.

Just to let you know:

I went back to my environment that included pandas 0.23.1 by running

C:\Users\hm113045\AppData\Local\Continuum\Anaconda3\Scripts

and read_excel works perfectly with multi-index header and usecols again.

Matthias

Von: William Ayd [mailto:notifications@github.com] Gesendet: Dienstag, 26. Februar 2019 17:16 An: pandas-dev/pandas pandas@noreply.github.com Cc: Hufnagel, Matthias (TSHAM) matthias.hufnagel@kantar.com; Author author@noreply.github.com Betreff: Re: [pandas-dev/pandas] read_excel throws ValueError: cannot specify usecols when specifying a multi-index header (#25449)

Which version was this working on previously? Also can you share the relevant data from the file you are reading?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_pandas-2Ddev_pandas_issues_25449-23issuecomment-2D467503309&d=DwMCaQ&c=zdK58V2JKULZdB8nuBRpog&r=viu6wFhKUAAQyfCRfccCyIRRc8exHnCfucKtObjpDexOlfGP1yLhpyPweI2b5iGT&m=RZUnM6_s4ZHW66-t95Ga16y9iX9591ZVsvQhnJfceFE&s=-Jipmzfk8h5yeoPoyqacrxkZAQnYHcqgq91GtQyH88Y&e=, or mute the threadhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AMDlZGRWsflorXB07PihBxo-2Dbnu-2DjLwfks5vRV2-2DgaJpZM4bSbs2&d=DwMCaQ&c=zdK58V2JKULZdB8nuBRpog&r=viu6wFhKUAAQyfCRfccCyIRRc8exHnCfucKtObjpDexOlfGP1yLhpyPweI2b5iGT&m=RZUnM6_s4ZHW66-t95Ga16y9iX9591ZVsvQhnJfceFE&s=n2UslOArWjA8sk2Qvaks4VqHWLalOeo_jHY0KTTCPdI&e=.

Kantar Disclaimerhttp://www.kantar.com/disclaimer.html

matthiastns commented 5 years ago

Just saw that I had something wrong in my clipboard, Sorry

I was running

conda install --revision N

Matthias

Von: Hufnagel, Matthias (TSHAM) Gesendet: Mittwoch, 27. Februar 2019 10:08 An: 'pandas-dev/pandas' reply@reply.github.com Betreff: AW: [pandas-dev/pandas] read_excel throws ValueError: cannot specify usecols when specifying a multi-index header (#25449)

Hi William.

As a follow up to your question about my installed version, I learned how to get back to a previous version of my conda environment. Thanks for that in the first place.

Just to let you know:

I went back to my environment that included pandas 0.23.1 by running

C:\Users\hm113045\AppData\Local\Continuum\Anaconda3\Scripts

and read_excel works perfectly with multi-index header and usecols again.

Matthias

Von: William Ayd [mailto:notifications@github.com] Gesendet: Dienstag, 26. Februar 2019 17:16 An: pandas-dev/pandas pandas@noreply.github.com<mailto:pandas@noreply.github.com> Cc: Hufnagel, Matthias (TSHAM) matthias.hufnagel@kantar.com<mailto:matthias.hufnagel@kantar.com>; Author author@noreply.github.com<mailto:author@noreply.github.com> Betreff: Re: [pandas-dev/pandas] read_excel throws ValueError: cannot specify usecols when specifying a multi-index header (#25449)

Which version was this working on previously? Also can you share the relevant data from the file you are reading?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_pandas-2Ddev_pandas_issues_25449-23issuecomment-2D467503309&d=DwMCaQ&c=zdK58V2JKULZdB8nuBRpog&r=viu6wFhKUAAQyfCRfccCyIRRc8exHnCfucKtObjpDexOlfGP1yLhpyPweI2b5iGT&m=RZUnM6_s4ZHW66-t95Ga16y9iX9591ZVsvQhnJfceFE&s=-Jipmzfk8h5yeoPoyqacrxkZAQnYHcqgq91GtQyH88Y&e=, or mute the threadhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AMDlZGRWsflorXB07PihBxo-2Dbnu-2DjLwfks5vRV2-2DgaJpZM4bSbs2&d=DwMCaQ&c=zdK58V2JKULZdB8nuBRpog&r=viu6wFhKUAAQyfCRfccCyIRRc8exHnCfucKtObjpDexOlfGP1yLhpyPweI2b5iGT&m=RZUnM6_s4ZHW66-t95Ga16y9iX9591ZVsvQhnJfceFE&s=n2UslOArWjA8sk2Qvaks4VqHWLalOeo_jHY0KTTCPdI&e=.

Kantar Disclaimerhttp://www.kantar.com/disclaimer.html

WillAyd commented 5 years ago

I don't think your attachment came through. Can you add to a comment directly instead?

matthiastns commented 5 years ago

BugReportSample.xlsx

matthiastns commented 5 years ago

I did close the issue by mistake, sorry.

matthiastns commented 5 years ago

From the excel files, I do not read the first worksheet, only the worksheets with min/max prefix that are equal in structure .

WillAyd commented 5 years ago

Can you add code that is completely copy / pastable to reproduce?

matthiastns commented 5 years ago

Sorry, William.

Here is the code that should produce the error message if the Excel file I uploaded earlier resides in the same folder. I could not validate if this really throws the reported error at the moment, as I downgraded my environment back to pandas 0.23.1 where I had (and still have) no issues.

import pandas as pd df=pd.read_excel('BugReportSample.xlsx', sheet_name='max_brand',header=[18,19], usecols='B,D:J,L')

WillAyd commented 5 years ago

Strange - the code in the traceback hasn't been touched in 6 years. I also created a 0.23.1 environment locally and got the same error.

Can you post the output of pd.show_versions()?

Doc-Scott commented 5 years ago

I just got this same error when I went from 23.4 to 24.0, reverting fixed it.

here is my code & error message

def load_data(i_file_name=file_1):
    """reads i_file excel spreadsheet into df"""
    data = pd.read_excel((DATA_FOLDER + i_file_name), 
                     sheet_name="historical data by country", 
                     header=[15, 16], 
                     skip_rows=17,
                     usecols="G:CO",  #ValueError: cannot specify usecols when specifying a multi-index header
                     index_col = None, #line15
                    )
    return data

>>>load_data()

File "cell_impact_DF.py", line 121, in <module>
    data = load_data()
  File "cell_impact_DF.py", line 15, in load_data 
    index_col = None,
  File "...lib\site-packages\pandas\util\_decorators.py", line 188, in wrapper
    return func(*args, **kwargs)
  File "...lib\site-packages\pandas\util\_decorators.py", line 188, in wrapper
    return func(*args, **kwargs)
  File "...lib\site-packages\pandas\io\excel.py", line 375, in read_excel
    **kwds)
  File "...lib\site-packages\pandas\io\excel.py", line 718, in parse
    **kwds)
  File "...lib\site-packages\pandas\io\excel.py", line 601, in parse
    **kwds)
  File "...lib\site-packages\pandas\io\parsers.py", line 2156, in TextParser
    return TextFileReader(*args, **kwds)
  File "...lib\site-packages\pandas\io\parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "...lib\site-packages\pandas\io\parsers.py", line 1132, in _make_engine
    self._engine = klass(self.f, **self.options)
  File "...lib\site-packages\pandas\io\parsers.py", line 2171, in __init__
    ParserBase.__init__(self, kwds)
  File "...lib\site-packages\pandas\io\parsers.py", line 1391, in __init__
    raise ValueError("cannot specify usecols when "
ValueError: cannot specify usecols when specifying a multi-index header
matthiastns commented 5 years ago

Hi William.

Thanks for taking care of this.

I posted the show_versions() output in the first place, I attach it to the mail below.

Just to clarify (as you wrote that you have the same error using pandas 0.23.1)

I see the reported error only when I use the new 0.24.1 pandas version

After I went back to 0.23.1 my code runs fine.

Matthias

Output of show_versions (taken from my posting in github as my current installation is different now, as I explained) INSTALLED VERSIONS commit: None python: 3.7.1.final.0 python-bits: 64 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None pandas: 0.24.1 pytest: 3.3.2 pip: 19.0.3 setuptools: 40.8.0 Cython: 0.27.3 numpy: 1.15.4 scipy: None pyarrow: None xarray: None IPython: 7.3.0 sphinx: 1.6.6 patsy: 0.5.0 dateutil: 2.8.0 pytz: 2018.9 blosc: None bottleneck: 1.2.1 tables: None numexpr: None feather: None matplotlib: 2.1.2 openpyxl: 2.4.10 xlrd: 1.1.0 xlwt: 1.3.0 xlsxwriter: 1.0.2 lxml.etree: None bs4: 4.6.0 html5lib: 1.0.1 sqlalchemy: 1.2.1 pymysql: None psycopg2: None jinja2: 2.10 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None gcsfs: None

Von: William Ayd [mailto:notifications@github.com] Gesendet: Freitag, 1. März 2019 18:39 An: pandas-dev/pandas pandas@noreply.github.com Cc: Hufnagel, Matthias (TSHAM) matthias.hufnagel@kantar.com; State change state_change@noreply.github.com Betreff: Re: [pandas-dev/pandas] read_excel throws ValueError: cannot specify usecols when specifying a multi-index header (#25449)

Strange - the code in the traceback hasn't been touched in 6 years. I also created a 0.23.1 environment locally and got the same error.

Can you post the output of pd.show_versions()?

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_pandas-2Ddev_pandas_issues_25449-23issuecomment-2D468747669&d=DwMCaQ&c=zdK58V2JKULZdB8nuBRpog&r=viu6wFhKUAAQyfCRfccCyIRRc8exHnCfucKtObjpDexOlfGP1yLhpyPweI2b5iGT&m=iX21_SrtmkrwWrb5pjWeDC0AL1m3X_IzVMJ6mDMDCDo&s=3gYCW5CPIdziZqzyrWCWntaYbLs7_UD4RLnSH4H5v_k&e=, or mute the threadhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AMDlZM-5FJydRX70vNHmbHXTDy7iS3lR5-5Fks5vSWXIgaJpZM4bSbs2&d=DwMCaQ&c=zdK58V2JKULZdB8nuBRpog&r=viu6wFhKUAAQyfCRfccCyIRRc8exHnCfucKtObjpDexOlfGP1yLhpyPweI2b5iGT&m=iX21_SrtmkrwWrb5pjWeDC0AL1m3X_IzVMJ6mDMDCDo&s=9psaZU_NrC1suKUyUWQ0JZ1ZsxMRdfCz4R5PfGmxA7I&e=.

Kantar Disclaimerhttp://www.kantar.com/disclaimer.html

matthiastns commented 5 years ago

Hi William.

Is there any update on this? I was really happy and thankful that you did take care of this. This strange behaviour is keeping me from updateing my environment now. Would be great to know if there is a chance that this might get fixed in a future release. Any feedback highly appreciated. Again, thanks for taking care in the first place.

Kind regards

Matthias

jreback commented 5 years ago

@matthiastns we have almost 3000 issues and are all a small group of volunteers - in order to have things fixed a PR would advance the cause

matthiastns commented 5 years ago

Thanks a lot Jeff. Highly appreciated. I'm not a native speaker and I hope I was able to express how much I appreciate that all of you take care of this. At least I tried to do that in my last post. Sorry if it nevertheless came over as too much demanding or even offending.

Unfortunately I'm not quite sure what a PR is. I would initiate(?) one, if necessary. Something like an official request? Not sure either, if this is appropriate in this case or if this will be received like calling for emergency for a minor issue. It's hard to guess from outside.

jreback commented 5 years ago

@matthiastns no, it would be a patch for your issue, see here: http://pandas.pydata.org/pandas-docs/stable/development/contributing.html

this is something you or others in the community would need to do.

matthiastns commented 5 years ago

@jreback Thanks for your immidiate feedback. So, from going through the linked documentation, I guess "PR" stands for "pull request".

This is a real challange for me. I have some experience with git but I'm not an software developper in the first place. So I have no experience with common development techniques, methods and ways of working.

But I will think about it. Could be also rewarding to dive into that. My main concern is, that finding the cause of my particular issue might be really tricky. Why? As William stated earlier, the code of the function that causes the error has not been touched for 6 years. If this is true, we might have a side effect from somewhere else. My feeling is that diving into this will be a rough task to solve for a code development newbie with no experience in the pandas code base at all.

gmavrom commented 4 years ago

Has there been any update on this issue? @matthiastns have you managed to resolve this?

Any help will be greatly appreciated!

kumfupanda commented 4 years ago

I also got the same issue

akasolace commented 3 years ago

this bug still exist on version 1.2.4 ...

jreback commented 3 years ago

@akasolace and you or anyone in the community is welcome to submit a patch

andrewchancgc commented 3 years ago

Currently August 8th 2021 and I incurred the same issue with conda 4.7.11 and pyton 3.7.3

javierbg commented 2 years ago

I've been working on this issue, since I have encountered the same problem.

I have tested with the data provided in this thread as well as the data I'm using (which, sadly, I cannot share) and it looks like just deleting the check of usecols+multi-index in the parser is enough for it to work.

I've been trying to pinpoint where usecols is ultimately used to check if there is really an incompatibility with a multi-index but as far as I have been able to check, there is none. When you call read_excel, the usecols parameters is passed along and finally used by PythonParser._infer_columns(), which does the correct thing: it reads the multi-index considering only the columns specified by usecols and then constructs the levels.

The code that originally introduced the parameter incompatibility check is from 9 years ago (commit a9a89f89e13cf006f6b58da1747aa65f86f74cfb, which states "TST: disallow names, usecols, non-numeric in index_cols"), so maybe it is not a problem anymore? I'm not comfortable just deleting this check without being absolutely sure that I'm not breaking anything, but I'm afraid that I'm not familiarized enough with the pandas internals to be sure (and I'm sure that multi-index must be a horrible beast).

If anyone more knowledgeable than me can confirm that the check is not needed anymore I can go ahead and submit the PR, along with a test to ensure it works correctly.

jacobshaw42 commented 8 months ago

Not sure if anyone still cares about this, but I have also noticed that simply removing the check allows the code to read my file as expected.