pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.78k stars 17.97k forks source link

BUG: xl.parse index_col ignoring skiprows #50953

Open musshorn opened 1 year ago

musshorn commented 1 year ago

Pandas version checks

Reproducible Example

import pandas as pd

xl = pd.ExcelFile("TestPandas.xlsx")

df = xl.parse("Sheet1", skiprows=8, headers=[0, 1, 2], index_col=[0, 1])
print(df)

Issue Description

Run the sample code on this excel sheet. Parsing is expected to begin from row 9 however in the index column data is pulled from rows that should've been skipped. TestPandas.xlsx

Expected Behavior

Parsing to begin from the first non-skipped row

Installed Versions

>>> pd.show_versions() INSTALLED VERSIONS ------------------ commit : 7cb7592523380133f552e258f272a5694e37957a python : 3.10.0.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19044 machine : AMD64 processor : Intel64 Family 6 Model 79 Stepping 1, GenuineIntel byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : English_Australia.1252 pandas : 2.0.0.dev0+1147.g7cb7592523 numpy : 1.22.4 pytz : 2022.2.1 dateutil : 2.8.2 setuptools : 57.4.0 pip : 22.3.1 Cython : 0.29.32 pytest : 7.2.0 hypothesis : None sphinx : 4.5.0 blosc : None feather : None xlsxwriter : 3.0.1 lxml.etree : 4.9.1 html5lib : None pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : 8.1.1 pandas_datareader: None bs4 : 4.10.0 bottleneck : None brotli : 1.0.9 fastparquet : None fsspec : None gcsfs : None matplotlib : 3.5.1 numba : None numexpr : 2.8.4 odfpy : None openpyxl : 3.0.10 pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : 1.8.0 snappy : None sqlalchemy : None tables : 3.7.0 tabulate : 0.8.9 xarray : None xlrd : None zstandard : 0.17.0 tzdata : None qtpy : 2.0.1 pyqt5 : None
axeleschholz commented 1 year ago

I reproduced the issue. It seems like the headers or labels are being propagated across multiple rows before they are phased out with skiprows. Honestly the best solution would be to implement the skiprows functionality at the highest level possible in the process, but that might require some refactoring. So instead what needs to happen is the skiprows offset needs to be taken into account when assigning headers. I think the issue might be present around here: https://github.com/pandas-dev/pandas/blob/2e218d10984e9919f0296931d92ea851c6a6faf5/pandas/io/excel/_base.py#L794-L833

I'll keep doing some digging on this and let you know if I find a fix.

pacificdragon commented 1 year ago

Hi,

I did some debugging and analysis on this issue. Below are my findings :

Issue 1: The argument passed in the sample code provided by @musshorn to parse the excel is incorrect as the header argument is spelled as "headers".

image

Due to this, the skip rows are non-functional in the above case.

image

However, if I read the same data using the correct parameters. This data looks meaningful.

image

Issue 2: Even if the header is not provided to the input, it should allow us to skip rows for reading the data frame.

So, for that, we may need to apply some fixes.

@axeleschholz @rhshadrach - Let me know your thoughts on this. Also, can I work on this issue?

rhshadrach commented 1 year ago

If I remove the first 8 rows of the excel file and then run

xl = pd.ExcelFile("TestPandas.xlsx")
df = xl.parse("Sheet1", header=[0, 1, 2], index_col=[0, 1])
print(df)

I get

          Unnamed: 2_level_0 WEST COAST                       
          Unnamed: 2_level_1 Merchant 1       Merchant 2      
Sale#                  Sale#  Comission Price  Comission Price
Fridges 1                NaN         82    76         64    89
        2                NaN         66    98         18    54
        3                NaN          6    48         48    81
        4                NaN         13    56         53    33
        5                NaN         60    69          7    48

This appears the same as the output in https://github.com/pandas-dev/pandas/issues/50953#issuecomment-1407468586, except for the numbers (the numbers here agree with those in the OP file, so maybe they got changed?). This makes me think skiprows is not an issue here, but I'm not certain.

In any case, passing the invalid argument header should raise instead of being ignored.

@pacificdragon

Also, can I work on this issue?

Sure! Just comment /take on the issue and it should be assigned to you.

pacificdragon commented 1 year ago

/take

pacificdragon commented 1 year ago

Issue 1 : The reason why this is being ignored is that the call to parse function is having keyword arguments (**kwds) which allows passing the headers parameter.

The sample code used to reproduce the bug directly uses the parse function to read the excel.

xl = pd.ExcelFile("TestPandas.xlsx")
df = xl.parse("Sheet1", skiprows=8, headers=[0, 1, 2], index_col=[0, 1])

However, the recommended way to perform the same action is :

pd.read_excel("TestPandas.xlsx",sheet_name='Sheet1',header=[0, 1, 2], index_col=[0, 1],skiprows=8)

The read_excel function handles all the unexpected arguments. But still, if want to raise an error in case parse function is called then there are these two possible solutions i can think of :

  1. Validate allowed arguments using provided list inside the parse function.
  2. Make the parse function internal to the class as _parse.

But, I think the real issue in this bug is that we are not able to skip rows if a header is not provided.

Sample Code to reproduce this :

df1=pd.read_excel("TestPandas.xlsx",sheet_name='Sheet1', index_col=[0, 1],skiprows=8)
print(df1)
df2=pd.read_excel("TestPandas.xlsx",sheet_name='Sheet1', index_col=[0, 1],skiprows=8,header=[0,1,2])
print(df2)

After running the above code if we check the data frame, the NORTH AMERICA and Test text 1 strings are visible in df1 which should not happen if skiprows is provided.

But the reason it's happening is due to the implementation of forward-filling values if the MultiIndex index is provided.

https://github.com/pandas-dev/pandas/blob/2e218d10984e9919f0296931d92ea851c6a6faf5/pandas/io/excel/_base.py#L835-L862

@rhshadrach Let me know if this makes sense. Which one will take higher priority in this case index_cols or skiprows.

rhshadrach commented 1 year ago

take

rhshadrach commented 1 year ago

Sorry @pacificdragon - there is no / in the command. I've assigned the issue to you (but in the future, just take will work).

if want to raise an error in case parse function is called then there are these two possible solutions i can think of :

What happens to the headers entry in kwargs when it's passed? Is it not used anywhere?

But, I think the real issue in this bug is that we are not able to skip rows if a header is not provided.

No disagreement that there may be an issue there, but if that is the case then I'm confused as to why in https://github.com/pandas-dev/pandas/issues/50953#issuecomment-1407656085 I get the same result when removing the top eight rows and the skiprows argument.

pacificdragon commented 1 year ago

What happens to the headers entry in kwargs when it's passed? Is it not used anywhere?

No its not used anywhere, the kwargs are passed from parse function to TextParser which further are passed to TextFileReader in io/parsers/readers.py

https://github.com/pandas-dev/pandas/blob/2e218d10984e9919f0296931d92ea851c6a6faf5/pandas/io/excel/_base.py#L866-L889

No disagreement that there may be an issue there, but if that is the case then I'm confused as to why in https://github.com/pandas-dev/pandas/issues/50953#issuecomment-1407656085 I get the same result when removing the top eight rows and the skiprows argument.

Yes, you are correct if we compare that is same because issue happens only when header is not provided and skiprows is provided. But in https://github.com/pandas-dev/pandas/issues/50953#issuecomment-1407656085 you are providing header argument.

@rhshadrach - Let me know if its clear now.

rhshadrach commented 1 year ago

I see - thanks. Agreed on this issue. Also passing invalid kwargs like headers should raise.

pacificdragon commented 1 year ago

Just to align on the changes. I did these tests again.

File : TestPandas.xlsx

Case 1:

df = xl.parse("Sheet1", skiprows=8)

Comments:

  1. skiprows working fine, data got skipped.
  2. Doesn't change source data (eg. no forward filling)

Case 2:

df = xl.parse("Sheet1", skiprows=8, header=[0, 1, 2])

df.columns :

MultiIndex([('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'Unnamed: 0_level_2'), ('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'Sale#'), ('Unnamed: 2_level_0', 'Unnamed: 2_level_1', 'Sale#'), ( 'WEST COAST', 'Merchant 1', 'Comission'), ( 'WEST COAST', 'Merchant 1', 'Price'), ( 'WEST COAST', 'Merchant 2', 'Comission'), ( 'WEST COAST', 'Merchant 2', 'Price')], )

Comments:

  1. Forward filling happened but only after skipping rows that is why we can see some 'Unnamed' indexes in starting.
  2. Skiprows taking the highest priority than 'header' (Looks good).

Case 3:

df = xl.parse("Sheet1", skiprows=8, index_col=[0, 1])

df.index

MultiIndex([('NORTH AMERICA', 'Test text 1'), ('NORTH AMERICA', 'Sale#'), ( 'Fridges', 1), ( 'Fridges', 2), ( 'Fridges', 3), ( 'Fridges', 4), ( 'Fridges', 5)], names=['NORTH AMERICA', 'Test text 1'])

Comments :

  1. Source Data Got Changed as index_col is specified. Forward Filling for Multiindex brought "Test text 1", "NORTH AMERICA" to row 9th.
  2. Skiprows were applied after forward filling happened.
    • This makes me think maybe it's correct if indexes can't be blank strings so bringing value from the top is needed.
    • But this shows index_col parameter is taking higher priority than skip rows.

@rhshadrach - Please suggest if my understanding is correct and if implementing skip rows at the top of the hierarchy will be a good option or not.

I see - thanks. Agreed on this issue. Also passing invalid kwargs like headers should raise.

Sure, creating a PR for this.

marcel-goldschen-ohm commented 2 months ago

I'm not sure if I am 100% following, but there is definitely a problem in read_excel with either the header or skiprows arguments or both in terms of row indexing, and how the indexing is done for index columns vs other columns. I suspect that the row indices in those arguments do NOT always map to the rows in the excel file, which in my opinion is HORRIBLY NON-INTUITIVE. For example, see https://github.com/pandas-dev/pandas/pull/58899#issuecomment-2294990252.

To motivate why one might want to do something like described in https://github.com/pandas-dev/pandas/pull/58899#issuecomment-2294990252, consider an excel sheet with multiple datasets having their own headers and also a shared set of headers:

Class, A, A, B, B
ID, 0, 1, 2, 3
 , , , , 
X1, Y1, Y1, Y1, Y1
0, a, b, c, d
1, e, f, g, h
 , , , , 
X2, Y2, Y2, Y2, Y2
0, i, j, k, l
1, m, n, o, p

Ideally, one should be able to load this data into two data frames, the first with headers Class, ID, and X1/Y1 and data rows including a-h, and the second with headers Class, ID, and X2/Y2 and data rows including i-p. This should intuitively be possible with the commands:

y1 = pd.read_excel(..., index_col=0, header=[0,1,3], skiprows = lambda row: row not in [0,1,3,4,5])
y2 = pd.read_excel(..., index_col=0, header=[0,1,7], skiprows = lambda row: row not in [0,1,7,8,9])

However, as described in https://github.com/pandas-dev/pandas/pull/58899#issuecomment-2294990252, this does NOT work.

marcel-goldschen-ohm commented 2 months ago

I'd just note that everything works fine if the header rows are contiguous and are the first few rows of the file. So this seems to be an implicit assumption that is limiting.

marcel-goldschen-ohm commented 2 months ago

One more observation. If the first data row (non-header row) has no values in any of the non-index columns, but a value in the index column, then that row is treated as the name of the index? I find this non-intuitive. Is this behavior described?