Open musshorn opened 1 year ago
I reproduced the issue. It seems like the headers or labels are being propagated across multiple rows before they are phased out with skiprows. Honestly the best solution would be to implement the skiprows functionality at the highest level possible in the process, but that might require some refactoring. So instead what needs to happen is the skiprows offset needs to be taken into account when assigning headers. I think the issue might be present around here: https://github.com/pandas-dev/pandas/blob/2e218d10984e9919f0296931d92ea851c6a6faf5/pandas/io/excel/_base.py#L794-L833
I'll keep doing some digging on this and let you know if I find a fix.
Hi,
I did some debugging and analysis on this issue. Below are my findings :
Issue 1: The argument passed in the sample code provided by @musshorn to parse the excel is incorrect as the header argument is spelled as "headers".
Due to this, the skip rows are non-functional in the above case.
However, if I read the same data using the correct parameters. This data looks meaningful.
Issue 2: Even if the header is not provided to the input, it should allow us to skip rows for reading the data frame.
So, for that, we may need to apply some fixes.
@axeleschholz @rhshadrach - Let me know your thoughts on this. Also, can I work on this issue?
If I remove the first 8 rows of the excel file and then run
xl = pd.ExcelFile("TestPandas.xlsx")
df = xl.parse("Sheet1", header=[0, 1, 2], index_col=[0, 1])
print(df)
I get
Unnamed: 2_level_0 WEST COAST
Unnamed: 2_level_1 Merchant 1 Merchant 2
Sale# Sale# Comission Price Comission Price
Fridges 1 NaN 82 76 64 89
2 NaN 66 98 18 54
3 NaN 6 48 48 81
4 NaN 13 56 53 33
5 NaN 60 69 7 48
This appears the same as the output in https://github.com/pandas-dev/pandas/issues/50953#issuecomment-1407468586, except for the numbers (the numbers here agree with those in the OP file, so maybe they got changed?). This makes me think skiprows
is not an issue here, but I'm not certain.
In any case, passing the invalid argument header
should raise instead of being ignored.
@pacificdragon
Also, can I work on this issue?
Sure! Just comment /take
on the issue and it should be assigned to you.
/take
Issue 1 : The reason why this is being ignored is that the call to parse function is having keyword arguments (**kwds) which allows passing the headers
parameter.
The sample code used to reproduce the bug directly uses the parse function to read the excel.
xl = pd.ExcelFile("TestPandas.xlsx")
df = xl.parse("Sheet1", skiprows=8, headers=[0, 1, 2], index_col=[0, 1])
However, the recommended way to perform the same action is :
pd.read_excel("TestPandas.xlsx",sheet_name='Sheet1',header=[0, 1, 2], index_col=[0, 1],skiprows=8)
The read_excel
function handles all the unexpected arguments. But still, if want to raise an error in case parse function is called then there are these two possible solutions i can think of :
But, I think the real issue in this bug is that we are not able to skip rows if a header is not provided.
Sample Code to reproduce this :
df1=pd.read_excel("TestPandas.xlsx",sheet_name='Sheet1', index_col=[0, 1],skiprows=8)
print(df1)
df2=pd.read_excel("TestPandas.xlsx",sheet_name='Sheet1', index_col=[0, 1],skiprows=8,header=[0,1,2])
print(df2)
After running the above code if we check the data frame, the NORTH AMERICA
and Test text 1
strings are visible in df1
which should not happen if skiprows
is provided.
But the reason it's happening is due to the implementation of forward-filling values if the MultiIndex index is provided.
@rhshadrach Let me know if this makes sense. Which one will take higher priority in this case index_cols or skiprows.
take
Sorry @pacificdragon - there is no / in the command. I've assigned the issue to you (but in the future, just take
will work).
if want to raise an error in case parse function is called then there are these two possible solutions i can think of :
What happens to the headers
entry in kwargs
when it's passed? Is it not used anywhere?
But, I think the real issue in this bug is that we are not able to skip rows if a header is not provided.
No disagreement that there may be an issue there, but if that is the case then I'm confused as to why in https://github.com/pandas-dev/pandas/issues/50953#issuecomment-1407656085 I get the same result when removing the top eight rows and the skiprows
argument.
What happens to the
headers
entry inkwargs
when it's passed? Is it not used anywhere?
No its not used anywhere, the kwargs are passed from parse function to TextParser which further are passed to TextFileReader in io/parsers/readers.py
No disagreement that there may be an issue there, but if that is the case then I'm confused as to why in https://github.com/pandas-dev/pandas/issues/50953#issuecomment-1407656085 I get the same result when removing the top eight rows and the skiprows argument.
Yes, you are correct if we compare that is same because issue happens only when header
is not provided and skiprows
is provided. But in https://github.com/pandas-dev/pandas/issues/50953#issuecomment-1407656085 you are providing header
argument.
@rhshadrach - Let me know if its clear now.
I see - thanks. Agreed on this issue. Also passing invalid kwargs like headers
should raise.
Just to align on the changes. I did these tests again.
File : TestPandas.xlsx
Case 1:
df = xl.parse("Sheet1", skiprows=8)
Comments:
Case 2:
df = xl.parse("Sheet1", skiprows=8, header=[0, 1, 2])
df.columns
:
MultiIndex([('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'Unnamed: 0_level_2'), ('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'Sale#'), ('Unnamed: 2_level_0', 'Unnamed: 2_level_1', 'Sale#'), ( 'WEST COAST', 'Merchant 1', 'Comission'), ( 'WEST COAST', 'Merchant 1', 'Price'), ( 'WEST COAST', 'Merchant 2', 'Comission'), ( 'WEST COAST', 'Merchant 2', 'Price')], )
Comments:
Case 3:
df = xl.parse("Sheet1", skiprows=8, index_col=[0, 1])
df.index
MultiIndex([('NORTH AMERICA', 'Test text 1'), ('NORTH AMERICA', 'Sale#'), ( 'Fridges', 1), ( 'Fridges', 2), ( 'Fridges', 3), ( 'Fridges', 4), ( 'Fridges', 5)], names=['NORTH AMERICA', 'Test text 1'])
Comments :
@rhshadrach - Please suggest if my understanding is correct and if implementing skip rows at the top of the hierarchy will be a good option or not.
I see - thanks. Agreed on this issue. Also passing invalid kwargs like headers should raise.
Sure, creating a PR for this.
I'm not sure if I am 100% following, but there is definitely a problem in read_excel
with either the header or skiprows arguments or both in terms of row indexing, and how the indexing is done for index columns vs other columns. I suspect that the row indices in those arguments do NOT always map to the rows in the excel file, which in my opinion is HORRIBLY NON-INTUITIVE. For example, see https://github.com/pandas-dev/pandas/pull/58899#issuecomment-2294990252.
To motivate why one might want to do something like described in https://github.com/pandas-dev/pandas/pull/58899#issuecomment-2294990252, consider an excel sheet with multiple datasets having their own headers and also a shared set of headers:
Class, A, A, B, B
ID, 0, 1, 2, 3
, , , ,
X1, Y1, Y1, Y1, Y1
0, a, b, c, d
1, e, f, g, h
, , , ,
X2, Y2, Y2, Y2, Y2
0, i, j, k, l
1, m, n, o, p
Ideally, one should be able to load this data into two data frames, the first with headers Class, ID, and X1/Y1 and data rows including a-h, and the second with headers Class, ID, and X2/Y2 and data rows including i-p. This should intuitively be possible with the commands:
y1 = pd.read_excel(..., index_col=0, header=[0,1,3], skiprows = lambda row: row not in [0,1,3,4,5])
y2 = pd.read_excel(..., index_col=0, header=[0,1,7], skiprows = lambda row: row not in [0,1,7,8,9])
However, as described in https://github.com/pandas-dev/pandas/pull/58899#issuecomment-2294990252, this does NOT work.
I'd just note that everything works fine if the header rows are contiguous and are the first few rows of the file. So this seems to be an implicit assumption that is limiting.
One more observation. If the first data row (non-header row) has no values in any of the non-index columns, but a value in the index column, then that row is treated as the name of the index? I find this non-intuitive. Is this behavior described?
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Run the sample code on this excel sheet. Parsing is expected to begin from row 9 however in the index column data is pulled from rows that should've been skipped. TestPandas.xlsx
Expected Behavior
Parsing to begin from the first non-skipped row
Installed Versions