pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: read_excel leads to segfault #40321

Closed tharwan closed 3 years ago

tharwan commented 3 years ago


Code Sample, a copy-pastable example

from pathlib import Path
import pandas as pd

filename = 'file.xlsm'
template_df = pd.read_excel(filename, sheet_name="Muster", usecols="A:F", skiprows=1)
template_df.index = template_df["Beschreibung"]

path = 'file2.xlsx'
df = pd.read_excel(path, sheet_name="Januar", usecols="A:F", skiprows=1)
df.index = df["Beschreibung"]

template_df.loc['Fee EPEX ID Kundenhandel', "Short/Kauf/NEG"] = 0
template_df.update(df)
template_df.loc["Direktvermarktung Vertragspreis", "Long/Verkauf/POS"]

Problem description

First of all, I know that the example is not a proper minimal snippet. I also cannot provide the Excel files in question. However, I have no idea how to debug this issue any further.

If I run the script above with python -X faulthandler I get:

Current thread 0x00007f76b30a3280 (most recent call first):
  File ".../.venv/lib/python3.8/site-packages/pandas/core/frame.py", line 3133 in _get_value
  File ".../.venv/lib/python3.8/site-packages/pandas/core/indexing.py", line 888 in __getitem__
  File ".../debug.py", line 22 in <module>
fish: “python -X faulthandler .../de…” terminated by signal SIGSEGV (Address boundary error)
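For reference, the same handler that `-X faulthandler` turns on can also be enabled from inside the script through the standard-library `faulthandler` module. The following is only a minimal sketch: writing the traceback to a temporary log file (and the variable name `trace_log`) is an assumption made for illustration, not something done in the original run.

```python
import faulthandler
import tempfile

# Assumption for illustration: dump the traceback into a temporary log file
# instead of stderr, so it is still available after the process has died.
# The file object must stay open for as long as the handler is enabled.
trace_log = tempfile.NamedTemporaryFile(mode="w", suffix=".log", delete=False)

# Install handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL that
# write the Python traceback of all threads to the given file.
faulthandler.enable(file=trace_log, all_threads=True)

print("faulthandler enabled:", faulthandler.is_enabled())
print("traceback will be written to:", trace_log.name)
```

Placing this at the top of the script gives the same kind of "most recent call first" traceback as the command-line flag.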

From what I can deduce, there is something wrong with the template_df DataFrame. I was trying to generate an equivalent copy of it without copying the memory, so I ran it through

template_df = pd.DataFrame.from_dict(template_df.to_dict())

This does indeed seem to help. However, by accident I discovered that just doing

d = template_df.to_dict() 

also fixes it. As if to_dict() would somehow alter the DataFrame?

So I might be completely on the wrong track. Any help on how to drill down on this would be appreciated.

Expected Output

No segfault :-)

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : f2c8480af2f25efdbd803218b9d87980f416563e
python : 3.8.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-65-generic
Version : #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.3
numpy : 1.20.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 52.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.21.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : None
odfpy : None
openpyxl : 3.0.6
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.3.23
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None

mzeitlin11 commented 3 years ago

Thanks for the report @tharwan. My best guess upon a quick glance is that the issue here is related to the line df.index = df["Beschreibung"] (see #34364 for something similar). The same workaround should hopefully work here (using set_index instead of directly assigning). Since the index is immutable, direct assignment with a mutable object should generally be avoided (but clearly we should handle it more gracefully than a segfault).
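A minimal sketch of that workaround, for illustration only: the two small DataFrames below are hypothetical stand-ins for the sheets loaded with read_excel (the real files are not available), and only the column and row labels are taken from the report. Note that set_index drops the "Beschreibung" column by default; pass drop=False to keep it as a regular column as well.

```python
import pandas as pd

# Hypothetical stand-ins for the two Excel sheets; the values are made up.
template_df = pd.DataFrame({
    "Beschreibung": ["Fee EPEX ID Kundenhandel", "Direktvermarktung Vertragspreis"],
    "Short/Kauf/NEG": [1.0, 2.0],
    "Long/Verkauf/POS": [3.0, 4.0],
})
df = pd.DataFrame({
    "Beschreibung": ["Direktvermarktung Vertragspreis", "Fee EPEX ID Kundenhandel"],
    "Short/Kauf/NEG": [5.0, float("nan")],
    "Long/Verkauf/POS": [42.0, float("nan")],
})

# Instead of `frame.index = frame["Beschreibung"]`, let pandas build a new
# index from the column. The column is removed from the frame by default.
template_df = template_df.set_index("Beschreibung")
df = df.set_index("Beschreibung")

# Same steps as in the report, now on frames indexed via set_index.
template_df.loc["Fee EPEX ID Kundenhandel", "Short/Kauf/NEG"] = 0
template_df.update(df)  # overwrites with the non-NaN values from df
value = template_df.loc["Direktvermarktung Vertragspreis", "Long/Verkauf/POS"]
print(value)
```

Whether this avoids the crash with the original files cannot be confirmed from here; it only shows the suggested pattern on synthetic data.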

mzeitlin11 commented 3 years ago

@tharwan going to close in favor of #34364, but please reopen if you think it is a different issue.