pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Segmentation faults in pd.read_csv for large files #35051

Open Krytic opened 4 years ago

Krytic commented 4 years ago

Code Sample, a copy-pastable example

import numpy as np
import pandas as pd

def from_file(filepath):
    name_hints = []
    name_hints.extend(['m1','m2','a0','e0'])
    name_hints.extend(['weight','evolution_age','rejuvenation_age'])

    df = pd.read_csv(filepath,
                     nrows=None,
                     names=name_hints,
                     sep=r'\s+',
                     engine='python',
                     dtype=np.float64)

    return df

df = from_file('data/big_data_set.dat.gz')

Problem description

When loading sufficiently big files (mine are >800 MB on average), pd.read_csv fails with a bus error (the gdb run below ends in a segmentation fault). I am loading .gz files, as I believe pandas is able to decompress these automatically.
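For reference, a minimal sketch (reusing the placeholder path from the snippet above) of making the decompression explicit instead of letting pandas infer it from the extension:

import pandas as pd

# pandas infers gzip compression from the .gz suffix by default;
# compression='gzip' makes that behaviour explicit.
df = pd.read_csv('data/big_data_set.dat.gz', sep=r'\s+', compression='gzip')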

I cannot reproduce this with a smaller dataset (I have tried files with ~364,000 lines). Running the code under gdb gives the following stack trace:

Thread 0x00002aaab53d0700 (most recent call first):
  File "/opt/nesi/CS400_centos7_bdw/Python/3.8.2-gimkl-2020a/lib/python3.8/threading.py", line 306 in wait
  File "/opt/nesi/CS400_centos7_bdw/Python/3.8.2-gimkl-2020a/lib/python3.8/threading.py", line 558 in wait
  File "/home/sric560/.local/lib/python3.8/site-packages/tqdm/_monitor.py", line 69 in run
  File "/opt/nesi/CS400_centos7_bdw/Python/3.8.2-gimkl-2020a/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/opt/nesi/CS400_centos7_bdw/Python/3.8.2-gimkl-2020a/lib/python3.8/threading.py", line 890 in _bootstrap

Current thread 0x00002aaaaaaea9c0 (most recent call first):
  File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 1763 in _infer_types
  File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 1708 in _convert_to_ndarrays
  File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 2528 in _convert_data
  File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 2464 in read
  File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 1133 in read
  File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 454 in _read
  File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 676 in parser_f
  File "/scale_wlg_persistent/filesets/home/sric560/masters/takahe/takahe/load.py", line 14 in from_file
  File "/scale_wlg_persistent/filesets/home/sric560/masters/takahe/takahe/load.py", line 35 in from_directory
  File "BigData.py", line 11 in

/var/spool/slurm/job13442044/slurm_script: line 10: 67748 Segmentation fault (core dumped) gdb -ex r --args python BigData.py

Expected Output

I would expect the data to be read successfully.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.8.2.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-693.2.2.el7.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_NZ.UTF-8
LOCALE : en_NZ.UTF-8

pandas : 1.0.5
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.2.3
setuptools : 45.2.0
Cython : 0.29.15
pytest : 5.3.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.0
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader : None
bs4 : None
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : None
tabulate : 0.8.6
xarray : 0.15.0
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : 0.48.0
gfyoung commented 4 years ago

@Krytic :

SultanOrazbayev commented 4 years ago

I encountered a similar problem, and the underlying cause was inconsistent quoting in some of the CSV rows. Looking at your separator setting, is it possible that your data has quoting problems? If so, you could try different quoting options in pd.read_csv:

quoting : int or csv.QUOTE_* instance, default 0
Control field quoting behavior per csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
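For example, a minimal sketch of trying QUOTE_NONE (the path and column names below are just taken from your snippet above, not a recommendation):

import csv

import pandas as pd

# Disable quote handling entirely while parsing the whitespace-separated data.
df = pd.read_csv(
    'data/big_data_set.dat.gz',
    sep=r'\s+',
    names=['m1', 'm2', 'a0', 'e0',
           'weight', 'evolution_age', 'rejuvenation_age'],
    quoting=csv.QUOTE_NONE,
)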

If the data is public, kindly share the link to investigate further.

Krytic commented 4 years ago

@gfyoung Here is a sample of the data:

   0.2799383E+02   0.4983019E+02   0.3888486E+03   0.7298623E-01   0.5966560E-03   0.3004870E+07   0.1782353E+07
   0.2799383E+02   0.4983019E+02   0.4011484E+03   0.9574787E-01   0.5966560E-03   0.3004870E+07   0.1782353E+07
   0.2799383E+02   0.4983019E+02   0.4047721E+03   0.1047314E+00   0.5966560E-03   0.3004870E+07   0.1782353E+07
   0.2799383E+02   0.4983019E+02   0.4262738E+03   0.1530199E+00   0.5966560E-03   0.3004870E+07   0.1782353E+07
   0.2799383E+02   0.4983019E+02   0.3646421E+03   0.8201706E-02   0.5966560E-03   0.3004870E+07   0.1782353E+07
   0.2799383E+02   0.4983019E+02   0.3868649E+03   0.6432938E-01   0.5966560E-03   0.3004870E+07   0.1782353E+07
   0.2799383E+02   0.4983019E+02   0.4359797E+03   0.1679820E+00   0.5966560E-03   0.3004870E+07   0.1782353E+07
   0.2799383E+02   0.4983019E+02   0.3920825E+03   0.7787435E-01   0.5966560E-03   0.3004870E+07   0.1782353E+07
   0.2799383E+02   0.4983019E+02   0.4022169E+03   0.1043048E+00   0.5966560E-03   0.3004870E+07   0.1782353E+07
   0.2799383E+02   0.4983019E+02   0.3837477E+03   0.5791900E-01   0.5966560E-03   0.3004870E+07   0.1782353E+07

@SultanOrazbayev

SultanOrazbayev commented 4 years ago

@Krytic: OK, I'll give it a shot now.

SultanOrazbayev commented 4 years ago

I tried running the code you pasted above on the file; after 10+ minutes I interrupted the loading process (it was taking too long) and tried an alternative approach: decompressing the file myself and handing the decompressed stream to pandas. This took about 2 minutes and loaded into memory without problems. Here's the code I ran:

import numpy as np
import pandas as pd
import gzip

def from_file(filepath):
    name_hints = []
    name_hints.extend(["m1", "m2", "a0", "e0"])
    name_hints.extend(["weight", "evolution_age", "rejuvenation_age"])

    df = pd.read_csv(
        filepath,
        names=name_hints,
        sep=r"\s+",
    )
    return df

filename = "Remnant-Birth-bin-imf135_300-z008_StandardJJ.dat.gz"

# Open the compressed file ourselves and hand the decompressed stream to pandas.
with open(filename, 'rb') as f:
    df = from_file(gzip.GzipFile(fileobj=f))

Here's my pd.show_versions().

```
INSTALLED VERSIONS
------------------
commit : None
python : 3.8.3.final.0
python-bits : 64
OS : Darwin
OS-release : 19.3.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.0.5
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.3.1.post20200616
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader : None
bs4 : None
bottleneck : None
fastparquet : 0.4.0
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.48.0
```

To me this suggests that there is a problem with on-the-fly decompression for large files (since you report the script runs without problems on smaller files).
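If that is the cause, one way to rule out on-the-fly decompression entirely (a sketch, assuming there is enough disk space for the uncompressed copy) is to decompress the file to disk once and point pd.read_csv at the plain-text version:

import gzip
import shutil

import pandas as pd

name_hints = ['m1', 'm2', 'a0', 'e0',
              'weight', 'evolution_age', 'rejuvenation_age']

# Decompress the .gz archive to disk once, so read_csv parses a plain text
# file and no streaming decompression happens during parsing.
with gzip.open('Remnant-Birth-bin-imf135_300-z008_StandardJJ.dat.gz', 'rb') as src:
    with open('Remnant-Birth-bin-imf135_300-z008_StandardJJ.dat', 'wb') as dst:
        shutil.copyfileobj(src, dst)

df = pd.read_csv('Remnant-Birth-bin-imf135_300-z008_StandardJJ.dat',
                 sep=r'\s+', names=name_hints)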

Krytic commented 4 years ago

I concur. Here are some more datasets to confirm: