pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Segmentation faults in pd.read_csv for large files #35051

Open Krytic opened 4 years ago

Krytic commented 4 years ago

Code Sample, a copy-pastable example

import numpy as np
import pandas as pd

def from_file(filepath):
    name_hints = []
    name_hints.extend(['m1','m2','a0','e0'])
    name_hints.extend(['weight','evolution_age','rejuvenation_age'])

    df = pd.read_csv(filepath,
                     nrows=None,
                     names=name_hints,
                     sep=r'\s+',
                     engine='python',
                     dtype=np.float64)

    return df

df = from_file('data/big_data_set.dat.gz')

Problem description

When loading sufficiently big files (mine are >800 MB on average), pd.read_csv fails with a bus error (the gdb run below ends in a segmentation fault). I am loading .gz files, as I believe pandas is able to decompress these automatically.
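For reference, a minimal sketch (reusing the placeholder path from the snippet above) of making the decompression explicit instead of letting pandas infer it from the extension:

import pandas as pd

# pandas infers gzip compression from the .gz suffix by default;
# compression='gzip' makes that behaviour explicit.
df = pd.read_csv('data/big_data_set.dat.gz', sep=r'\s+', compression='gzip')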

I cannot reproduce this with a smaller dataset (I have tried files with ~364,000 lines). Running the code under gdb gives the following stack trace:

Thread 0x00002aaab53d0700 (most recent call first):
  File "/opt/nesi/CS400_centos7_bdw/Python/3.8.2-gimkl-2020a/lib/python3.8/threading.py", line 306 in wait
  File "/opt/nesi/CS400_centos7_bdw/Python/3.8.2-gimkl-2020a/lib/python3.8/threading.py", line 558 in wait
  File "/home/sric560/.local/lib/python3.8/site-packages/tqdm/_monitor.py", line 69 in run
  File "/opt/nesi/CS400_centos7_bdw/Python/3.8.2-gimkl-2020a/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/opt/nesi/CS400_centos7_bdw/Python/3.8.2-gimkl-2020a/lib/python3.8/threading.py", line 890 in _bootstrap

Current thread 0x00002aaaaaaea9c0 (most recent call first):
  File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 1763 in _infer_types
  File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 1708 in _convert_to_ndarrays
  File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 2528 in _convert_data
  File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 2464 in read
  File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 1133 in read
  File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 454 in _read
  File "/home/sric560/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 676 in parser_f
  File "/scale_wlg_persistent/filesets/home/sric560/masters/takahe/takahe/load.py", line 14 in from_file
  File "/scale_wlg_persistent/filesets/home/sric560/masters/takahe/takahe/load.py", line 35 in from_directory
  File "BigData.py", line 11 in

/var/spool/slurm/job13442044/slurm_script: line 10: 67748 Segmentation fault (core dumped) gdb -ex r --args python BigData.py

Expected Output

I would expect the data to be read successfully.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.8.2.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-693.2.2.el7.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_NZ.UTF-8
LOCALE : en_NZ.UTF-8

pandas : 1.0.5
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.2.3
setuptools : 45.2.0
Cython : 0.29.15
pytest : 5.3.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.0
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader : None
bs4 : None
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : None
tabulate : 0.8.6
xarray : 0.15.0
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : 0.48.0
gfyoung commented 4 years ago

@Krytic :

SultanOrazbayev commented 4 years ago

I encountered a similar problem, and the underlying cause was inconsistent quoting in some of the CSV rows. Looking at your separator setting, is it possible that your data has quoting problems? If so, you could try different quoting options in pd.read_csv:

quoting : int or csv.QUOTE_* instance, default 0
Control field quoting behavior per csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
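For example, a minimal sketch of trying QUOTE_NONE (the path and column names below are just taken from your snippet above, not a recommendation):

import csv

import pandas as pd

# Disable quote handling entirely while parsing the whitespace-separated data.
df = pd.read_csv(
    'data/big_data_set.dat.gz',
    sep=r'\s+',
    names=['m1', 'm2', 'a0', 'e0',
           'weight', 'evolution_age', 'rejuvenation_age'],
    quoting=csv.QUOTE_NONE,
)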

If the data is public, kindly share the link to investigate further.

Krytic commented 4 years ago

@gfyoung Here is a sample of the data:

   0.2799383E+02   0.4983019E+02   0.3888486E+03   0.7298623E-01   0.5966560E-03   0.3004870E+07   0.1782353E+07
   0.2799383E+02   0.4983019E+02   0.4011484E+03   0.9574787E-01   0.5966560E-03   0.3004870E+07   0.1782353E+07
   0.2799383E+02   0.4983019E+02   0.4047721E+03   0.1047314E+00   0.5966560E-03   0.3004870E+07   0.1782353E+07
   0.2799383E+02   0.4983019E+02   0.4262738E+03   0.1530199E+00   0.5966560E-03   0.3004870E+07   0.1782353E+07
   0.2799383E+02   0.4983019E+02   0.3646421E+03   0.8201706E-02   0.5966560E-03   0.3004870E+07   0.1782353E+07
   0.2799383E+02   0.4983019E+02   0.3868649E+03   0.6432938E-01   0.5966560E-03   0.3004870E+07   0.1782353E+07
   0.2799383E+02   0.4983019E+02   0.4359797E+03   0.1679820E+00   0.5966560E-03   0.3004870E+07   0.1782353E+07
   0.2799383E+02   0.4983019E+02   0.3920825E+03   0.7787435E-01   0.5966560E-03   0.3004870E+07   0.1782353E+07
   0.2799383E+02   0.4983019E+02   0.4022169E+03   0.1043048E+00   0.5966560E-03   0.3004870E+07   0.1782353E+07
   0.2799383E+02   0.4983019E+02   0.3837477E+03   0.5791900E-01   0.5966560E-03   0.3004870E+07   0.1782353E+07

@SultanOrazbayev

SultanOrazbayev commented 4 years ago

@Krytic: OK, I'll give it a shot now.

SultanOrazbayev commented 4 years ago

I tried running the code you pasted above on the file; after 10+ minutes I interrupted the loading process (it was taking too long) and tried an alternative approach: decompressing the file myself and handing the decompressed stream to pandas. This took about 2 minutes and loaded into memory without problems. Here's the code I ran:

import numpy as np
import pandas as pd
import gzip

def from_file(filepath):
    name_hints = []
    name_hints.extend(["m1", "m2", "a0", "e0"])
    name_hints.extend(["weight", "evolution_age", "rejuvenation_age"])

    df = pd.read_csv(
        filepath,
        names=name_hints,
        sep=r"\s+",
    )
    return df

filename = "Remnant-Birth-bin-imf135_300-z008_StandardJJ.dat.gz"

# Open the compressed file ourselves and hand the decompressed stream to pandas.
with open(filename, 'rb') as f:
    df = from_file(gzip.GzipFile(fileobj=f))

Here's my pd.show_versions().

```
INSTALLED VERSIONS
------------------
commit : None
python : 3.8.3.final.0
python-bits : 64
OS : Darwin
OS-release : 19.3.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.0.5
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.3.1.post20200616
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader : None
bs4 : None
bottleneck : None
fastparquet : 0.4.0
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.48.0
```

To me this suggests that there is a problem with on-the-fly decompression for large files (since you report the script runs without problems on smaller files).
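If that is the cause, one way to rule out on-the-fly decompression entirely (a sketch, assuming there is enough disk space for the uncompressed copy) is to decompress the file to disk once and point pd.read_csv at the plain-text version:

import gzip
import shutil

import pandas as pd

name_hints = ['m1', 'm2', 'a0', 'e0',
              'weight', 'evolution_age', 'rejuvenation_age']

# Decompress the .gz archive to disk once, so read_csv parses a plain text
# file and no streaming decompression happens during parsing.
with gzip.open('Remnant-Birth-bin-imf135_300-z008_StandardJJ.dat.gz', 'rb') as src:
    with open('Remnant-Birth-bin-imf135_300-z008_StandardJJ.dat', 'wb') as dst:
        shutil.copyfileobj(src, dst)

df = pd.read_csv('Remnant-Birth-bin-imf135_300-z008_StandardJJ.dat',
                 sep=r'\s+', names=name_hints)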

Krytic commented 4 years ago

I concur. Here are some more datasets to confirm: