Open · Krytic opened this issue 4 years ago
@Krytic:
I encountered a similar problem, and the underlying cause was inconsistent quoting in some of the CSV rows. Looking at your separator setting, is it possible that your data has quoting problems? If so, you could try different quoting options in pd.read_csv:
quoting : int or csv.QUOTE_* instance, default 0
Control field quoting behavior per csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
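For example, here is a minimal sketch of cycling through the quoting constants ("data.csv" is a placeholder path, not your actual file):

import csv
import pandas as pd

# Try each csv.QUOTE_* constant in turn until one parses cleanly.
for quoting in (csv.QUOTE_MINIMAL, csv.QUOTE_ALL,
                csv.QUOTE_NONNUMERIC, csv.QUOTE_NONE):
    try:
        df = pd.read_csv("data.csv", sep=r"\s+", quoting=quoting)
        print(f"quoting={quoting}: OK, shape={df.shape}")
        break
    except Exception as exc:
        print(f"quoting={quoting}: failed ({exc})")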
If the data is public, kindly share the link to investigate further.
@gfyoung: the 0.0000000E+00 entries are written by the script running before mine. Here is a sample of the data:
0.2799383E+02 0.4983019E+02 0.3888486E+03 0.7298623E-01 0.5966560E-03 0.3004870E+07 0.1782353E+07
0.2799383E+02 0.4983019E+02 0.4011484E+03 0.9574787E-01 0.5966560E-03 0.3004870E+07 0.1782353E+07
0.2799383E+02 0.4983019E+02 0.4047721E+03 0.1047314E+00 0.5966560E-03 0.3004870E+07 0.1782353E+07
0.2799383E+02 0.4983019E+02 0.4262738E+03 0.1530199E+00 0.5966560E-03 0.3004870E+07 0.1782353E+07
0.2799383E+02 0.4983019E+02 0.3646421E+03 0.8201706E-02 0.5966560E-03 0.3004870E+07 0.1782353E+07
0.2799383E+02 0.4983019E+02 0.3868649E+03 0.6432938E-01 0.5966560E-03 0.3004870E+07 0.1782353E+07
0.2799383E+02 0.4983019E+02 0.4359797E+03 0.1679820E+00 0.5966560E-03 0.3004870E+07 0.1782353E+07
0.2799383E+02 0.4983019E+02 0.3920825E+03 0.7787435E-01 0.5966560E-03 0.3004870E+07 0.1782353E+07
0.2799383E+02 0.4983019E+02 0.4022169E+03 0.1043048E+00 0.5966560E-03 0.3004870E+07 0.1782353E+07
0.2799383E+02 0.4983019E+02 0.3837477E+03 0.5791900E-01 0.5966560E-03 0.3004870E+07 0.1782353E+07
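As a sanity check on the format itself, this small self-contained snippet (rows copied from the sample above, column names taken from the code later in the thread) parses cleanly:

import io
import pandas as pd

sample = (
    "0.2799383E+02 0.4983019E+02 0.3888486E+03 0.7298623E-01 "
    "0.5966560E-03 0.3004870E+07 0.1782353E+07\n"
    "0.2799383E+02 0.4983019E+02 0.4011484E+03 0.9574787E-01 "
    "0.5966560E-03 0.3004870E+07 0.1782353E+07\n"
)
names = ["m1", "m2", "a0", "e0", "weight", "evolution_age", "rejuvenation_age"]
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", names=names)
print(df.dtypes)  # all float64: Fortran-style E notation parses as ordinary floats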
@SultanOrazbayev
@Krytic: OK, I'll give it a shot now.
I tried running the code you pasted above on the file; after 10+ minutes I interrupted the loading process (it was taking too long) and tried an alternative approach: decompressing first, then loading into pandas. That took about 2 minutes and loaded into memory without problems. Here's the code I ran:
import gzip

import pandas as pd

def from_file(filepath):
    # Column names for the seven whitespace-separated fields.
    name_hints = []
    name_hints.extend(["m1", "m2", "a0", "e0"])
    name_hints.extend(["weight", "evolution_age", "rejuvenation_age"])
    df = pd.read_csv(
        filepath,
        names=name_hints,
        sep=r"\s+",
    )
    return df

filename = "Remnant-Birth-bin-imf135_300-z008_StandardJJ.dat.gz"
# Decompress explicitly via gzip instead of relying on pandas'
# on-the-fly decompression.
with open(filename, 'rb') as f:
    df = from_file(gzip.GzipFile(fileobj=f))
Here's my pd.show_versions().
To me this suggests a problem with on-the-fly decompression for large files (since you report the script runs without problems on smaller files).
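If so, one possible workaround (a sketch under that assumption, not a confirmed fix) is to decompress to disk once and let pandas read the plain-text file:

import gzip
import shutil
import pandas as pd

src = "Remnant-Birth-bin-imf135_300-z008_StandardJJ.dat.gz"
dst = src[:-3]  # strip the ".gz" suffix

# Decompress to disk so pandas never has to decompress on the fly.
with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

names = ["m1", "m2", "a0", "e0", "weight", "evolution_age", "rejuvenation_age"]
df = pd.read_csv(dst, sep=r"\s+", names=names)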
I concur. Here are some more datasets to confirm:
[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandas.
[ ] (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
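The exact original sample is not preserved here; the following is a sketch of the direct-from-gzip load discussed in the thread, which is the code path that fails:

import pandas as pd

def from_file(filepath):
    # Seven whitespace-separated columns, as named elsewhere in the thread.
    names = ["m1", "m2", "a0", "e0", "weight",
             "evolution_age", "rejuvenation_age"]
    # pandas infers compression="gzip" from the .gz suffix and
    # decompresses on the fly while parsing.
    return pd.read_csv(filepath, names=names, sep=r"\s+")

df = from_file("Remnant-Birth-bin-imf135_300-z008_StandardJJ.dat.gz")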
Problem description
When loading sufficiently big files (mine are >800 MB on average), pd.read_csv fails with a Bus Error. I am loading in .gz files, as I believe pandas is able to decompress these automatically. I cannot reproduce this with a smaller dataset (I have tried on files with ~364,000 lines). I have run this code with gdb and obtained the following stack trace:
Expected Output
I would expect the data to be read successfully.
Output of pd.show_versions()