queryverse / TextParse.jl

A bunch of fast text parsing tools

Segfault when parsing data #27

Closed andreasnoack closed 7 years ago

andreasnoack commented 7 years ago
julia> raw = JuliaDB.loadfiles(joinpath.(datadir, files), '\t', header_exists = false,
           colnames = colnames, usecache = false, indexcols = colnames[1:2],
           colparsers = [String, dateformat"yyyymmddHHMMSS", Int, String, String,
                         String, String, String, String, String,
                         String, String, String, String, String,
                         Float32, String, String, String, String,
                         String, String, String, String, String,
                         String, String]);
Metadata for 0 / 201 files can be loaded from cache.
Reading 201 csv files totalling 3.393 GiB...
Could not determine which type to promote column to.
Error reading file /data/andreasnoack/gdelt/20160102043000.gkg.csv
ERROR:
signal (11): Segmentation fault
while loading no file, in expression starting on line 0
getlineat at /home/andreasnoack/.julia/v0.6/TextParse/src/util.jl:182
unknown function (ip: 0x7fccde005db5)
jl_call_fptr_internal at /home/centos/buildbot/slave/package_tarball64/build/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /home/centos/buildbot/slave/package_tarball64/build/src/julia_internal.h:3$
8 [inlined]
jl_apply_generic at /home/centos/buildbot/slave/package_tarball64/build/src/gf.c:1933
showerror at /home/andreasnoack/.julia/v0.6/TextParse/src/csv.jl:408

on

julia> versioninfo()
Julia Version 0.6.0
Commit 9036443 (2017-06-19 13:05 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E7- 8850  @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Nehalem)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, westmere)

julia> Pkg.status("TextParse")
 - TextParse                     0.1.8              master

shashi commented 7 years ago

The segfault seems to come from a WeakRefString being captured in a CapturedException and then serialized, which zeroes the pointer.

The actual error is

Could not determine which type to promote column to.
Error reading file ./20160102043000.gkg.csv

Parse error at line 1306 at char 12476:
...U/VRVlo2XZZII/AAAAAAAAAW4/7lirGJoV3XA/s1600/3.PNG;...
____________________________________________________^
CSV column 20 is expected to be: TextParse.Field{String,TextParse.Quoted{String,TextParse.StringToken{String}}}("<string>", true, true, false, false, Nullable{Char}('\t'))

shashi commented 7 years ago

Sure enough, that line got truncated at 20 fields... The error is misleading. Fix upcoming.
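
For anyone who wants to check a file for rows like this before loading it, here is a minimal sketch in current Julia (1.x, not the 0.6 used in this thread). `short_rows` is a hypothetical helper, not part of TextParse, and it assumes fields never contain a quoted delimiter, so it simply counts tabs on each physical line.

```julia
# Hypothetical helper (not a TextParse function): list the lines whose
# tab-separated field count differs from `expected`.
# Assumption: no quoted fields contain the delimiter.
function short_rows(io::IO, expected::Int; delim::Char = '\t')
    bad = Tuple{Int,Int}[]                      # (line number, fields found)
    for (lineno, line) in enumerate(eachline(io))
        n = count(c -> c == delim, line) + 1    # fields = delimiters + 1
        n == expected || push!(bad, (lineno, n))
    end
    return bad
end

# Small in-memory sample: the second line has only 2 of 3 fields.
sample = IOBuffer("a\tb\tc\nd\te\nf\tg\th\n")
report = short_rows(sample, 3)
```

On a real file this could be called as `open(io -> short_rows(io, 27), path)`, where 27 matches the number of column parsers in the original call.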

shashi commented 7 years ago

The issue is fixed in that the error messages are correct now. I wonder if this is a problem with the data export itself... Should TextParse treat the remaining 7 fields as null? Or is an error saying "Expected more fields at the end of line" justified?
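
To make the two options concrete, here is a rough sketch in current Julia (1.x). `normalize_row` and its `strict` flag are hypothetical names for illustration and do not reflect how TextParse behaves. It uses `missing` where this thread's Julia 0.6 would have used a null value.

```julia
# Hypothetical illustration of the two policies for a row with too few fields:
# strict = true  -> raise an error naming the expected and actual counts
# strict = false -> pad the row with `missing` up to `expected` fields
function normalize_row(fields::Vector{String}, expected::Int; strict::Bool = true)
    n = length(fields)
    n == expected && return Vector{Union{Missing,String}}(fields)
    if n > expected || strict
        throw(ArgumentError("expected $expected fields, found $n"))
    end
    return vcat(fields, fill(missing, expected - n))
end

padded = normalize_row(["a", "b"], 4; strict = false)
```

Padding keeps the load going but hides a broken export, while the strict mode surfaces the bad line right away.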

shashi commented 7 years ago

I was able to fix the issue by joining lines 1307 and 1308. It seems there's a newline halfway through a string field. Could this have been introduced accidentally?
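
The manual join could be automated roughly as follows. This is only a sketch in current Julia (1.x) with a hypothetical `rejoin_rows` helper. It assumes a stray newline only ever falls inside a field, so gluing a short line onto the next one restores the record, and it does not handle quoted delimiters.

```julia
# Hypothetical helper: merge physical lines until each record has at least
# `expected` tab-separated fields.
# Assumption: a short line is always the first half of a split record.
function rejoin_rows(lines, expected::Int; delim::Char = '\t')
    ncols(s) = count(c -> c == delim, s) + 1
    out = String[]
    pending = nothing                       # partial record carried forward
    for line in lines
        pending = pending === nothing ? String(line) : pending * line
        if ncols(pending) >= expected
            push!(out, pending)
            pending = nothing
        end
    end
    pending === nothing || push!(out, pending)   # keep a trailing partial record
    return out
end

# The record "d\tex\tf" was split after "d\te" by a stray newline.
rows = rejoin_rows(["a\tb\tc", "d\te", "x\tf", "g\th\ti"], 3)
```

A record that is genuinely missing fields would be merged with the following line by this approach, so the result should be checked before the cleaned file is trusted.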

andreasnoack commented 7 years ago

Great. Just looked at the raw strings and the newline is definitely wrong.