Open bkamins opened 5 years ago
I can replicate it. Has this worked better in the past and regressed? Or did this never work?
I cannot tell, as I have only recently increased the size of the tests; the old ones were just too small to show anything meaningful.
I've stumbled upon this issue, so some comments for reference.
MWE
using CSVFiles, DataFrames
bigdf = DataFrame(rand(Bool, 10^5, 500))
bigdf[!, 1] = Int.(bigdf[!, 1])
bigdf[!, 2] = bigdf[!, 2] .+ 0.5
bigdf[!, 3] = string.(bigdf[!, 3], ", as string")
csvfileswrite1 = bigdf |> save(joinpath(@__DIR__, "bigdf2.csv"))
load(joinpath(@__DIR__, "bigdf2.csv")) |> DataFrame # Here it fails to load
I've tried varying the number of rows and columns and got the following results for the load step
# 10^2 x 50 = 973.310 μs (30338 allocations: 3.26 MiB)
# 10^2 x 500 = 58.562 ms (307888 allocations: 238.42 MiB)
# 10^3 x 500 = 605.530 ms (2551391 allocations: 2.21 GiB)
# 10^4 x 500 = 21.693 s (24961891 allocations: 22.03 GiB)
So the load time is more or less linear in the number of rows (the nonlinear increase from 10^3 to 10^4 rows is presumably because I ran out of memory and the OS started swapping).
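For reference, the scaling can be checked by computing the growth factors between consecutive measurements. This is only a sketch using the four timings quoted above, converted to seconds; the names timings and ratios are just illustrative.

```julia
# Timings copied from the results above, converted to seconds.
timings = [
    (rows = 10^2, cols = 50,  seconds = 973.310e-6),
    (rows = 10^2, cols = 500, seconds = 58.562e-3),
    (rows = 10^3, cols = 500, seconds = 605.530e-3),
    (rows = 10^4, cols = 500, seconds = 21.693),
]

# For each consecutive pair of measurements, compare how much the table grew
# (in cells) with how much the load time grew.
ratios = map(zip(timings[1:end-1], timings[2:end])) do (prev, next)
    (cells = (next.rows * next.cols) / (prev.rows * prev.cols),
     time  = next.seconds / prev.seconds)
end

for r in ratios
    println("cells x", r.cells, " -> time x", round(r.time; digits = 1))
end
```

Each step grows the table by a factor of 10. A time factor close to 10 means linear scaling for that step, and a clearly larger factor means that step scales worse than linearly (or, as noted for the last step, that swapping kicked in).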
Profiling shows the following
using Profile
bigdf = DataFrame(rand(Bool, 10^2, 500))
bigdf[!, 1] = Int.(bigdf[!, 1])
bigdf[!, 2] = bigdf[!, 2] .+ 0.5
bigdf[!, 3] = string.(bigdf[!, 3], ", as string")
csvfileswrite1 = bigdf |> save(joinpath(@__DIR__, "bigdf2.csv"))
load(joinpath(@__DIR__, "bigdf2.csv")) |> DataFrame
Profile.clear()
@profile load(joinpath(@__DIR__, "bigdf2.csv")) |> DataFrame
Profile.print(format = :flat, sortedby = :count)
omitting noise
126 ./tuple.jl 24 getindex
133 /home/skoffer/.julia/dev/TextParse/src/record.jl 38 macro expansion
134 /home/skoffer/.julia/dev/TextParse/src/record.jl 50 tryparsesetindex(::TextParse.Record{Tuple{TextParse....
136 /home/skoffer/.julia/dev/TextParse/src/csv.jl 337 #_csvread_internal#52(::Bool, ::Char, ::Char, ::Noth...
136 /home/skoffer/.julia/dev/TextParse/src/csv.jl 600 parsefill!(::TextParse.VectorBackedUTF8String, ::Tex...
155 /home/skoffer/.julia/dev/TextParse/src/util.jl 27 macro expansion
157 ./io.jl 298 #open#271(::Base.Iterators.Pairs{Union{},Union{},Tup...
157 ./io.jl 296 open
157 /home/skoffer/.julia/dev/TextParse/src/csv.jl 116 (::TextParse.var"#38#40"{Base.Iterators.Pairs{Symbol...
157 /home/skoffer/.julia/dev/TextParse/src/csv.jl 113 #_csvread_f#36
157 /home/skoffer/.julia/dev/TextParse/src/csv.jl 80 #csvread#16(::Base.Iterators.Pairs{Symbol,UnionAll,T...
157 /home/skoffer/.julia/packages/CSVFiles/C68zw/src/CSVFiles.jl 103 _loaddata(::CSVFiles.CSVFile)
157 /home/skoffer/.julia/packages/CSVFiles/C68zw/src/CSVFiles.jl 116 get_columns_copy_using_missing(::CSVFiles.CSVFile)
It looks like the main problem is actually in TextParse, specifically in the tryparsesetindex function in record.jl.
I've used the latest master version of TextParse, commit "8f9ac08ee110467ba43e52d3449c74ab34391f06".
In https://github.com/bkamins/Julia-DataFrames-Tutorial/blob/master/04_loadsave.ipynb I had to disable CSVFiles.jl file reading tests as it failed to load a small file (that reads in a few seconds otherwise) in any reasonable time.
The file read has 500 columns and 500'000 so it is relatively small.
@davidanthoff - do you think this issue is solvable?