queryverse / CSVFiles.jl

FileIO.jl integration for CSV files

File reading time #58

Open bkamins opened 5 years ago

bkamins commented 5 years ago

In https://github.com/bkamins/Julia-DataFrames-Tutorial/blob/master/04_loadsave.ipynb I had to disable CSVFiles.jl file reading tests as it failed to load a small file (that reads in a few seconds otherwise) in any reasonable time.

The file has 500 columns and 500'000 rows, so it is relatively small.

@davidanthoff - do you think this issue is solvable?

davidanthoff commented 5 years ago

I can replicate it. Has this worked better in the past and regressed? Or did this never work?

bkamins commented 5 years ago

I cannot tell, as I have only recently increased the size of the tests; the old ones were just too small to show anything meaningful.

Arkoniak commented 4 years ago

I've stumbled upon this issue, so here are some comments for reference.

MWE

using DataFrames, CSVFiles  # CSVFiles provides load and save

# 10^5 rows x 500 columns, with mixed column types in the first three columns
bigdf = DataFrame(rand(Bool, 10^5, 500))
bigdf[!, 1] = Int.(bigdf[!, 1])
bigdf[!, 2] = bigdf[!, 2] .+ 0.5
bigdf[!, 3] = string.(bigdf[!, 3], ", as string")

csvfileswrite1 = bigdf |> save(joinpath(@__DIR__, "bigdf2.csv"))

load(joinpath(@__DIR__, "bigdf2.csv")) |> DataFrame  # Here it fails to load

I've tried varying the number of rows and columns (rows x columns below) and got the following results

# 10^2 x 50 = 973.310 μs (30338 allocations: 3.26 MiB)
# 10^2 x 500 = 58.562 ms (307888 allocations: 238.42 MiB)
# 10^3 x 500 = 605.530 ms (2551391 allocations: 2.21 GiB)
# 10^4 x 500 = 21.693 s (24961891 allocations: 22.03 GiB)

So it's more or less linear in time with respect to the number of rows (the nonlinear increase from 10^3 to 10^4 rows is presumably because I ran out of memory and the OS started swapping).
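As a quick sanity check on the scaling, the growth factors between consecutive measurements can be computed directly. This is only a sketch of the arithmetic: the numbers are copied from the results above, while the variable names (`runs`, `ratios`) are made up for illustration.

```julia
# Timings (seconds) and allocated memory (MiB), copied from the results above.
runs = [
    (rows = 10^2, cols = 50,  time = 973.310e-6, mem = 3.26),
    (rows = 10^2, cols = 500, time = 58.562e-3,  mem = 238.42),
    (rows = 10^3, cols = 500, time = 605.530e-3, mem = 2.21 * 1024),
    (rows = 10^4, cols = 500, time = 21.693,     mem = 22.03 * 1024),
]

# Growth factors between consecutive runs; every step scales one dimension by 10x.
ratios = [(time = b.time / a.time, mem = b.mem / a.mem)
          for (a, b) in zip(runs[1:end-1], runs[2:end])]

for ((a, b), r) in zip(zip(runs[1:end-1], runs[2:end]), ratios)
    println(a.rows, "x", a.cols, " -> ", b.rows, "x", b.cols,
            ": time x", round(r.time, digits = 1),
            ", memory x", round(r.mem, digits = 1))
end
```

Going from 50 to 500 columns costs roughly 60x in time and 73x in memory, so the scaling in the number of columns looks worse than linear, whereas allocations grow about 10x for each 10x increase in rows. With only one data point for the column dimension, this is a hint rather than a conclusion.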

Profiling shows the following

using DataFrames, CSVFiles, Profile

# Smaller table (10^2 rows x 500 columns) so that profiling finishes quickly
bigdf = DataFrame(rand(Bool, 10^2, 500))
bigdf[!, 1] = Int.(bigdf[!, 1])
bigdf[!, 2] = bigdf[!, 2] .+ 0.5
bigdf[!, 3] = string.(bigdf[!, 3], ", as string")

csvfileswrite1 = bigdf |> save(joinpath(@__DIR__, "bigdf2.csv"))

# Warm-up run, so that compilation does not end up in the profile
load(joinpath(@__DIR__, "bigdf2.csv")) |> DataFrame

Profile.clear()
@profile load(joinpath(@__DIR__, "bigdf2.csv")) |> DataFrame
Profile.print(format = :flat, sortedby = :count)

omitting noise

   126 ./tuple.jl                                                            24 getindex                                               
   133 /home/skoffer/.julia/dev/TextParse/src/record.jl                      38 macro expansion                                        
   134 /home/skoffer/.julia/dev/TextParse/src/record.jl                      50 tryparsesetindex(::TextParse.Record{Tuple{TextParse....
   136 /home/skoffer/.julia/dev/TextParse/src/csv.jl                        337 #_csvread_internal#52(::Bool, ::Char, ::Char, ::Noth...
   136 /home/skoffer/.julia/dev/TextParse/src/csv.jl                        600 parsefill!(::TextParse.VectorBackedUTF8String, ::Tex...
   155 /home/skoffer/.julia/dev/TextParse/src/util.jl                        27 macro expansion                                        
   157 ./io.jl                                                              298 #open#271(::Base.Iterators.Pairs{Union{},Union{},Tup...
   157 ./io.jl                                                              296 open                                                   
   157 /home/skoffer/.julia/dev/TextParse/src/csv.jl                        116 (::TextParse.var"#38#40"{Base.Iterators.Pairs{Symbol...
   157 /home/skoffer/.julia/dev/TextParse/src/csv.jl                        113 #_csvread_f#36                                         
   157 /home/skoffer/.julia/dev/TextParse/src/csv.jl                         80 #csvread#16(::Base.Iterators.Pairs{Symbol,UnionAll,T...
   157 /home/skoffer/.julia/packages/CSVFiles/C68zw/src/CSVFiles.jl         103 _loaddata(::CSVFiles.CSVFile)                          
   157 /home/skoffer/.julia/packages/CSVFiles/C68zw/src/CSVFiles.jl         116 get_columns_copy_using_missing(::CSVFiles.CSVFile)     

It looks like the main problem is actually in TextParse, specifically in the tryparsesetindex function in record.jl.

I used the latest master version of TextParse, commit "8f9ac08ee110467ba43e52d3449c74ab34391f06"