invenia / JLSO.jl

Julia Serialized Object (JLSO) file format for storing checkpoint data.
MIT License

Crashes for very large dataframes #21

Open oxinabox opened 5 years ago

oxinabox commented 5 years ago

I've not looked into this, but @xiaodaigh reports that JLSO crashes on the Fannie Mae 2004Q3 data (2.7 GB) on Julia 1.2.0-rc1.0.

Gist for reproducing: https://gist.github.com/xiaodaigh/2b9c1b6eb068fb8b3dcd1b1f2a55facd
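
A minimal sketch of the kind of reproduction in the gist (the file path, delimiter, and pair-based JLSO.save call are assumptions, not the exact gist code):

using CSV, DataFrames, JLSO

# Load a large (~2.7 GB) Fannie Mae performance file; the '|' delimiter and
# missing header row are assumptions about the file layout.
df = CSV.read("Performance_2004Q3.txt", DataFrame; delim='|', header=false)

# Writing the DataFrame back out with JLSO is the step that crashes.
JLSO.save("perf.jlso", :df => df)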

rofinn commented 5 years ago

A stacktrace would have been helpful, as those snippets aren't the most general.

oxinabox commented 5 years ago

Yes, I was just getting the info down before it was lost to Slack's black hole.

rofinn commented 5 years ago

NOTE: Fannie Mae 2004Q3 data (2.7 GB) isn't what's in the gist, and I don't see a 2.7 GB file in the documented dataset.

Running the code on Julia 1.0.3 with Performance_2000Q4.txt (1.0 GB) seems to work fine. I'll try testing on 1.2, as it's possible there's a bug in the more recent Julia releases that needs to be fixed.

JLSO is one of the slowest formats to read and write, but it might be worth updating the benchmarks to also consider file size, because by default we compress our data.

write_perf = [0.0, 0.0, 74.5112, 61.946, 147.795, 91.5526, 23.0511, 10.9649, 485.467, 795.006]
read_perf = [0.0, 0.0, 19.1525, 60.1645, 0.00758378, 0.00147431, 16.6633, 12.3439, 96.9803, 49.7399]

The last 2 entries are both JLSO files.
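
A sketch of how the benchmarks could track file size alongside the timings (the variable df and the file name are placeholders, not the actual benchmark script):

using JLSO

t_write = @elapsed JLSO.save("bench.jlso", :df => df)
t_read  = @elapsed JLSO.load("bench.jlso")
size_mb = filesize("bench.jlso") / 1024^2
println("write=$(t_write)s read=$(t_read)s size=$(size_mb) MB")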

rofinn commented 5 years ago

To push things a little further, I also couldn't get it to throw an error with a 9 GB CSV: https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory

Fun fact: the default JLSO file format compresses the 9.1 GB file down to < 800 MB, while JLD2 only compresses it to 2.6 GB on disk. I guess this is why JLSO is still pretty handy for our use case (e.g., uploading lots of files to S3).
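
A sketch of that size comparison, assuming the CSV has already been read into a DataFrame df (file names are placeholders; JLD2's @save is used with its default settings):

using JLSO, JLD2

JLSO.save("inventory.jlso", :df => df)  # JLSO compresses by default
@save "inventory.jld2" df               # plain JLD2 write

for f in ("inventory.jlso", "inventory.jld2")
    println(f, ": ", round(filesize(f) / 1024^3; digits=2), " GiB")
end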

oxinabox commented 5 years ago

Oooh, a chance to use DataDepsGenerators

julia> using DataDepsGenerators

julia> println(generate("https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory"))
register(DataDep(
    "Seattle Library Collection Inventory",
    """
    Dataset: Seattle Library Collection Inventory
    Website: https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory
    Author: City of Seattle
    Date of Publication: August 1, 2019
    License: https://creativecommons.org/publicdomain/zero/1.0/

    ### Content

    The Seattle Public Library's collection inventory.

    ### Context

    This is a dataset hosted by the City of Seattle. The city has an open data platform found [here](https://data.seattle.gov/) and they update their information according the amount of data that is brought in. Explore the City of Seattle using Kaggle and all of the data sources available through the City of Seattle [organization page](https://www.kaggle.com/city-of-seattle)!

    * Update Frequency: This dataset is updated monthly.

    ### Acknowledgements

    This dataset is maintained using Socrata's API and Kaggle's API. [Socrata](https://socrata.com/) has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public.

    [Cover photo](https://unsplash.com/photos/VphuLHwuyks) by [Alexandra Kirr](https://unsplash.com/@alexkirrthegirl) on [Unsplash](https://unsplash.com/)
    _Unsplash Images are distributed under a unique [Unsplash License](https://unsplash.com/license)._
    """,
    ["https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory/downloads/seattle-library-collection-inventory.zip/3", "https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory/downloads/CollectionInventory_Codes_EXCLUDED_INCLUDED.xlsx/3", "https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory/downloads/Library Collection Inventory FAQs.pdf/3", "https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory/downloads/library-collection-inventory.csv/3", "https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory/downloads/socrata_metadata.json/3"],
))

rofinn commented 5 years ago

Yep, can't reproduce on 1.2 either.

write_perf = [8.97624532, 6.488284407, 37.267838932, 24.014278148, 67.626921068, 64.134563347, 23.585749921, 12.960553185, 311.955533758, 318.826512276]
read_perf = [11.179993041, 8.72340904, 21.159888766, 10.411692382, 0.887963974, 0.001172795, 14.963994836, 7.80005972, 43.608864438, 43.488846143]

Looks like the performance is more consistent at least :)

My best guess is that this is a Windows-specific issue... possibly with the compression library.

xiaodaigh commented 5 years ago

> NOTE: Fannie Mae 2004Q3 data (2.7 GB) isn't what's in the gist, and I don't see a 2.7 GB file in the documented dataset.

This is the link to the file that contains 2004Q3: http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000-2007.tgz

In general, more data can be sourced from https://docs.rapids.ai/datasets/mortgage-data

xiaodaigh commented 5 years ago

This is the error I get on Julia 1.3-rc2:

ERROR: InexactError: trunc(Int32, 2738917277)
Stacktrace:
 [1] throw_inexacterror(::Symbol, ::Type{Int32}, ::Int64) at .\boot.jl:560
 [2] checked_trunc_sint at .\boot.jl:582 [inlined]
 [3] toInt32 at .\boot.jl:619 [inlined]
 [4] Int32 at .\boot.jl:709 [inlined]
 [5] bson_primitive at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:14 [inlined]
 [6] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::String, ::Array{UInt8,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
 [7] bson_doc(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Pair{String,Array{UInt8,1}},1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
 [8] bson_primitive(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:37
 [9] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::String, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
 [10] bson_doc(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Pair{String,Array{Any,1}},1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
 [11] bson_primitive(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:37
 [12] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Symbol, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
 [13] bson_doc(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Dict{Symbol,Any}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
 [14] bson_primitive at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:36 [inlined]
 [15] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::String, ::Dict{Symbol,Any}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
 [16] bson_doc(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Pair{String,Dict{Symbol,Any}},1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
 [17] bson_primitive(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:37
 [18] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::String, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
 [19] bson_doc(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Pair{String,Array{Any,1}},1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
 [20] bson_primitive(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:37
 [21] bson_pair(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Symbol, ::Array{Any,1}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:22
 [22] bson_doc(::IOStream, ::Dict{Symbol,Any}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:28
 [23] bson_primitive(::IOStream, ::Dict{Symbol,Any}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:36
 [24] bson(::IOStream, ::Dict{String,Dict{String,V} where V}) at C:\Users\ZJ.DAI\.julia\packages\BSON\XPZLD\src\write.jl:81
 [25] write(::IOStream, ::JLSOFile) at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:7
 [26] #save#4 at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:59 [inlined]
 [27] save at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:59 [inlined]
 [28] #7 at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:61 [inlined]
 [29] #open#271(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(open), ::JLSO.var"##7#8"{Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},Tuple{DataFrames.DataFrame}}, ::String, ::Vararg{String,N} where N) at .\io.jl:298
 [30] open at .\io.jl:296 [inlined]
 [31] #save#6 at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:61 [inlined]
 [32] save(::String, ::DataFrames.DataFrame) at C:\Users\ZJ.DAI\.julia\packages\JLSO\1C0be\src\file_io.jl:61
 [33] top-level scope at util.jl:155
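
The failing value makes the root cause visible: the serialized payload is 2,738,917,277 bytes, which exceeds typemax(Int32), so the 32-bit length conversion overflows:

julia> typemax(Int32)
2147483647

julia> Int32(2738917277)
ERROR: InexactError: trunc(Int32, 2738917277)
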
rofinn commented 5 years ago

Interesting: it looks like this is a known issue with the BSON spec. Apparently, array types are indexed with an Int32. I'm not entirely sure why I wasn't able to hit the condition with the other large files; maybe a difference in the efficiency of serialization/compression on Windows? One option could be to split these large array primitives into multiple parts, as suggested in the BSON.jl issues.
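
An illustrative sketch of the chunking idea (not JLSO's or BSON.jl's actual code): split a large byte vector into parts that each fit within BSON's Int32 length field, and store the list of parts instead of one oversized array.

# Leave some headroom below typemax(Int32) for BSON document overhead.
const MAX_CHUNK = typemax(Int32) - 1024

# Split `bytes` into sub-arrays of at most MAX_CHUNK bytes each.
function chunk_bytes(bytes::Vector{UInt8})
    n = length(bytes)
    return [bytes[i:min(i + MAX_CHUNK - 1, n)] for i in 1:MAX_CHUNK:n]
end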

samuela commented 3 years ago

Just ran into this issue in #74. It would be nice to at least get a more informative error message here, possibly linking to this issue. Right now it's really inscrutable.

rofinn commented 3 years ago

Alright, I think I came up with an actual solution to this problem in #75 (vs. just improving the error message). The gist is that we can drop the serialized object bytes from the BSON doc and write them manually afterwards, allowing us to save much larger files.
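
A hedged sketch of that idea (the function and the length-prefix framing are illustrative assumptions, not the actual #75 implementation):

using BSON

# Write a metadata-only BSON doc, then append the raw serialized bytes with
# an Int64 length prefix, so the payload never passes through BSON's
# Int32-length encoding.
function write_jlso_sketch(io::IO, doc::Dict, payload::Vector{UInt8})
    bson(io, doc)                      # small metadata document
    write(io, Int64(length(payload)))  # 64-bit length prefix (assumption)
    write(io, payload)                 # raw serialized object bytes
end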