mbraceproject / FsPickler

A fast multi-format message serializer for .NET
http://mbraceproject.github.io/FsPickler/
MIT License
324 stars 52 forks

Support for serializing large datasets #38

Closed Rickasaurus closed 8 years ago

Rickasaurus commented 9 years ago

I was trying to use FsPickler to pull a largish dataset (~27 million records) to disk and found that about 30 minutes in it failed with:

Nessos.FsPickler.FsPicklerException: Error serializing instance of type System.String[] ---> System.Runtime.Serialization.SerializationException: The internal array cannot expand to greater than Int32.MaxValue elements.

This occurs in a call to ObjectIDGenerator.Rehash() from ObjectIDGenerator.GetID(Object obj, Boolean& firstTime), which FsPickler calls in CompositePickler`1.Write(WriteState state, String tag, T value) in CompositePickler.fs on line 189.

It turns out this is a common problem in .NET, due to the internal use of ObjectIDGenerator to look for cycles. However, as this is a straight pull from a SQL database, I can guarantee there are no cycles in this case.
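(Editor's note: a minimal sketch of the caching behaviour at fault here, not code from the thread. System.Runtime.Serialization.ObjectIDGenerator remembers every object it is handed, so its internal table grows with the total number of serialized objects:)

```fsharp
open System.Runtime.Serialization

// ObjectIDGenerator hands out a unique id per object and remembers every
// object it has seen in an internal hash table -- one entry per serialized
// object, which is what grows without bound during a large serialization.
let gen = ObjectIDGenerator()
let mutable firstTime = false

let s = "some record field"
let id1 = gen.GetId(s, &firstTime)   // firstTime = true: new entry added
let id2 = gen.GetId(s, &firstTime)   // firstTime = false: same id, already cached
```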

On a side note, ObjectIDGenerator behaves poorly in the failure case: it does not release the memory held by its table, leaving a big chunk of RAM in use.

eiriktsarpalis commented 9 years ago

Hi Rick,

I had a look at this issue a while back but gave up on it soon because of the difficulty of the undertaking. At the moment caching is too ingrained in the library logic to remove without substantial refactoring. I'll come back to this as soon as I find the time.

In the meantime, here are some alternative courses of action:

let fsp = FsPickler.CreateBinary()
let chunkSize = 100000

// Serialize the input in fixed-size chunks; each Serialize call uses a
// fresh object cache, so the cache never grows past chunkSize elements.
// Returns the number of chunks written.
let serialize (inputs : seq<'T>) (target : Stream) =
    let count = ref 0
    for chunk in Seq.chunkBySize chunkSize inputs do
        incr count
        fsp.Serialize(target, chunk, leaveOpen = true)
    !count

// Read back the given number of chunks, flattening them into one sequence.
let deserialize count (source : Stream) = seq {
    for _ in 1 .. count do
        yield! fsp.Deserialize<'T[]>(source, leaveOpen = true)
}

Thoughts?

eiriktsarpalis commented 9 years ago

Btw, have you tried using the .SerializeSequence methods? I would be interested to see how they behave in your case.
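(Editor's note: a sketch of what this would look like, assuming the API shape used later in this thread, where SerializeSequence takes a stream and a sequence; the `writeAll`/`readAll` names are illustrative, not part of the library.)

```fsharp
open System.IO
open Nessos.FsPickler

let fsp = FsPickler.CreateBinary()

// SerializeSequence writes the elements one at a time rather than as a
// single object graph, returning the number of elements written.
let writeAll (items : seq<'T>) (stream : Stream) : int =
    fsp.SerializeSequence(stream, items, leaveOpen = true)

// The matching reader lazily yields the elements back; the stream must
// stay open while the sequence is being enumerated.
let readAll<'T> (stream : Stream) : seq<'T> =
    fsp.DeserializeSequence<'T>(stream, leaveOpen = true)
```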

Rickasaurus commented 9 years ago

SerializeSequence gives exactly the same error.

eiriktsarpalis commented 9 years ago

I added a few adjustments as of yesterday (nuget version >= 1.0.9). Is this behaviour still occurring there?


eiriktsarpalis commented 9 years ago

w.r.t. SerializeSequence


Rickasaurus commented 9 years ago

I'm trying the chunked version now, but it seems to have stopped writing (or gotten extremely slow) after it wrote about 900MB (the same place where the non-chunked version gives exceptions).

It's going to take some time before I can move a new build into our locked-down environment to test. Maybe a tool to autogenerate some random data would be worth it here.

Rickasaurus commented 9 years ago

I'm convinced that the chunked version is somehow just spinning its wheels now. It's been using a full core for 10 minutes and still hasn't written anything more out. The other possibility is that it's catching a lot of exceptions internally somewhere.

eiriktsarpalis commented 9 years ago

Good point; there should probably be a test for this, writing large sequences to /dev/null or something.

By the way, what is the element type of your data set? String? F# ADT? class? struct?


Rickasaurus commented 9 years ago

It's a 2-3 level deep record tree with some structs, char arrays and strings in it.

Rickasaurus commented 9 years ago

You could probably reproduce this with just a one-member class with random strings, though. It seems to be all about the number of objects.
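(Editor's note: a repro along those lines might look like this. The `Item` type and helper names are hypothetical; the data is generated lazily, so the sequence itself stays small and any memory growth comes from the serializer's object cache.)

```fsharp
// Hypothetical repro type: a single field holding a random string.
type Item = { Payload : string }

let rand = System.Random(42)

// A random lowercase string of length n.
let randomString n =
    System.String(Array.init n (fun _ -> char (rand.Next(int 'a', int 'z' + 1))))

// Lazily generate `count` distinct objects to feed to the serializer.
let testData count =
    seq { for _ in 1 .. count -> { Payload = randomString 32 } }
```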

Rickasaurus commented 9 years ago

Oh, one thing I didn't mention before (it probably doesn't matter, since the issue is in the generic code): I've been using the binary serializer.

Rickasaurus commented 9 years ago

It looks like the chunked method is working (it just spat out another 900MB), but it seems to take increasingly more time per record as it goes on.

eiriktsarpalis commented 9 years ago

Ok, so here's what I tried out:

type Tree<'T> = Leaf | Branch of 'T * Tree<'T> * Tree<'T>

let rec mkTree (f : int -> 'T) n =
    if n = 0 then Leaf
    else Branch(f n, (mkTree f (n - 1)) , (mkTree f (n - 1)))

let large N = seq { for i in 1 .. N -> mkTree (fun i -> "textfield" + string i) 3 } 

open System.IO
let fsp = FsPickler.CreateBinary()

// test 1 : quickly ate up all my memory
let eagerSeqPickler = Pickler.seq Pickler.auto<Tree<string>>
fsp.Serialize(eagerSeqPickler, Stream.Null, large 30000000)

// test 2 : time scales proportionally to input size, memory usage remains constant.
// Real: 00:08:22.967, CPU: 00:08:55.718, GC gen0: 5620, gen1: 2384, gen2: 792
fsp.SerializeSequence(Stream.Null, large 30000000)

The tests were run on my machine (a Windows 8 VM on a Core i5 laptop with 4 GB of RAM).

I think the problem here is clearly with ObjectIdGenerator.Rehash(), which becomes ridiculously expensive as the number of objects increases. Have a look at its implementation:

http://referencesource.microsoft.com/#mscorlib/system/runtime/serialization/objectidgenerator.cs,145

I think this would explain both the devouring of memory and the intermittent stalls in IO.

Rickasaurus commented 9 years ago

Ahh, it all makes sense now. Certainly it would be ideal to have something improved, but I understand that can be a lot of work. I'd be pretty happy with just a way to bypass it.

eiriktsarpalis commented 9 years ago

SerializeSequence should be a safe bet. I actually just pushed a package update (1.0.11) that fine tunes performance with respect to ObjectIdGenerator after the benchmarks I just ran.

Rickasaurus commented 9 years ago

Awesome! I'll submit it for scanning and hopefully I'll have it in our environment in a week or so.
