iigorr / pgn.net

Portable Game Notation (PGN) implementation in .NET

Reading large files #14

Closed mwo-dk closed 9 years ago

mwo-dk commented 9 years ago

Hi,

I just tried (crazily, maybe) to parse a really large database (http://www.top-5000.nl/pgn.htm, the million 2.2 base), and to no one's surprise I got an OutOfMemoryException. Is it possible to use FParsec differently, so that reading and parsing happen line by line or in some other lazier way for large data?

The exception is raised immediately from:

    member this.ReadFromStream(stream: System.IO.Stream) =
        let parserResult = runParserOnStream pDatabase () "pgn" stream System.Text.Encoding.UTF8

iigorr commented 9 years ago

Hi mwo-dk,

Thanks for reporting. Wow, 279 MB. When I started, I only had saving a few games in mind... ;) I'll have to take a closer look at this issue; we can surely do better.

Cheers,

Igor

mwo-dk commented 9 years ago

Wonderful :-) I'd love to use this utility. PGN/FEN parsing has held me back for years from writing some clients and servers that I'd love to build. BTW, I see in your code that you're using ILMerge. Have you seen this one: https://libz.codeplex.com/ ?

iigorr commented 9 years ago

Looks like the CharStream implementation from FParsec fails to handle large files. Even when creating a CharStream with a bufferSize of 1, constructing the stream from the million 2.2 base fails with an out-of-memory error.

    new CharStream<'a>(stream, true, System.Text.Encoding.UTF8, false, 1);

No parsing is even done here!

It seems the solution is to use the FParsec BigData-Edition (https://www.nuget.org/packages/fparsec-big-data-edition). I'm looking into it, but don't have much time. Please bear with me.

mwo-dk commented 9 years ago

Thanks, man. It's not a business-critical issue :-). I'm just curious, since I've been looking for something like this for a few years now.

iigorr commented 9 years ago

Fixed. Using the big-data version of FParsec helped. I have added methods that yield games as soon as they are read from the stream/file. See https://github.com/iigorr/pgn.net/commit/fd2ccea22ec61ffc7bcce9de77b8310f45ae531d

    // ilf.pgn.PgnReader
    public IEnumerable<Game> ReadGamesFromFile(string file)
    public IEnumerable<Game> ReadGamesFromStream(Stream stream)

With these methods I could parse 10,000 games in 2:23 minutes on my machine, which is around 70 games per second (10,000 games / 143 s). There is some room for optimization, but I think this is OK for now.
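
To illustrate the idea behind methods that yield games lazily, here is a minimal C# sketch using `yield return`. It is not the code from the commit: `LazyPgnSketch` and `ReadRawGames` are hypothetical names, the helper only splits raw text and does no PGN parsing, and it assumes every game begins with an `[Event ` tag line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class LazyPgnSketch
{
    // Hypothetical helper, not part of pgn.net: yields the raw text of one
    // game at a time instead of loading the whole file into memory.
    // Assumption: every game starts with an "[Event " tag line.
    public static IEnumerable<string> ReadRawGames(TextReader reader)
    {
        var current = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // A new "[Event " tag marks the start of the next game,
            // so hand the game collected so far to the caller first.
            if (line.StartsWith("[Event ", StringComparison.Ordinal) && current.Length > 0)
            {
                yield return current.ToString();
                current.Clear();
            }
            current.AppendLine(line);
        }

        // Emit the last game once the input is exhausted.
        if (current.Length > 0)
        {
            yield return current.ToString();
        }
    }

    public static void Main()
    {
        var pgn = "[Event \"A\"]\n\n1. e4 e5 *\n\n[Event \"B\"]\n\n1. d4 d5 *\n";
        using (var reader = new StringReader(pgn))
        {
            var count = 0;
            foreach (var game in ReadRawGames(reader))
            {
                count++;
                Console.WriteLine("Game " + count + ": " + game.Length + " characters");
            }
        }
    }
}
```

Because the iterator holds only the current game in memory, a caller can process or discard each game before the next one is read, which is what keeps memory use flat on very large files.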

iigorr commented 9 years ago

Feel free to try it out: