stephan-tolksdorf / fparsec

A parser combinator library for F#

Trying to parse a 1.7GB text file throws ArgumentOutOfRangeException #48

Open atlemann opened 4 years ago

atlemann commented 4 years ago

Is there a limit on how big a file FParsec supports? From what I could find in the code, it reads the file in chunks, but I cannot work out which StringBuilder.Append call is failing.

System.ArgumentOutOfRangeException: The length cannot be greater than the capacity. (Parameter 'valueCount')
   at System.Text.StringBuilder.Append(Char* value, Int32 valueCount)
   at System.Text.StringBuilder.Append(Char[] value, Int32 startIndex, Int32 charCount)
   at FParsec.CharStream.StreamConstructorContinue(Stream stream, Boolean leaveOpen, Encoding encoding, Boolean detectEncodingFromByteOrderMarks, Int32 byteBufferLength)
   at FParsec.CharStream..ctor(String path, Encoding encoding, Boolean detectEncodingFromByteOrderMarks, Int32 byteBufferLength)
   at FParsec.CharStream..ctor(String path, Encoding encoding)
   at FParsec.CharStream`1..ctor(String path, Encoding encoding)
   at FParsec.CharParsers.runParserOnFile[a,u](FSharpFunc`2 parser, u ustate, String path, Encoding encoding)

File.ReadAllText on the same file throws System.OutOfMemoryException: Insufficient memory to continue the execution of the program, so I have to parse it in chunks.
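
One possible shape for parsing in chunks, as a minimal sketch only: it assumes the file is line-oriented and that each line can be parsed on its own, which the thread does not state. The record format, `recordParser`, `parseLines`, and `firstError` are hypothetical names made up for illustration; `run`, `sepBy`, `pfloat`, `pchar`, and `eof` are FParsec functions, and `File.ReadLines` is the lazy line enumerator from System.IO.

```fsharp
open System.IO
open FParsec

// Hypothetical record format (an assumption, not from the thread):
// each line holds comma-separated floating-point numbers, e.g. "1.5,2,3e4".
let recordParser : Parser<float list, unit> =
    sepBy pfloat (pchar ',') .>> eof

// File.ReadLines enumerates the file lazily, so only the current line is
// materialized as a string; the whole file is never loaded at once.
let parseLines (path: string) : seq<ParserResult<float list, unit>> =
    File.ReadLines path
    |> Seq.map (run recordParser)

// Returns the error message for the first line that fails to parse, if any.
let firstError (path: string) : string option =
    parseLines path
    |> Seq.tryPick (function
        | Failure (message, _, _) -> Some message
        | Success _ -> None)
```

Because each line is handed to `run` as a separate string, error positions are relative to that line rather than to the file, and a grammar whose constructs span multiple lines would need a different way of splitting the input.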

stephan-tolksdorf commented 4 years ago

The version of FParsec that is shipped in the FParsec NuGet package can't parse arbitrarily long streams; see http://www.quanttec.com/fparsec/download-and-installation.html#nuget-packages. The FParsec-Big-Data-Edition version can, but unfortunately it hasn't yet been ported to .NET Core.

atlemann commented 4 years ago

OK, thanks! Will it require a lot of code changes to make it target netstandard2.0, or is it more or less a matter of updating the project files? I could probably contribute that, although it seems Enrico may have done that job already?

stephan-tolksdorf commented 4 years ago

AFAIK, the biggest issue is that the encoding decoders in .NET Core are not serializable, which breaks the non-low-trust implementation of CharStream.