tspence / csharp-csv-reader

A lightweight, high performance, zero dependency, streaming CSV reading library for CSharp.
http://tedspence.com
Apache License 2.0

Very long lines don't parse #63

Closed wvdvegt closed 2 months ago

wvdvegt commented 1 year ago

I have a file with LabJs output consisting of 2 records: an id and a ~200K JSON object in string format. The latest CSVFile won't parse it and hangs when reading the lines using the iterator of CSVReader.

What I can see is that 64K is read from a line of ~200K, after which NeedsMoreText never returns true. So the parser never sees a line ending, won't return the line, and stalls the application using CSVFile.

wvdvegt commented 1 year ago

I noticed a BufferSize setting and would expect a warning when the above happens. Enlarging it so the line fits seems to do the job.

It's not very useful, however, as you never know the size of the longest line in advance.

It would be better either to read the line into a StringBuilder instead of a fixed-size char array, or better still, to make parsing work without buffering (though I guess that's a lot of work and bad for performance).

The BufferSize setting might still be handy as the chunk size appended to the StringBuilder: the smaller it is, the less memory overhead, but you trade that against performance.
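A minimal sketch of the StringBuilder idea above (a hypothetical helper, not part of CSVFile's API): read fixed-size chunks and append them until a line ending appears, so the chunk size only controls read granularity, never the maximum line length.

```csharp
using System.IO;
using System.Text;

static class LongLineReader
{
    // Hypothetical helper illustrating the suggestion above. Note that a
    // real implementation would need to keep any characters read past the
    // newline for the next call; this sketch simply discards them.
    public static string ReadLongLine(TextReader reader, int chunkSize = 4096)
    {
        var sb = new StringBuilder();
        var buffer = new char[chunkSize];
        int read;
        while ((read = reader.Read(buffer, 0, chunkSize)) > 0)
        {
            for (var i = 0; i < read; i++)
            {
                if (buffer[i] == '\n')
                {
                    // Found a terminator: return everything before it.
                    sb.Append(buffer, 0, i);
                    return sb.ToString().TrimEnd('\r');
                }
            }
            sb.Append(buffer, 0, read);
        }
        return sb.Length > 0 ? sb.ToString() : null;
    }
}
```

With this approach a ~200K line simply costs ~50 reads at the default chunk size, and no buffer size ever has to be guessed in advance.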

wvdvegt commented 1 year ago

After enlarging the buffer to 0.5 MB I see that after two lines the parser state machine has issues at lines 163-173 of CSVStateMachine.ParseChunk: p2, the position of the TextQualifier, keeps returning -1 and the method returns null (and _position starts ping-ponging/backtracking between the values 19 and 20 in my case).

I downgraded CSVFile to v3.1.1 and that version works (it has no BufferSize setting).

sofcal commented 11 months ago

I've just encountered similar behaviour today after we upgraded to 3.1.2. It looks like this change (https://github.com/tspence/csharp-csv-reader/commit/ff2b19174012a4ac838243c76ca0beb04a985e98) modified the way lines are read from the file: where it previously used ReadLineAsync, it now uses ReadBlockAsync with a specified buffer size.

Unfortunately, if the last line of the buffer is incomplete, the NeedsMoreText check always fails to determine that it's out of data, so the loop repeatedly calls ParseChunk on the same incomplete line, which returns null each time.

while (machine.State == CSVState.CanKeepGoing)
{
    var line = string.Empty;
    if (machine.NeedsMoreText() && !inStream.EndOfStream)
    {
        var readChars = await inStream.ReadBlockAsync(buffer, 0, bufferSize);
        line = new string(buffer, 0, readChars);
    }
    // If the chunk ends mid-line, ParseChunk returns null, and once no new
    // text is supplied the loop spins here forever.
    var row = machine.ParseChunk(line, inStream.EndOfStream);
    if (row != null)
    {
        yield return row;
    }
}

I don't have time to do a PR for this at the moment (so we've just downgraded), but I will if I get a chance. The subsequent call to ReadBlockAsync will also need to be modified, to ensure it doesn't discard the incomplete line it had from the previous read.

tspence commented 2 months ago

Not sure how I missed the notification for this issue, but thank you all for the detailed investigation and repro! I'll get on this right away.

tspence commented 2 months ago

Found the issue - if the rules of the CSV Settings did not allow the line to end, and the chunk finished reading, it would loop rather than ending without a final line.
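For reference, a guard against that loop might look something like the fragment below. This is a sketch only: the names follow the snippet quoted earlier in this thread, and the actual fix shipped in 3.2.0 may be implemented differently.

```csharp
while (machine.State == CSVState.CanKeepGoing)
{
    var line = string.Empty;
    if (machine.NeedsMoreText() && !inStream.EndOfStream)
    {
        var readChars = await inStream.ReadBlockAsync(buffer, 0, bufferSize);
        line = new string(buffer, 0, readChars);
    }
    var row = machine.ParseChunk(line, inStream.EndOfStream);
    // Sketch of the guard: if the stream is exhausted and the parser made
    // no progress, stop instead of calling ParseChunk on the same
    // incomplete text forever.
    if (row == null && inStream.EndOfStream)
    {
        yield break;
    }
    if (row != null)
    {
        yield return row;
    }
}
```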

tspence commented 2 months ago

I've shipped version 3.2.0 which includes a fix for this issue.

wvdvegt commented 2 months ago

Thanks for the fix, I'll upgrade csv-reader in my projects at the next change/release.