svenvc / NeoCSV

NeoCSV is an elegant and efficient standalone Smalltalk framework to read and write CSV converting to or from Smalltalk objects.
MIT License
29 stars 12 forks source link

Skipping Lines without side effects #20

Open JoachimTuchel opened 3 years ago

JoachimTuchel commented 3 years ago

I am not sure if the change I am suggesting here is going beyond CSV Reading. I will explain anyways, since it is not a big change to NeoCSVReader but made my life a lot easier.

First, let me explain the problem: We need to read transactions from a number of sources that claim to export CSV data. In fact, most of the sources export a body of CSV data augmented with a few header lines and some also add a few lines to the end of the file. This is mostly information that displays nicely in Excel, and may or may not be CSV at all. Most of the time it is not, or it has another number of columns than the "payload CSV" (I'll stick to this term from now on, and I mean the real CSV data in the file.

In most cases, all we know is the number of header lines and the number of footer lines after the payload. The total number of lines in the file is of course variable. What sounds easy, is easy in many cases. Open a ReadStream, ignore the first X lines and start Parsing CSV ... until we are Y lines before the end of the file....

And this is where things get complicated. Skipping a line sounds easy, but in CSV, you may have quoted fields which contain a line break. So you cannot simply use a Stream's #nextLine. It would cut a line off at every crlf or lf or cr. This leads to a total mess.

NeoCSVReader has #readHeader, which handles this quoted stuff quite easily. But #readHeader has a side effect: it sets the fieldCoun based on the number of fields it found in a line.

So what I needed was a way to skip a line, respecting possible line breaks in quoted fields and do nothing else.

It turns out that is easy. I extracted just the reading part from #readHeader into its own method and I use it from #readHeader but also for my skipping purposes.

So I changed #readHeader to:

readHeader
    "Read a record, presumably a header and return the header field names.
    This should normally be called only at the beginning and only once.
    This sets the fieldCount (but fieldAccessors overrides fieldCount)."

    | names |
    names := self readNextLineAsArray.
    fieldCount := names size.
    ^names

by extracting the Array streamContents: part to its own method named #readNextLineAsArray:

readNextLineAsArray
    ^Array streamContents: 
            [:out | 
            [self atEnd or: [self readEndOfLine]] whileFalse: 
                    [out nextPut: self readField.
                    (self readSeparator and: [self atEnd or: [self peekEndOfLine]]) 
                        ifTrue: [out nextPut: emptyFieldValue]]]

So now I have a way to read any line in a Reader's stream, and I can use the contents or ignore them. There is just one more thing I had to do to make my case work: I needed a way to let NeoCSVReader reset the position of its Stream:

resetStream
        "go back to psotion 0 of the readStream, e.g. when parsing the Stream in a two-pass process"
    readStream reset.

These three changes to NeoCSVReader allowed me to parse an only partially conforming CSV file and skip the header and footer lines correctly. Here is some pseudo-code for this import job:

importFrom: stream
       "this is more or less pseudo code"

    reader := self csvReader on: stream.

    "Count the number of lines in the Stream, taking into account quoted fields with crlf inside 
    This is not a good solution for huge files, of course"

    noOfLines := 0.
    [reader atEnd] whileFalse: [
            reader readNextLineAsArray. "We don't care about the contents in this case" 
            noOfLines := noOfLines +1].
    reader resetStream.

    "Ignore the header lines"
    self noOfHeaderLines timesRepeat: [reader readNextLineAsArray].

    noOfPayloadLines := noOfLines - (self noOfHeaderLines ifNil: [0])- (self noOfFooterLines  ifNil: [0]).

    resultObjects := OrderedCollection new.

    [noOfPayloadLines timesRepeat: [resultObjects add: reader next]  ]
        on: Error
        do: [:ex |
            Error signal: 'CSV import failed with: ' , ex description.
            ex exitWith: nil].

What do you think about these changes? Do they make sense for more than my case? Would it make sense to include such an extension to NeoCSVReader or is this too much of noise around CSV parsing?

svenvc commented 3 years ago

Hi,

Thx for the detailed report. I am very busy for the rest of the month, but I will get back to this issue later.

Sven

svenvc commented 3 years ago

There is now the following commit:

https://github.com/svenvc/NeoCSV/commit/0acc2270b382f52533c478f2f1585341e390d4b5

which should address a couple of issues.

On 19 Jan 2021, at 18:27, JoachimTuchel @.***> wrote:

I am not sure if the change I am suggesting here is going beyond CSV Reading. I will explain anyways, since it is not a big change to NeoCSVReader but made my life a lot easier.

First, let me explain the problem: We need to read transactions from a number of sources that claim to export CSV data. In fact, most of the sources export a body of CSV data augmented with a few header lines and some also add a few lines to the end of the file. This is mostly information that displays nicely in Excel, and may or may not be CSV at all. Most of the time it is not, or it has another number of columns than the "payload CSV" (I'll stick to this term from now on, and I mean the real CSV data in the file.

In most cases, all we know is the number of header lines and the number of footer lines after the payload. The total number of lines in the file is of course variable. What sounds easy, is easy in many cases. Open a ReadStream, ignore the first X lines and start Parsing CSV ... until we are Y lines before the end of the file....

And this is where things get complicated. Skipping a line sounds easy, but in CSV, you may have quoted fields which contain a line break. So you cannot simply use a Stream's #nextLine. It would cut a line off at every crlf or lf or cr. This leads to a total mess.

NeoCSVReader has #readHeader, which handles this quoted stuff quite easily. But #readHeader has a side effect: it sets the fieldCoun based on the number of fields it found in a line.

So what I needed was a way to skip a line, respecting possible line breaks in quoted fields and do nothing else.

It turns out that is easy. I extracted just the reading part from #readHeader into its own method and I use it from #readHeader but also for my skipping purposes.

So I changed #readHeader to:

readHeader "Read a record, presumably a header and return the header field names. This should normally be called only at the beginning and only once. This sets the fieldCount (but fieldAccessors overrides fieldCount)."

| names | names := self readNextLineAsArray. fieldCount := names size. ^names

by extracting the Array streamContents: part to its own method named #readNextLineAsArray:

readNextLineAsArray ^Array streamContents: [:out | [self atEnd or: [self readEndOfLine]] whileFalse: [out nextPut: self readField. (self readSeparator and: [self atEnd or: [self peekEndOfLine]]) ifTrue: [out nextPut: emptyFieldValue]]]

So now I have a way to read any line in a Reader's stream, and I can use the contents or ignore them. There is just one more thing I had to do to make my case work: I needed a way to let NeoCSVReader reset the position of its Stream:

resetStream "go back to psotion 0 of the readStream, e.g. when parsing the Stream in a two-pass process" readStream reset.

These three changes to NeoCSVReader allowed me to parse an only partially conforming CSV file and skip the header and footer lines correctly. Here is some pseudo-code for this import job:

importFrom: stream "this is more or less pseudo code"

reader := self csvReader on: stream.

"Count the number of lines in the Stream, taking into account quoted fields with crlf inside This is not a good solution for huge files, of course"

noOfLines := 0. [reader atEnd] whileFalse: [ reader readNextLineAsArray. "We don't care about the contents in this case" anzahlZeilen := anzahlZeilen +1]. reader resetStream.

"Ignore the header lines" self noOfHeaderLines timesRepeat: [reader readNextLineAsArray].

noOfPayloadLines := noOfLines - (self noOfHeaderLines ifNil: [0])- (self noOfFooterLines ifNil: [0]).

resultObjects := OrderedCollection new.

[noOfPayloadLines timesRepeat: [umsatzWrapper add: reader next] ] on: Error do: [:ex | Error signal: 'CSV import failed with: ' , ex description. ex exitWith: nil].

What do you think about these changes? Do they make sense for more than my case? Would it make sense to include such an extension to NeoCSVReader or is this too much of noise around CSV parsing?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

-- Sven Van Caekenberghe - @.*** Beta Nine - software engineering - http://www.beta9.be

svenvc commented 3 years ago

There is now the following commit:

https://github.com/svenvc/NeoCSV/commit/0acc2270b382f52533c478f2f1585341e390d4b5

which should address a couple of issues.