jehugaleahsa / FlatFiles

Reads and writes CSV, fixed-length and other flat file formats with a focus on schema definition, configuration and speed.
The Unlicense
357 stars 64 forks source link

Get current record as string in RecordError event? #71

Closed drmcclelland closed 3 years ago

drmcclelland commented 4 years ago

I see the PhysicalRecordNumber/LogicalRecordNumber properties are available on the RecordContext, but for logging I would like to output the actual bad records for troubleshooting. Is there a way to get the current record as a string in the RecordError event?

jehugaleahsa commented 4 years ago

I was playing around with this last night. I encountered several challenges/concerns which I will share list below that lead me to wonder if I should just leave this be for now.

First, let me say that I saw a potential workaround: there are RecordRead, RecordPartitioned and RecordParsed events that you can listen to, depending on whether it's a fixed-length format or delimited. Depending on where you are in the read process, they expose the full record text (I try not to refer to them as lines)(fixed-length only), the unprocessed substrings making up the record or the processed/parsed objects after the schema conversions take place. For delimited files, I never exposed the original string. Instead, you could string.join or use the writer to write out the individual string values; however, doing so may not result in the original string since I could apply quoting in different places, wipe out whitespace, etc.

In order to accurately capture the original string, I played around with adding a StringBuilder to the RetyReader class. This effectively doubles the memory usage of the parser. Most of the time this is insignificant and will have little impact on performance. Eventually, if I can use Memory objects and System.IO.Pipelines, I might be able to eliminate the overhead completely. This also means the delimited parser has to return two values whenever it reads a value - for now I am just returning a ValueTuple<string, string[]> where the first string the record text and the string array are the unprocessed column values. I need to see how this impacts performance, if at all.

If I added this information, I would most likely add it to the IRecordContext interface, which is where the PhysicalRecordNumber and LogicalRecordNumber live today. One thing that I noticed was that the events I mentioned above also expose the IRecordContext but also have properties for the same information. This might be fine - one would just be implemented in terms of the other.

During implementation, I found a couple places where I introduced subtle bugs, where the record text and values weren't cleared out and so would have appeared in the RecordProcessingException incorrectly in some cases. This made me realize I needed to slow down and think more carefully about where to capture and clear this information as it comes. I will probably take a fresh look at it tonight and reevaluate.

dbmadison commented 4 years ago

Thank you for looking at this. We too need the raw record in each the fixed and delimited cases. I did find that I could get the raw record out of the RecordRead event for the FixedWidth but not for the Delimited implementation. As you acknowledged, simply doing a string.join isn't sufficient and re-generating the record using the write doesn't provide the original; it generates a record that may match the original.

Do you have an update on how you're coming a solution for this?

Exposing it in the RecordContext would be great and it would work best, I believe. That way it is available directly when a parse error occurs. Currently, with getting it out of the RecordRead or RecordParsed event requires one to 'tuck it away for later' so that it can be used when checking for any errors in the errorDetailList (from your example code).

jehugaleahsa commented 4 years ago

Sorry, I was mostly waiting on some feedback and am otherwise just hesitant.

I have a working version of this on my machine. However, it makes me extremely nervous that it could include breaking changes or degrade performance in some unforeseen way (all my unit tests are green and my benchmarks seem happy, though). One option would be for me to create a separate branch and let you check it out. You could give me an idea of whether it seems to be behaving correctly and then I can merge it into master.

The other thing that makes me nervous is the duplication across IRecordContext and the event args. You could check out what that looks like and tell me what you think. Sound acceptable?

dbmadison commented 4 years ago

Sure... Happy to help with this.

jehugaleahsa commented 4 years ago

Please see: https://github.com/jehugaleahsa/FlatFiles/pull/72

dbmadison commented 4 years ago

Hi Travis,

Thank you for letting me review this release.

The short, It looks good, where this provides access to the raw record during the error handling events in both the delimited and fixed length cases.

What we are doing is saving all the bad records to an error file with comments as to what is detected wrong with them. So, in the error event handlers, I’m tucking the raw line away in a variable to be used in the while Read loop.

What would be nice would be to have access to the raw line when the reader.Read is successful where the reader.Current is accessible.

So have something like:

reader.Current

reader.CurrentRaw

Something like this below:

        // read the data values

        while (reader.Read())

        {

            if (errorDetailList.Any())

            {

                  …

                LineWithError = reader.CurrentRaw;

            }

              else

            {

                 …

                parsedRecord = (BhnHeader)reader.Current;

            }

            …               

        }

From: Travis Parks notifications@github.com Sent: Thursday, November 19, 2020 7:56 PM To: jehugaleahsa/FlatFiles FlatFiles@noreply.github.com Cc: dbmadison dmadison@equatekinteractive.com; Comment comment@noreply.github.com Subject: Re: [jehugaleahsa/FlatFiles] Get current record as string in RecordError event? (#71)

Please see: #72 https://github.com/jehugaleahsa/FlatFiles/pull/72

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/jehugaleahsa/FlatFiles/issues/71#issuecomment-730732630 , or unsubscribe https://github.com/notifications/unsubscribe-auth/ADHDEKCJWTCOMW7JB4OCW4LSQW5DVANCNFSM4S7J62GA .

dbmadison commented 4 years ago

Hey Travis,

I worked some more with this and found that I don’t need the ‘CurrentRaw’ in the look like I expressed below. I refactored my code and so having access to the raw record in the RecordError event works well.

-Thank you again. I’m looking forward to the update, then I can get rid of tucking away the raw record so I can use it in the RecordError event.

-Dave

From: David Madison dmadison@equatekinteractive.com Sent: Tuesday, December 1, 2020 9:09 PM To: 'jehugaleahsa/FlatFiles' reply@reply.github.com; 'jehugaleahsa/FlatFiles' FlatFiles@noreply.github.com Cc: 'Comment' comment@noreply.github.com Subject: RE: [jehugaleahsa/FlatFiles] Get current record as string in RecordError event? (#71)

Hi Travis,

Thank you for letting me review this release.

The short, It looks good, where this provides access to the raw record during the error handling events in both the delimited and fixed length cases.

What we are doing is saving all the bad records to an error file with comments as to what is detected wrong with them. So, in the error event handlers, I’m tucking the raw line away in a variable to be used in the while Read loop.

What would be nice would be to have access to the raw line when the reader.Read is successful where the reader.Current is accessible.

So have something like:

reader.Current

reader.CurrentRaw

Something like this below:

        // read the data values

        while (reader.Read())

        {

            if (errorDetailList.Any())

            {

                  …

                LineWithError = reader.CurrentRaw;

            }

              else

            {

                 …

                parsedRecord = (BhnHeader)reader.Current;

            }

            …               

        }

From: Travis Parks <notifications@github.com mailto:notifications@github.com > Sent: Thursday, November 19, 2020 7:56 PM To: jehugaleahsa/FlatFiles <FlatFiles@noreply.github.com mailto:FlatFiles@noreply.github.com > Cc: dbmadison <dmadison@equatekinteractive.com mailto:dmadison@equatekinteractive.com >; Comment <comment@noreply.github.com mailto:comment@noreply.github.com > Subject: Re: [jehugaleahsa/FlatFiles] Get current record as string in RecordError event? (#71)

Please see: #72 https://github.com/jehugaleahsa/FlatFiles/pull/72

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/jehugaleahsa/FlatFiles/issues/71#issuecomment-730732630 , or unsubscribe https://github.com/notifications/unsubscribe-auth/ADHDEKCJWTCOMW7JB4OCW4LSQW5DVANCNFSM4S7J62GA .

jehugaleahsa commented 4 years ago

v4.13.0 has been released to NuGet. Let me know if you run into any issues; otherwise, please close the PR. Thanks!