jehugaleahsa / FlatFiles

Reads and writes CSV, fixed-length and other flat file formats with a focus on schema definition, configuration and speed.
The Unlicense
357 stars 64 forks source link

Question: Validation mode? #36

Closed pbolduc closed 6 years ago

pbolduc commented 6 years ago

On a project coming up, we will be receiving multiple files from different sources. We need to validate these files before processing them. For example, ensure number fields have numbers and date/time fields have date/times.

Approaches:

  1. Just parse normally and handle RecordProcessingException. This would work but I would only get the first failed field in the first failed record. Attempts to continue to read will throw InvalidOperationException: An attempt was made to work with a parser after an invalid record was encountered.
  2. Second approach that comes to mind would be to perform two passes over the file. The first pass would use no schema with validate each string field for correctly formatted. The second pass would then use correctly strongly typed schema.

In my use case, I would want to fail the entire file. However, I can think other use case would be people wanting to skip invalid records.

What is your take on validating a flat file for a typed schema?

I will have to think how I would expect an API to work, ie separate SeparatedValueValidatorReader, or option to throw on parse error, add/avoid more responsibilties on the IReaderWithMetadata types.

jehugaleahsa commented 6 years ago

A nice feature would be to raise an event when column parsing or formatting fails, just like when a whole record fails. There'd be an IsHandled property as well as a Substitution property (with default (T) as the default).

Thoughts?

Originally, I was thinking you could use the Preprocessor on each colum definition, but that feels more like a hack.

pbolduc commented 6 years ago

I was looking at the changes you have committed. I like the direction you are taking.

jehugaleahsa commented 6 years ago

Yeah, unfortunately that includes some breaking changes, but I think it's good to rename the Error event to RecordError for clarity anyway. I also added similar events to the writer classes, since currently it looks like I am handling record-level errors by throwing ArgumentException. Yuck.

Benchmarks indicate the extra column events do not have any impact on performance when not used, which is good. I just need to create some additional tests and I think I will be happy.

I am debating doing another major release right after v3 came out. I might just make these 3.1.0 or something.

pbolduc commented 6 years ago

I think due to renaming Error to RecordError breaks semver, so 3.1 maybe not a great for consumers, however, there a few downloaders at present (less than 50 per version). It could be ok in the short term. From a consumer perspective, it would be good to know the maintainer(s) understand the impact of making break changes in minor version releases.

jehugaleahsa commented 6 years ago

With the IsNullable enhancements at the IColumnDefinition level, I think a v4 is justified. I'm still working on some of the naming (What do you call a method that returns a default value when a null is encountered for a non-nullable column?) but code-wise I think it is complete. I'll start taking a look at your PRs and slate a release once everything's in place.

pbolduc commented 6 years ago

A method that returns a default value when a null is encountered for a non-nullable column:

in Linq these methods are SingleOrDefault(), FirstOrDefault(), etc If these are public API points, this naming convention should feel natural for users.

The C# guide may provide some ideas default value expressions (C# programming guide)

jehugaleahsa commented 6 years ago

I guess I am thinking of it more like a DEFAULT constraint in a database:

CREATE TABLE Person
(
    -- ...
    CreatedOn DATETIME NULL DEFAULT(GETUTCDATE())
)

Then I just call the method to get the substitute GetDefaultValue or something like that. I have all the code working and am stuck coming up with a name. I'll figure it out.

pbolduc commented 6 years ago

A method that returns the default value from a column definition? i.e., IColumnDefinition.GetDefaultValue() ? Is there a way to set that on a per instance of the column definition? Different columns, I could see I would want to have different default values based on the column. the base default would be default(T).

    new StringColumn("FirstName") // default is string.Empty
    new StringColumn("Occupation", DefaultValue = "not specified");

Property or function for the default value?