jehugaleahsa / FlatFiles

Reads and writes CSV, fixed-length and other flat file formats with a focus on schema definition, configuration and speed.

Support adding custom calculated or static contextual columns in the schema definition #24

Closed misternichols closed 6 years ago

misternichols commented 6 years ago

Firstly, what a great library, well done!

Context

I'm presently using FlatFiles on the reading side of an ETL solution and consuming the data via IDataReader. I'm manually defining the columns in the Schema.

There are times when I need to add additional contextual information as extra columns into the output rows for later processing. (Examples: program job number, parent folder name of the text file, record number.) This information is not available within the text file data itself but is known to FlatFiles or the program orchestrating the processing. Since I'm using IDataReader there is no intermediary object or class that I can define additional context or properties on.

As far as I can tell, there isn't a clean way to add custom static or calculated columns and include them in the schema. Is that correct?

Possible Implementation

In my scenario it would be great to be able to define and add extra custom columns to the Schema that take an evaluation function (a bit like IColumnDefinition.Preprocessor). The evaluation function would let the program calculate and/or inject contextual data as a column in the output rows while the text file is being read. In a way, this is the opposite of IgnoredColumn.

My scenarios don't require this, but for additional utility the custom column values could be evaluated after all other normal column values are parsed, with the results passed through to the evaluation functions. A possible evaluation function signature: Func<object[], int, object>, where the object array is the parsed data (with nulls in the positions of custom columns) and the int is the current record count. The return value would be an object that requires no further parsing and should match IColumnDefinition.ColumnType. If that is too complicated, Func<string, int, object> would suffice, where the string is the raw record text.
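To make the idea concrete, here is a rough sketch of what such a column might look like. Everything in it (ComputedColumn, its Evaluator property) is invented for illustration and is not part of the current FlatFiles API:

```csharp
using System;

// Hypothetical sketch only; nothing here exists in FlatFiles today.
// A computed column carries a name and an evaluation function instead of
// parsing a value out of the text file.
public sealed class ComputedColumn
{
    public ComputedColumn(string columnName, Func<object[], int, object> evaluator)
    {
        ColumnName = columnName;
        Evaluator = evaluator;
    }

    public string ColumnName { get; }

    // Receives the parsed values for the record (with nulls in the positions
    // of computed columns) and the current record number; returns the value
    // that should appear in the output row.
    public Func<object[], int, object> Evaluator { get; }
}
```

Usage would then be something like `new ComputedColumn("JobNumber", (values, record) => jobNumber)` or `new ComputedColumn("SourceFolder", (values, record) => folderName)`, where jobNumber and folderName come from the orchestrating program.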

Would this be best implemented as a new IColumnDefinition class, similar to IgnoredColumn, or as something else? How should the writing side of things be handled? Most likely the best thing to do would be to exclude them from the output entirely.

Possible Workaround

In theory I could decorate the FlatFiles IDataReader with another IDataReader that injects the extra custom columns as necessary. This quickly gets messy though. Any cleaner ideas?

I hope I've made sense. In your eyes is this something that would add value to the library?

jehugaleahsa commented 6 years ago

Opposite of Ignored, indeed. I do see the value in being able to produce additional, computed values in the result set. Originally I was thinking you could create your own derived ColumnDefinition<T> class, but that exists to parse/format custom types, i.e., it is tied to a column in your file.

Are you using type mappers or just readers? With classes, you could always implement unmapped properties to perform calculations.
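For example, with a type mapper the computed value can live on the class and simply never be mapped. A rough sketch, assuming the usual SeparatedValueTypeMapper API (OrderRecord and the column names are made up):

```csharp
public class OrderRecord
{
    public string OrderId { get; set; }
    public decimal Amount { get; set; }

    // Never mapped to a file column; derived from the mapped values.
    public bool IsLargeOrder => Amount > 1000m;
}

var mapper = SeparatedValueTypeMapper.Define<OrderRecord>();
mapper.Property(o => o.OrderId).ColumnName("order_id");
mapper.Property(o => o.Amount).ColumnName("amount");
// IsLargeOrder is intentionally left unmapped, so it never touches the file.
```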

At the reader level, I'd ideally parse all the columns in a row and then provide you the opportunity to generate additional column values based on what was parsed. Such columns would only be useful when reading files, completely meaningless when writing files. Such computed values could be passed along to the type mapper layer and treated like any other column, oblivious to their source.

The index I'd pass to such a column could be the physical row in the file or the row number ignoring skipped rows. Something else to think about...

misternichols commented 6 years ago

My scenario requires the schema to be completely configurable, so classes are not the right tool for the job. I don't know the file format or output requirements at compile time, and my consumer objects expect an IDataReader as input. These are the reasons I headed down the readers route.

Agreed on writing files; the column values are derived, after all.

I did look into implementing this. There are a few roadblocks, and I thought it would be best to reach out for insight. That, and I'm cautious of over-complicating such an intuitive library with features that don't fit. One of the main roadblocks is here: https://github.com/jehugaleahsa/FlatFiles/blob/071743d3bfc2b28777fc47c6d962460225b9f5b1/src/FlatFiles/SeparatedValueReader.cs#L161-L164 The readers would need to be modified so they're aware that not all columns are sourced from the text file. As far as I can tell, there is also presently a restriction that the output column ordinals exactly match the input column ordinals, which is perfectly reasonable without these new requirements.

I think the root of the problem is that the input column schema and the output column schema are separate concerns.

The resulting FlatFiles IDataReader would still need to expose the output column schema, which contains all the columns, including the custom columns, in the correct order, etc.

The more I think about it, implementing an IDataReader decorator might actually make sense. The decorator would need to map the input columns from the decorated FlatFiles IDataReader to the appropriate output columns, as well as call the evaluation function for each custom computed column on demand.
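For illustration, the core of such a decorator might look like the sketch below. It is only a sketch: ComputedColumn is the hypothetical type from earlier in this thread, the computed columns are simply appended after the file columns, and the remaining IDataReader members would just forward to the wrapped reader.

```csharp
using System.Collections.Generic;
using System.Data;

// Partial sketch of the decorator idea; only the members relevant to the
// column mapping are shown. The interface is left commented out because
// the remaining IDataReader members (typed getters, GetOrdinal, Dispose,
// and so on) would simply delegate to the wrapped reader.
public sealed class ComputedColumnDataReader /* : IDataReader */
{
    private readonly IDataReader inner;
    private readonly IReadOnlyList<ComputedColumn> computedColumns;
    private int recordNumber;

    public ComputedColumnDataReader(IDataReader inner, IReadOnlyList<ComputedColumn> computedColumns)
    {
        this.inner = inner;
        this.computedColumns = computedColumns;
    }

    // Computed columns are appended after the columns from the file.
    public int FieldCount => inner.FieldCount + computedColumns.Count;

    public bool Read()
    {
        if (!inner.Read())
        {
            return false;
        }
        recordNumber++;
        return true;
    }

    public string GetName(int i) => i < inner.FieldCount
        ? inner.GetName(i)
        : computedColumns[i - inner.FieldCount].ColumnName;

    public object GetValue(int i)
    {
        if (i < inner.FieldCount)
        {
            return inner.GetValue(i);
        }
        var values = new object[inner.FieldCount];
        inner.GetValues(values);
        return computedColumns[i - inner.FieldCount].Evaluator(values, recordNumber);
    }
}
```

Interleaving computed columns between file columns, rather than appending them, would just mean keeping an explicit ordinal map inside the decorator.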

jehugaleahsa commented 6 years ago

I am going to consolidate this issue with #29. I should be providing a solution to map metadata in and out shortly.

jehugaleahsa commented 6 years ago

Sorry for the long delay, btw. I think I just needed time to digest what all needed to be done.