Open ilyasse05 opened 1 year ago
@ilyasse05 - Thanks for opening the issue, would you like to contribute?
I think initially we could persist only the rows that were discarded, the rejection path could be a delta table with generic columns, something like this,
columns:
@MrPowers - Thoughts on this?
@brayanjuls why not but i am not expert in scala langage.
For columns, i think for performance, we have to keep the same columns from table source + technical column, it will be helpful if we need to recycle discarded rows and also add partition column like the target delta table.
@ilyasse05 - Sorry for taking too long to response. My inicial proposal was to keep the columns of the source table as a Struct and have a single dead letter table for all the tables, but rethinking it again It would be hard to mantain multiple schemas in a single table that will be intended for analysis and reprocessing, so we could have one "dead letter table" per target table.
Example, if the target table have the following schema
root
|-- pkey: integer (nullable = true)
|-- attr1: string (nullable = true)
|-- attr2: string (nullable = true)
|-- is_current: boolean (nullable = true)
|-- effective_time: timestamp (nullable = true)
|-- end_time: timestamp (nullable = true)
Then the dead letter table will have the following schema(the prefix dl in the columns stands for dead letter),
root
|-- pkey: integer (nullable = true)
|-- attr1: string (nullable = true)
|-- attr2: string (nullable = true)
|-- is_current: boolean (nullable = true)
|-- effective_time: timestamp (nullable = true)
|-- end_time: timestamp (nullable = true)
|-- dl_function_name: string (nullable = false)
|-- dl_origin_table: string (nullable = false)
|-- dl_timestamp: timestamp (nullable = false)
@brayanjuls yes that exactly what we need to do, for the path we create the same table, with the same columns + technical columns like you suggested.
@ilyasse05 - do you want to send a pull request ?
I think it will be interesting to add optionnal parameter ["PathRejects"], to write deduplicated rows, if we need to do some analyse of DataQuality when we have DuplicatedRow from source.
And also return count of rows inserted, Updates, rejected.
Originally posted by @ilyasse05 in https://github.com/MrPowers/jodie/issues/47#issuecomment-1430876570