mrpowers-io / jodie

Delta lake and filesystem helper methods
MIT License
49 stars 11 forks source link

Add optionnal param "PathRejects" in appendWithoutDuplicates for ignored rows #49

Open ilyasse05 opened 1 year ago

ilyasse05 commented 1 year ago

I think it will be interesting to add optionnal parameter ["PathRejects"], to write deduplicated rows, if we need to do some analyse of DataQuality when we have DuplicatedRow from source.

And also return count of rows inserted, Updates, rejected.

Originally posted by @ilyasse05 in https://github.com/MrPowers/jodie/issues/47#issuecomment-1430876570

brayanjuls commented 1 year ago

@ilyasse05 - Thanks for opening the issue, would you like to contribute?

I think initially we could persist only the rows that were discarded, the rejection path could be a delta table with generic columns, something like this,

columns:

@MrPowers - Thoughts on this?

ilyasse05 commented 1 year ago

@brayanjuls why not but i am not expert in scala langage.

For columns, i think for performance, we have to keep the same columns from table source + technical column, it will be helpful if we need to recycle discarded rows and also add partition column like the target delta table.

brayanjuls commented 1 year ago

@ilyasse05 - Sorry for taking too long to response. My inicial proposal was to keep the columns of the source table as a Struct and have a single dead letter table for all the tables, but rethinking it again It would be hard to mantain multiple schemas in a single table that will be intended for analysis and reprocessing, so we could have one "dead letter table" per target table.

Example, if the target table have the following schema

root
 |-- pkey: integer (nullable = true)
 |-- attr1: string (nullable = true)
 |-- attr2: string (nullable = true)
 |-- is_current: boolean (nullable = true)
 |-- effective_time: timestamp (nullable = true)
 |-- end_time: timestamp (nullable = true)

Then the dead letter table will have the following schema(the prefix dl in the columns stands for dead letter),

root
 |-- pkey: integer (nullable = true)
 |-- attr1: string (nullable = true)
 |-- attr2: string (nullable = true)
 |-- is_current: boolean (nullable = true)
 |-- effective_time: timestamp (nullable = true)
 |-- end_time: timestamp (nullable = true)
 |-- dl_function_name: string (nullable = false)
 |-- dl_origin_table: string (nullable = false)
 |-- dl_timestamp: timestamp (nullable = false)
ilyasse05 commented 1 year ago

@brayanjuls yes that exactly what we need to do, for the path we create the same table, with the same columns + technical columns like you suggested.

brayanjuls commented 1 year ago

@ilyasse05 - do you want to send a pull request ?