mrpowers-io / jodie

Delta lake and filesystem helper methods
MIT License

Duplication allowed in appendWithoutDuplicates when it comes in the input dataframe #47

Closed brayanjuls closed 1 year ago

brayanjuls commented 1 year ago

Duplicate rows still get appended when the duplication is within the input dataframe itself and the rows are not yet in the table. For example:

Let's say we have the following table:

+----+---------+---------+
|  id|firstname| lastname|
+----+---------+---------+
|   1|   Benito|  Jackson|
|   4|    Maria|     Pitt|
|   6|  Rosalia|     Pitt|
+----+---------+---------+

And we want to insert this new dataframe:

+----+---------+---------+
|  id|firstname| lastname|
+----+---------+---------+
|   3|     Jose| Travolta|
|   8|     Jose| Travolta|
+----+---------+---------+

Calling the function with the following parameters will not avoid duplication in the table:

DeltaHelpers.appendWithoutDuplicates(deltaTable = deltaTable, appendData = newDataDF, primaryKeysColumns = Seq("firstname", "lastname"))

The resulting table will be:

+----+---------+---------+
|  id|firstname| lastname|
+----+---------+---------+
|   1|   Benito|  Jackson|
|   4|    Maria|     Pitt|
|   6|  Rosalia|     Pitt|
|   3|     Jose| Travolta|
|   8|     Jose| Travolta|
+----+---------+---------+

We should also deduplicate the input dataframe on the primary key columns before trying to append the new data.
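As a rough sketch of that idea, the input could be deduplicated on the key columns with Spark's `dropDuplicates` before it is passed to `appendWithoutDuplicates`. The `dedupeByKeys` helper and the `DedupBeforeAppend` object are hypothetical names used only for illustration, not part of jodie, and this is not necessarily how the fix was implemented in the library.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DedupBeforeAppend {
  // Hypothetical helper (not part of jodie): keeps one row per key combination.
  // dropDuplicates does not guarantee which of the duplicate rows survives.
  def dedupeByKeys(df: DataFrame, keys: Seq[String]): DataFrame =
    df.dropDuplicates(keys)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("dedup-sketch")
      .getOrCreate()
    import spark.implicits._

    // The input dataframe from the example above: two rows sharing the same key.
    val newDataDF = Seq((3, "Jose", "Travolta"), (8, "Jose", "Travolta"))
      .toDF("id", "firstname", "lastname")

    val keys = Seq("firstname", "lastname")
    val dedupedDF = dedupeByKeys(newDataDF, keys)

    // Only one "Jose Travolta" row remains (either id 3 or id 8).
    dedupedDF.show()

    // The deduplicated dataframe would then be passed to the existing call:
    // DeltaHelpers.appendWithoutDuplicates(
    //   deltaTable = deltaTable, appendData = dedupedDF, primaryKeysColumns = keys)

    spark.stop()
  }
}
```

Because `dropDuplicates` keeps an arbitrary row per key, a caller who cares which row wins (for example the lowest `id`) would need an explicit ordering instead.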

ilyasse05 commented 1 year ago

I think it would be interesting to add an optional parameter ["PathRejects"] to write out the rows removed by deduplication, in case we need to do some data quality analysis when the source contains duplicated rows.

It could also return the counts of rows inserted, updated, and rejected.
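One possible shape for this, purely as a starting point for discussion: split the input into the rows that are kept and the rows rejected as duplicates, so the rejects can be written somewhere and both sides counted. Everything named here (`SplitResult`, `splitByKeys`, the `orderBy` column, `pathRejects`) is hypothetical and not an existing jodie API. The sketch only covers duplicates inside the input dataframe, not rows skipped because they already exist in the table.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object SplitDuplicates {
  // Hypothetical result type: rows to append and rows rejected as duplicates.
  final case class SplitResult(kept: DataFrame, rejected: DataFrame)

  // Ranks rows within each key group by the orderBy column; the first row of
  // each group is kept and the remaining rows are treated as rejects.
  def splitByKeys(df: DataFrame, keys: Seq[String], orderBy: String): SplitResult = {
    val w = Window.partitionBy(keys.map(col): _*).orderBy(col(orderBy))
    val ranked = df.withColumn("_rn", row_number().over(w))
    SplitResult(
      kept = ranked.filter(col("_rn") === 1).drop("_rn"),
      rejected = ranked.filter(col("_rn") > 1).drop("_rn")
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("rejects-sketch")
      .getOrCreate()
    import spark.implicits._

    val newDataDF = Seq((3, "Jose", "Travolta"), (8, "Jose", "Travolta"))
      .toDF("id", "firstname", "lastname")

    val result = splitByKeys(newDataDF, Seq("firstname", "lastname"), orderBy = "id")

    // Counts that a future version of the function could return to the caller.
    println(s"kept=${result.kept.count()} rejected=${result.rejected.count()}")

    // The rejects could then be persisted for data quality analysis, e.g.:
    // result.rejected.write.mode("append").parquet(pathRejects) // pathRejects is hypothetical

    spark.stop()
  }
}
```

Ordering by `id` is an assumption made so that the kept row is deterministic; any column that expresses which duplicate should win would work.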

brayanjuls commented 1 year ago

@ilyasse05 - That seems like a good feature to me. Please open a new issue so we can brainstorm there how to implement it.