mrpowers-io / levi

Delta Lake helper methods. No Spark dependency.
MIT License
22 stars 8 forks

added drop_duplicates and drop_duplicates_pkey + unit tests #27

Closed mrjsj closed 8 months ago

mrjsj commented 8 months ago

Closes #17 Closes #18

drop_duplicates simply keeps the first occurrence of each duplicate. We should consider whether the user should be allowed to pass a parameter to keep the first or the last occurrence, based on some sorting column.
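To make the proposal concrete, here is a rough sketch of what a first/last option could look like. It works on plain Python dicts rather than a Delta table, and the function name `dedupe_keep` and the parameters `order_by` and `keep` are hypothetical; none of them are part of this PR or of levi's API.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


def dedupe_keep(
    rows: Sequence[Row],
    duplication_columns: Sequence[str],
    order_by: str,
    keep: str = "first",
) -> List[Row]:
    """Keep one row per group of duplicates (hypothetical helper).

    Rows are duplicates when they agree on every column in
    duplication_columns. `keep` decides whether the row with the
    smallest ("first") or largest ("last") order_by value survives.
    """
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")

    # Sort so that the row we want to keep comes first within each group.
    ordered = sorted(rows, key=lambda r: r[order_by], reverse=(keep == "last"))

    seen = set()
    result: List[Row] = []
    for row in ordered:
        group = tuple(row[c] for c in duplication_columns)
        if group not in seen:
            seen.add(group)
            result.append(row)
    return result


rows = [
    {"id": 1, "name": "a", "ts": 1},
    {"id": 2, "name": "a", "ts": 3},
    {"id": 3, "name": "b", "ts": 2},
]
first_ids = sorted(r["id"] for r in dedupe_keep(rows, ["name"], "ts", keep="first"))
last_ids = sorted(r["id"] for r in dedupe_keep(rows, ["name"], "ts", keep="last"))
```

Defaulting `keep` to `"first"` would preserve the current behaviour for anyone who does not pass the new parameter.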

drop_duplicates_pkey is much the same. However, in the mack version it is always the duplicate with the lowest primary key that is kept. Again, we could let the user decide whether to keep the duplicate with the lowest or the highest primary key.
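A similar sketch for the primary-key variant, again on plain dicts and not on a Delta table. The name `dedupe_by_pkey` and the `keep` parameter with its `"lowest"`/`"highest"` values are assumptions for illustration only, not existing levi or mack API.

```python
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]


def dedupe_by_pkey(
    rows: Sequence[Row],
    primary_key: str,
    duplication_columns: Sequence[str],
    keep: str = "lowest",
) -> List[Row]:
    """Keep the row with the lowest or highest primary key per duplicate group.

    Hypothetical helper: rows are duplicates when they share the same values
    in duplication_columns; the primary key only breaks the tie.
    """
    if keep not in ("lowest", "highest"):
        raise ValueError("keep must be 'lowest' or 'highest'")

    survivors: Dict[Tuple[Any, ...], Row] = {}
    for row in rows:
        group = tuple(row[c] for c in duplication_columns)
        current = survivors.get(group)
        if current is None:
            survivors[group] = row
        elif keep == "lowest" and row[primary_key] < current[primary_key]:
            survivors[group] = row
        elif keep == "highest" and row[primary_key] > current[primary_key]:
            survivors[group] = row
    return list(survivors.values())


people = [
    {"id": 1, "first": "ann", "last": "lee"},
    {"id": 2, "first": "ann", "last": "lee"},
    {"id": 3, "first": "bob", "last": "ray"},
    {"id": 4, "first": "ann", "last": "lee"},
]
lowest_ids = sorted(r["id"] for r in dedupe_by_pkey(people, "id", ["first", "last"]))
highest_ids = sorted(
    r["id"] for r in dedupe_by_pkey(people, "id", ["first", "last"], keep="highest")
)
```

With `keep="lowest"` as the default, the result would match the behaviour described for the mack version.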