mrpowers-io / jodie

Delta lake and filesystem helper methods
MIT License

Possible deduplication solution that doesn't require a primary key #62

Open MrPowers opened 1 year ago

MrPowers commented 1 year ago
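import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val duplicates = df
  .select(<pk cols>)
  .withColumn("__file_path", col("_metadata.file_path"))
  .withColumn("__row_index", col("_metadata.row_index"))
  .withColumn(
    "rank",
    row_number().over(
      // Window is an object in Scala, so it is used without parentheses.
      Window
        .partitionBy(<pk cols>)
        .orderBy(<pk cols>)))
  // Rank 1 is the copy to keep; every higher rank is a duplicate.
  .filter("rank > 1")
  .drop("rank")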

And then:

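// merge is a DeltaTable method, so df here must be an io.delta.tables.DeltaTable
// for the same table rather than a DataFrame.
df.alias("old")
  .merge(
    duplicates.alias("new"),
    "old.<pk1> = new.<pk1> AND ... AND old.<pkn> = new.<pkn>" +
      " AND old._metadata.file_path = new.__file_path" +
      " AND old._metadata.row_index = new.__row_index")
  // The Scala merge builder spells the delete clause as whenMatched().delete().
  .whenMatched()
  .delete()
  .execute()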
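
To make the intent of the two snippets above easier to check without a cluster, here is a minimal plain-Scala sketch of the same idea: rank the rows inside each group of identical key values, then delete every row after the first by its physical location. Nothing in it comes from jodie or Delta Lake. `Row`, `DedupSketch`, `filePath` and `rowIndex` are hypothetical stand-ins for a table row and its `_metadata.file_path` / `_metadata.row_index` values. The sketch also orders each group by location rather than by the key columns, an assumption made here so that the surviving row is deterministic.

```scala
// Plain-Scala model of the approach; no Spark or Delta required.
// Row, filePath and rowIndex are hypothetical names used only for illustration.
final case class Row(id: Int, name: String, filePath: String, rowIndex: Long)

object DedupSketch {
  // Physical locations of every row after the first in each group of duplicates.
  def duplicateLocations(rows: Seq[Row]): Set[(String, Long)] =
    rows
      .groupBy(r => (r.id, r.name)) // plays the role of partitionBy(<pk cols>)
      .values
      .flatMap { group =>
        group
          .sortBy(r => (r.filePath, r.rowIndex)) // deterministic order inside the group
          .drop(1)                               // plays the role of filter("rank > 1")
          .map(r => (r.filePath, r.rowIndex))
      }
      .toSet

  // Remove exactly the rows at those locations, as the merge-and-delete step would.
  def dedup(rows: Seq[Row]): Seq[Row] = {
    val doomed = duplicateLocations(rows)
    rows.filterNot(r => doomed.contains((r.filePath, r.rowIndex)))
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Row(1, "a", "part-0.parquet", 0L),
      Row(1, "a", "part-0.parquet", 1L),
      Row(2, "b", "part-0.parquet", 2L),
      Row(1, "a", "part-1.parquet", 0L)
    )
    dedup(rows).foreach(println)
  }
}
```

Because the delete is keyed on file path and row index rather than on the key values alone, one copy of each duplicated row survives even when the copies are identical in every data column.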
brayanjuls commented 1 year ago

@MrPowers - When you say it does not require a primary key, do you mean that we can infer the primary key?