tanussingh / Big-Data-Management-Analytics-Project

Final Project for CS 6350.001 - Large Scale Data Collection and preprocessing in Spark

Make overall structure for deduplication algorithm #10

Open ishansharma opened 5 years ago

ishansharma commented 5 years ago

Issues to figure out:

  1. What to compare on?
  2. Order of comparison? Right now, we plan to look at NER first, then UDPipe, and then doc2vec vector similarity with Jaccard/cosine similarity.
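
To make the proposed ordering concrete, here is a rough sketch of how the three stages could be chained as successive filters. It is only an illustration under assumptions, not the project's implementation: the dictionary keys (`entities`, `lemmas`, `vector`), the function names, and all threshold values are hypothetical, and the NER, UDPipe, and doc2vec outputs are assumed to be precomputed elsewhere. Jaccard similarity is used for the set-like outputs (entities, lemmas) and cosine similarity for the doc2vec vectors.

```python
import math


def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as no match
    return dot / (norm_u * norm_v)


def is_duplicate(doc_a, doc_b,
                 entity_threshold=0.5,   # hypothetical value, needs tuning
                 lemma_threshold=0.5,    # hypothetical value, needs tuning
                 vector_threshold=0.9):  # hypothetical value, needs tuning
    """Compare two documents in the order NER -> UDPipe -> doc2vec.

    Each doc is assumed to be a dict with precomputed fields:
      "entities": named entities from the NER step
      "lemmas":   lemmas from the UDPipe step
      "vector":   doc2vec embedding
    """
    # Stage 1: named-entity overlap; reject early if too few entities are shared.
    if jaccard(doc_a["entities"], doc_b["entities"]) < entity_threshold:
        return False
    # Stage 2: lemma overlap from the UDPipe output.
    if jaccard(doc_a["lemmas"], doc_b["lemmas"]) < lemma_threshold:
        return False
    # Stage 3: doc2vec vector similarity decides the final answer.
    return cosine(doc_a["vector"], doc_b["vector"]) >= vector_threshold
```

Ordering the stages this way means a pair that fails an earlier check never reaches the later ones. Whether that ordering is actually the cheapest or most accurate still needs to be measured on our data.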

https://www.druva.com/blog/understanding-data-deduplication/

ishansharma commented 5 years ago

This may be helpful: https://github.com/rnowling/article-deduplication

ishansharma commented 5 years ago

Here's another one: https://towardsdatascience.com/deduplication-using-sparks-mllib-4a08f65e5ab9