
uvacw / inca


Deduplication #395

Open jeroenGF opened 6 years ago

jeroenGF commented 6 years ago

There is already a function called "myinca.analysis.cosine_similarity", but it only reports similarity. We need a function that actually deduplicates documents, i.e. deletes certain documents when the cosine similarity function reports a value above a certain threshold.

jeroenGF commented 6 years ago

Possibly add an extra key to indicate that article X is a duplicate, instead of deleting it.

jeroenGF commented 6 years ago

Specific use case: something is scraped twice (by accident) and consequently added to the database twice. This needs to be corrected. However, two articles should only count as accidental duplicates if they actually have the same (or a similar?) timestamp.
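One way this use case could look in code, as a rough sketch: treat a document as an accidental duplicate only if its text is identical to an earlier copy and the two timestamps lie within a tolerance. The key names (`_id`, `text`, `timestamp`), the function name and the five-minute default are assumptions; which timestamp is meant and what counts as "similar" is still open.

```python
import hashlib
from datetime import datetime, timedelta


def accidental_duplicates(docs, tolerance=timedelta(minutes=5)):
    """Return ids of docs whose text equals an earlier doc with a close timestamp."""
    last_kept = {}  # text hash -> timestamp of the most recent copy that was kept
    duplicate_ids = []
    for doc in sorted(docs, key=lambda d: d["timestamp"]):
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        previous = last_kept.get(digest)
        if previous is not None and doc["timestamp"] - previous <= tolerance:
            duplicate_ids.append(doc["_id"])
        else:
            # First copy, or same text but too far apart in time: keep it.
            last_kept[digest] = doc["timestamp"]
    return duplicate_ids
```

Setting `tolerance=timedelta(0)` restricts the check to exactly equal timestamps.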

jeroenGF commented 6 years ago

Probably better approach: see https://qbox.io/blog/minimizing-document-duplication-in-elasticsearch

Let Elasticsearch determine what the duplicates are and return them, maybe saving them to JSON. Then have an inca/Python function do something with the result (either deleting or flagging all copies except one).
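A sketch of how the two halves could fit together, assuming the duplicates are found with a `terms` aggregation using `min_doc_count: 2` plus a `top_hits` sub-aggregation. The field must be aggregatable (for example a keyword field); the function names, the aggregation names `duplicates` and `docs`, and the size limits are made up for illustration. The code only builds the request body and processes a response dict, so sending the request is left to whatever client inca uses.

```python
import json


def duplicate_query(field, max_groups=1000, max_copies=10):
    """Build a search body asking for values of `field` shared by 2+ documents."""
    return {
        "size": 0,  # only the aggregation is needed, not ordinary hits
        "aggs": {
            "duplicates": {
                "terms": {"field": field, "min_doc_count": 2, "size": max_groups},
                "aggs": {"docs": {"top_hits": {"size": max_copies, "_source": False}}},
            }
        },
    }


def ids_to_remove(response):
    """Keep the first hit of every duplicate group and return the ids of the rest."""
    redundant = []
    for bucket in response["aggregations"]["duplicates"]["buckets"]:
        hits = bucket["docs"]["hits"]["hits"]
        redundant.extend(hit["_id"] for hit in hits[1:])
    return redundant


def save_report(response, path):
    """Write the duplicate groups to a JSON file for later inspection."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(response["aggregations"]["duplicates"]["buckets"], handle, indent=2)
```

The list returned by `ids_to_remove` can then be passed to either a delete step or a flagging step. Groups with more than `max_copies` members are only partly covered, so the query may need to be repeated until no buckets come back.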

damian0604 commented 6 years ago

It seems that there is already a tool available that does this: https://github.com/deric/es-dedupe

If it is run on a text field, fielddata needs to be enabled for that field (https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html), which generally does not seem advisable:

curl -X PUT "localhost:9200/my_index/_mapping/_doc" -H 'Content-Type: application/json' -d'
{
  "properties": {
    "my_field": { 
      "type":     "text",
      "fielddata": true
    }
  }
}
'

Anyhow, it seems a better solution to check out the existing dedupe tool than to reinvent the wheel.