Open jeroenGF opened 6 years ago
Possibly add an extra key to indicate that article X is a duplicate.
Specific use case: something is scraped twice (by accident) and consequently added to the database twice. This needs to be corrected. However, two articles should only count as duplicates if they actually have the same timestamp (or a similar one? to be decided).
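A minimal sketch of what a "same or similar timestamp" check could look like, assuming timestamps are stored as ISO 8601 strings. The function name `timestamps_match` and the five-minute tolerance are hypothetical choices for illustration, not something inca defines:

```python
from datetime import datetime, timedelta


def timestamps_match(ts_a, ts_b, tolerance=timedelta(minutes=5)):
    """Return True if two ISO 8601 timestamps are at most `tolerance` apart.

    The tolerance is an assumed value; a tolerance of zero would require
    identical timestamps.
    """
    a = datetime.fromisoformat(ts_a)
    b = datetime.fromisoformat(ts_b)
    return abs(a - b) <= tolerance


# Two scrapes of the same article a few minutes apart count as a match,
# the same text published a day later does not.
close = timestamps_match("2018-05-01T10:00:00", "2018-05-01T10:03:00")
far = timestamps_match("2018-05-01T10:00:00", "2018-05-02T10:00:00")
```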
Probably better approach: see https://qbox.io/blog/minimizing-document-duplication-in-elasticsearch
Let Elasticsearch determine what the duplicates are and return them, maybe saving them to JSON. Then have an inca/Python function do something with the result (either deleting or flagging all but one document per duplicate group).
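As a rough illustration of the "keep one, flag the rest" step, here is a sketch that works on documents already fetched into Python. It does not use Elasticsearch or any inca API; the function name `find_duplicates` and the `_id` and `text` keys are assumptions for the example. It groups documents by a hash of their whitespace- and case-normalised text and produces a report that can be dumped to JSON:

```python
import hashlib
import json
from collections import defaultdict


def find_duplicates(docs, text_key="text"):
    """Group documents with identical normalised text.

    Returns one entry per duplicate group: the first ID seen is kept,
    all later IDs are listed as duplicates to delete or flag.
    """
    groups = defaultdict(list)
    for doc in docs:
        # Normalise case and whitespace so trivial differences do not matter.
        normalised = " ".join(doc[text_key].lower().split())
        digest = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        groups[digest].append(doc["_id"])
    return [
        {"keep": ids[0], "duplicates": ids[1:]}
        for ids in groups.values()
        if len(ids) > 1
    ]


docs = [
    {"_id": "a1", "text": "Prime minister resigns"},
    {"_id": "a2", "text": "Prime  minister resigns "},
    {"_id": "a3", "text": "Something else"},
]
report = find_duplicates(docs)
report_json = json.dumps(report)  # could be written to a file for review
```

Producing a report first, rather than deleting straight away, leaves room to inspect the groups before anything is removed.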
It seems that there is already a tool available that does this: https://github.com/deric/es-dedupe
If this runs on a text field, fielddata needs to be enabled (https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html), which does not seem advisable in general because fielddata can consume a lot of heap memory.
curl -X PUT "localhost:9200/my_index/_mapping/_doc" -H 'Content-Type: application/json' -d'
{
  "properties": {
    "my_field": {
      "type": "text",
      "fielddata": true
    }
  }
}
'
Anyhow, it seems a better solution to check out the existing dedupe tool than to reinvent the wheel.
There is already a function called "myinca.analysis.cosine_similarity", but it only reports similarity. We need a function that actually deduplicates documents, i.e. deletes documents when their cosine similarity reaches a certain threshold.
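To make the idea concrete, here is a self-contained sketch of threshold-based deduplication. It does not call "myinca.analysis.cosine_similarity" (its signature is not shown here); instead it uses a simple bag-of-words cosine as a stand-in. The names `cosine` and `ids_to_delete`, the `_id` and `text` keys, and the 0.9 threshold are all assumptions:

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Bag-of-words cosine similarity between two strings (stand-in measure)."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def ids_to_delete(docs, threshold=0.9):
    """Return IDs of documents that are near-duplicates of an earlier one.

    The first document of each near-duplicate cluster is kept; later
    documents at or above the threshold are marked for deletion.
    """
    kept, delete = [], []
    for doc in docs:
        if any(cosine(doc["text"], k["text"]) >= threshold for k in kept):
            delete.append(doc["_id"])
        else:
            kept.append(doc)
    return delete


docs = [
    {"_id": "a", "text": "the cat sat on the mat"},
    {"_id": "b", "text": "the cat sat on the mat today"},
    {"_id": "c", "text": "stock markets fell sharply"},
]
to_delete = ids_to_delete(docs)
```

This compares every document against every kept document, so it is quadratic in the worst case; for a large index, candidates would have to be narrowed down first (for example by date or source).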