idigbio-api-hackathon / dedup

Specimen dedup code
MIT License

genomic matching #5

Closed Bouteloua closed 8 years ago

Bouteloua commented 9 years ago

Soundex is a phonetic algorithm -> http://en.wikipedia.org/wiki/Soundex
I used the key collision method with Metaphone3 keying, which transforms tokens into a representation of how they are pronounced, so values that sound alike end up with the same key.
Example:
Parque nacional de gama
Parque nacional do iguacu
Parque nacional do itatiaia
Parque Nacional de Itatiaia
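To make the key collision idea concrete, here is a minimal sketch in Python. Metaphone3 is not in the Python standard library, so this swaps in classic Soundex (the algorithm linked above) as the keying function; it is not the Metaphone3 keying I actually ran. The names `soundex`, `phonetic_key`, and `cluster` are made up for illustration, and the Soundex here only handles ASCII letters.

```python
import re
from collections import defaultdict

# Standard Soundex digit groups (assumption: plain American Soundex,
# used here as a stand-in for Metaphone3).
_CODES = {
    ch: str(digit)
    for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1)
    for ch in letters
}


def soundex(word):
    """Return the 4-character Soundex code of a word (ASCII letters only)."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return ""
    out = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            out += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (out + "000")[:4]


def phonetic_key(value):
    """Key a whole string by the Soundex code of each of its tokens."""
    codes = (soundex(token) for token in value.split())
    return " ".join(code for code in codes if code)


def cluster(values):
    """Group values whose phonetic keys collide; drop singleton groups."""
    groups = defaultdict(list)
    for value in values:
        groups[phonetic_key(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


localities = [
    "Parque nacional de gama",
    "Parque nacional do iguacu",
    "Parque nacional do itatiaia",
    "Parque Nacional de Itatiaia",
]
clusters = cluster(localities)
print(clusters)
```

With Soundex keys, only the two Itatiaia spellings collide here, because "do" and "de" get the same code. A different keying function such as Metaphone3 may group things differently.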

Kolmogorov complexity -> http://en.wikipedia.org/wiki/Kolmogorov_complexity
Used to estimate 'similarity' between strings; it has been widely applied to comparing strings that originate from DNA sequencing.
Example:
Podocarpus National Park, Cajanuma at Casa de Pedesur
Podocarpus National Park, Cajanuma, at Casa de predesur
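Kolmogorov complexity itself is not computable, so in practice it is approximated with a real compressor. One common way to do that is the normalized compression distance (NCD). The sketch below assumes zlib as the compressor and the usual NCD formula; the compressor and the function names `compressed_size` and `ncd` are assumptions for illustration, not necessarily what the clustering tool uses internally.

```python
import zlib


def compressed_size(text):
    """Length in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x, y):
    """Normalized compression distance: lower means more similar.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size (here: zlib, an assumption).
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


a = "Podocarpus National Park, Cajanuma at Casa de Pedesur"
b = "Podocarpus National Park, Cajanuma, at Casa de predesur"
c = "Parque nacional do iguacu"

print(ncd(a, b))  # near-duplicates: expected to be the smaller value
print(ncd(a, c))  # unrelated locality: expected to be the larger value
```

On strings this short the compressor's fixed overhead makes the absolute numbers noisy, so it is probably safer to compare distances against each other than to rely on a fixed threshold.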

Levenshtein -> http://en.wikipedia.org/wiki/Levenshtein_distance
Nearest neighbor method, using edit distance as the distance function.
Example:
Salto Iguazu
Salto tguazu
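For reference, a small sketch of Levenshtein distance and a brute-force nearest neighbor lookup on top of it. This is the textbook dynamic programming version, kept to two rows of the table. The function names `levenshtein` and `nearest` are hypothetical, and a real dedup run over many records would likely need blocking or an optimized library instead of comparing every pair.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def nearest(query, candidates):
    """Return the candidate with the smallest edit distance to query."""
    return min(candidates, key=lambda candidate: levenshtein(query, candidate))


print(levenshtein("Salto Iguazu", "Salto tguazu"))
print(nearest("Salto tguazu", ["Salto Iguazu", "Parque nacional do iguacu"]))
```

The two example strings differ by a single substitution ("I" vs "t"), so a small distance threshold would be enough to pair them.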