globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0
18 stars 3 forks source link

Implementation of fuzzy name match #78

Open zedomel opened 2 years ago

zedomel commented 2 years ago

Hi @jhpoelen

As we talked the ideia is to implement a fuzzy match in nomer.

Here an example from GlobalNames echo -e "\tHomo sapens" | nomer append globalnames

Output:

Homo sapens SIMILAR_TO  ITIS:180092 Homo sapiens    Species     Animalia | Bilateria | Deuterostomia | Chordata | Vertebrata | Gnathostomata | Tetrapoda | Mammalia | Theria | Eutheria | Primates | Haplorrhini | Simiiformes | Hominoidea | Hominidae | Homininae | Homo | Homo sapiens   ITIS:202423 | ITIS:914154 | ITIS:914156 | ITIS:158852 | ITIS:331030 | ITIS:914179 | ITIS:914181 | ITIS:179913 | ITIS:179916 | ITIS:179925 | ITIS:180089 | ITIS:943773 | ITIS:943778 | ITIS:943782 | ITIS:180090 | ITIS:943805 | ITIS:180091 | ITIS:180092   Kingdom | Subkingdom | Infrakingdom | Phylum | Subphylum | Infraphylum | Superclass | Class | Subclass | Infraclass | Order | Suborder | Infraorder | Superfamily | Family | Subfamily | Genus | Species    http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=180092    
    Homo sapens SIMILAR_TO  NCBI:9606   Homo sapiens    species     | Eukaryota | Opisthokonta | Metazoa | Eumetazoa | Bilateria | Deuterostomia | Chordata | Craniata | Vertebrata | Gnathostomata | Teleostomi | Euteleostomi | Sarcopterygii | Dipnotetrapodomorpha | Tetrapoda | Amniota | Mammalia | Theria | Eutheria | Boreoeutheria | Euarchontoglires | Primates | Haplorrhini | Simiiformes | Catarrhini | Hominoidea | Hominidae | Homininae | Homo | Homo sapiens   NCBI:131567 | NCBI:2759 | NCBI:33154 | NCBI:33208 | NCBI:6072 | NCBI:33213 | NCBI:33511 | NCBI:7711 | NCBI:89593 | NCBI:7742 | NCBI:7776 | NCBI:117570 | NCBI:117571 | NCBI:8287 | NCBI:1338369 | NCBI:32523 | NCBI:32524 | NCBI:40674 | NCBI:32525 | NCBI:9347 | NCBI:1437010 | NCBI:314146 | NCBI:9443 | NCBI:376913 | NCBI:314293 | NCBI:9526 | NCBI:314295 | NCBI:9604 | NCBI:207598 | NCBI:9605 | NCBI:9606    | superkingdom | clade | kingdom | clade | clade | clade | phylum | subphylum | clade | clade | clade | clade | superclass | clade | clade | clade | class | clade | clade | clade | superorder | order | suborder | infraorder | parvorder | superfamily | family | subfamily | genus | species    https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606

While for offline line matchers we got as result: echo -e "\tHomo sapens" | nomer append gbif-web

Homo sapens NONE        Homo sapens

Note: GBIF matcher can be easily incremented to use the GBIF API match endpoint which implements a fuzzy match. Maybe it can be an additional matcher gbif-web-fuzzy. However, it is still an online matcher :-(

A reference which we might use: Taxamatch, an Algorithm for Near (‘Fuzzy’) Matching of Scientific Names in Taxonomic Databases

Some implementations of fuzzy matchers:

Let me know how you are planing to do that and I can help with the implementation.

thks.

jhpoelen commented 2 years ago

@zedomel thanks for sharing your use case and examples.

@dimus given your experience in matching names, I was wondering - do you think any tips on how to implement fast offline fuzzy matching systems for taxonomic names. Would Lucene do the job? Or would you rather go for something like Postgres, or other ? Are there already existing tools that index name relations and provide a "fuzzy" view on them?

jhpoelen commented 2 years ago

and I am sure that @deepreef @jar398 and others might have some additional ideas for fast, offline fuzzy taxonomic name matching.

deepreef commented 2 years ago

I would DEFINITELY defer to @dimus on this! As far as I'm concerned, he's the world authority on taxonomic name fuzzy matching. But I agree -- having a suite of tools to use locally (offline) would be extremely valuable in some situations.

dimus commented 2 years ago

@jhpoelen I do use Lucene for fuzzy matching in https://resolver.globalnames.org

However these days I prefer using prefix trees automata directly https://en.wikipedia.org/wiki/Trie. The algorithm is the same as Lucene uses (at least from the last time I checked). The algortithm is simple enough to implement it by hand, but there are also enough libraries around.

Related code in gnmatcher (matching library I use for https://verifier.globalnames.org ) is https://github.com/gnames/gnmatcher/tree/master/io/trie

GNmatcher has API: https://apidoc.globalnames.org/gnmatcher

dimus commented 2 years ago

I use stemmed version of names, because sometimes suffixes of specific epithets are not stable, mostly because people make a mistake with gender of the epithet and others do fix it later. The stemmed algorithm is in gnparser: https://github.com/gnames/gnparser/blob/master/ent/stemmer/stemmer.go

dimus commented 2 years ago

@djpmapleferryman a while ago analyzed fuzzy matching when he worked on https://bdj.pensoft.net/articles.php?id=8080

He found that edit distance more than 1 creates more trouble than it is worth (a lot of false positives), and, conveniently, tries are much slower with edit distance 2 than with 1. I do allow edit distances bigger, if the difference comes from suffixes.

dimus commented 2 years ago

I do not fuzzy match uninomials, because it also generates a lot of false positives

jar398 commented 2 years ago

Like @deepreef I too recommend deferring to @dimus :) but Tony and Markus are both pretty approachable if you have questions for them. I agree about stemming; I've been using the stemming feature of gnparse with good results.

dimus commented 2 years ago

@deepreef and @jar398 I am now all warm and fuzzy :)