Open zedomel opened 2 years ago
@zedomel thanks for sharing your use case and examples.
@dimus given your experience in matching names, I was wondering: do you have any tips on how to implement fast offline fuzzy matching systems for taxonomic names? Would Lucene do the job? Or would you rather go for something like Postgres, or something else? Are there already existing tools that index name relations and provide a "fuzzy" view on them?
And I am sure that @deepreef, @jar398, and others might have some additional ideas for fast, offline fuzzy taxonomic name matching.
I would DEFINITELY defer to @dimus on this! As far as I'm concerned, he's the world authority on taxonomic name fuzzy matching. But I agree -- having a suite of tools to use locally (offline) would be extremely valuable in some situations.
@jhpoelen I do use Lucene for fuzzy matching in https://resolver.globalnames.org.
However, these days I prefer using prefix tree (trie) automata directly: https://en.wikipedia.org/wiki/Trie. The algorithm is the same one Lucene uses (at least it was the last time I checked). It is simple enough to implement by hand, but there are also plenty of libraries around.
The related code in gnmatcher (the matching library I use for https://verifier.globalnames.org) is at https://github.com/gnames/gnmatcher/tree/master/io/trie
gnmatcher has an API: https://apidoc.globalnames.org/gnmatcher
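To make the trie idea above concrete, here is a minimal sketch of fuzzy lookup over a prefix tree. It is not gnmatcher's code and not the Levenshtein automaton construction that Lucene uses; it is the simpler technique of carrying one row of the edit-distance table down each trie branch and pruning a branch once every cell exceeds the limit. The names `TrieNode`, `build_trie`, and `fuzzy_search` are made up for this illustration.

```python
class TrieNode:
    """One node of a character trie; `word` is set on nodes that end a name."""
    __slots__ = ("children", "word")

    def __init__(self):
        self.children = {}
        self.word = None


def build_trie(names):
    """Insert every name into a trie and return the root node."""
    root = TrieNode()
    for name in names:
        node = root
        for ch in name:
            node = node.children.setdefault(ch, TrieNode())
        node.word = name
    return root


def fuzzy_search(root, query, max_dist=1):
    """Return (name, distance) pairs within max_dist edits of query."""
    results = []
    # Row 0 of the Levenshtein table: distance from the empty prefix.
    first_row = list(range(len(query) + 1))

    def walk(node, ch, prev_row):
        # Compute the next table row for the trie prefix extended by ch.
        row = [prev_row[0] + 1]
        for i in range(1, len(query) + 1):
            cost = 0 if query[i - 1] == ch else 1
            row.append(min(row[i - 1] + 1,          # insertion
                           prev_row[i] + 1,         # deletion
                           prev_row[i - 1] + cost)) # substitution or match
        if node.word is not None and row[-1] <= max_dist:
            results.append((node.word, row[-1]))
        # Prune: if no cell is within the limit, no descendant can match.
        if min(row) <= max_dist:
            for next_ch, child in node.children.items():
                walk(child, next_ch, row)

    for ch, child in root.children.items():
        walk(child, ch, first_row)
    return sorted(results, key=lambda r: (r[1], r[0]))
```

With a trie built from `["Homo sapiens", "Homo habilis", "Pan paniscus"]`, `fuzzy_search(trie, "Homo sapens")` returns `[("Homo sapiens", 1)]`. Because the whole index is an in-memory structure, this kind of lookup works offline.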
I use a stemmed version of names, because the suffixes of specific epithets are sometimes not stable, mostly because people get the gender of the epithet wrong and others fix it later. The stemming algorithm is in gnparser: https://github.com/gnames/gnparser/blob/master/ent/stemmer/stemmer.go
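As a rough illustration of why stemming helps here, the sketch below strips a few common Latin endings from the epithets so that gender variants collapse to the same key. This is not the gnparser stemmer: the suffix list, the minimum stem length, and the function names are assumptions chosen for the example only.

```python
# Illustrative Latin endings, ordered longest first so that "ibus" wins over "us".
# This list is an assumption for the sketch, not gnparser's actual rule set.
SUFFIXES = ("ibus", "ius", "ae", "am", "as", "em", "es", "ia", "is",
            "os", "um", "us", "a", "e", "i", "o", "u")

MIN_STEM_LEN = 2  # hypothetical guard against stripping very short epithets


def stem_epithet(word):
    """Strip the first matching suffix, keeping at least MIN_STEM_LEN letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


def stem_name(canonical):
    """Stem every epithet of a canonical name; the genus is left untouched."""
    parts = canonical.split()
    if not parts:
        return ""
    genus, epithets = parts[0], parts[1:]
    return " ".join([genus] + [stem_epithet(e.lower()) for e in epithets])
```

With this sketch, `stem_name("Aus albus")`, `stem_name("Aus alba")`, and `stem_name("Aus album")` all give `"Aus alb"`, so the three spellings can be indexed and looked up under one key.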
@djpmapleferryman a while ago analyzed fuzzy matching when he worked on https://bdj.pensoft.net/articles.php?id=8080
He found that an edit distance of more than 1 creates more trouble than it is worth (a lot of false positives), and, conveniently, tries are much slower with edit distance 2 than with 1. I do allow larger edit distances if the difference comes from suffixes.
I do not fuzzy match uninomials, because that also generates a lot of false positives.
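The rules above (edit distance of at most 1, more slack when only the ending differs, no fuzzy matching for uninomials) can be sketched as a single acceptance check. This is one possible reading, not how gnmatcher does it: the shared-prefix test for "difference comes from suffixes" and the constants `MAX_SUFFIX_LEN` and `MIN_SHARED_STEM` are assumptions for the example, and only the last epithet gets the suffix tolerance.

```python
MAX_SUFFIX_LEN = 3   # hypothetical: longest ending treated as "just a suffix"
MIN_SHARED_STEM = 3  # hypothetical: letters the last epithets must share


def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def differs_only_in_suffix(a, b):
    """True if the names are identical except for the ending of the last word."""
    *head_a, last_a = a.split()
    *head_b, last_b = b.split()
    if head_a != head_b:
        return False
    n = 0  # length of the common prefix of the last words
    while n < min(len(last_a), len(last_b)) and last_a[n] == last_b[n]:
        n += 1
    return (n >= MIN_SHARED_STEM
            and len(last_a) - n <= MAX_SUFFIX_LEN
            and len(last_b) - n <= MAX_SUFFIX_LEN)


def accept_match(query, candidate):
    """Decide whether candidate is an acceptable match for query."""
    if query == candidate:
        return True
    # Uninomials are matched exactly only; fuzzy hits are too noisy.
    if len(query.split()) < 2 or len(candidate.split()) < 2:
        return False
    if levenshtein(query, candidate) <= 1:
        return True
    # Larger distances are tolerated only when the ending is what differs.
    return differs_only_in_suffix(query, candidate)
```

For example, `accept_match("Homo sapens", "Homo sapiens")` is `True` (distance 1), `accept_match("Aus albus", "Aus alba")` is `True` (distance 2, but only the ending differs), and `accept_match("Hom", "Homo")` is `False` because uninomials are never fuzzy matched.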
Like @deepreef, I too recommend deferring to @dimus :) but Tony and Markus are both pretty approachable if you have questions for them. I agree about stemming; I've been using the stemming feature of gnparser with good results.
@deepreef and @jar398 I am now all warm and fuzzy :)
Hi @jhpoelen
As we discussed, the idea is to implement fuzzy matching in nomer.
Here is an example from GlobalNames:
echo -e "\tHomo sapens" | nomer append globalnames
Output:
While for offline matchers we get the following result:
echo -e "\tHomo sapens" | nomer append gbif-web
Note: the GBIF matcher can easily be extended to use the GBIF API match endpoint, which implements a fuzzy match. Maybe it could be an additional matcher, gbif-web-fuzzy. However, it is still an online matcher :-(
A reference which we might use: Taxamatch, an Algorithm for Near (‘Fuzzy’) Matching of Scientific Names in Taxonomic Databases
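For the GBIF match endpoint mentioned in the note above, a call could look roughly like the sketch below. It assumes the public endpoint at `https://api.gbif.org/v1/species/match` with its `name` and `verbose` query parameters; the function names are made up for this example, and this is not how nomer talks to GBIF. The response is expected to carry a `matchType` field (for example `FUZZY`), but check the GBIF API documentation before relying on that.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the GBIF species match endpoint.
GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"


def build_match_url(name, verbose=False):
    """Build the request URL for a (possibly misspelled) scientific name."""
    params = {"name": name}
    if verbose:
        params["verbose"] = "true"  # also ask for alternative candidates
    return GBIF_MATCH_URL + "?" + urlencode(params)


def gbif_match(name, timeout=10):
    """Call the endpoint and return the decoded JSON response (needs network)."""
    with urlopen(build_match_url(name), timeout=timeout) as response:
        return json.load(response)
```

`build_match_url("Homo sapens")` gives `https://api.gbif.org/v1/species/match?name=Homo+sapens`. Since `gbif_match` needs network access, it does not address the offline requirement; it only shows what a gbif-web-fuzzy style matcher would have to call.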
Some implementations of fuzzy matchers:
Let me know how you are planning to do that and I can help with the implementation.
Thanks.