Closed joelnitta closed 2 years ago
@joelnitta Thanks for opening this issue on fuzzy matching features in Nomer.
Currently, Nomer supports fuzzy matching through a web API integration with globalnames:
$ echo -e "\tHomo sapients" | nomer append globalnames
[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [globi-globalnames]
Homo sapients SIMILAR_TO ITIS:180092 Homo sapiens Species Animalia | Bilateria | Deuterostomia | Chordata | Vertebrata | Gnathostomata | Tetrapoda | Mammalia | Theria | Eutheria | Primates | Haplorrhini | Simiiformes | Hominoidea | Hominidae | Homininae | Homo | Homo sapiens ITIS:202423 | ITIS:914154 | ITIS:914156 | ITIS:158852 | ITIS:331030 | ITIS:914179 | ITIS:914181 | ITIS:179913 | ITIS:179916 | ITIS:179925 | ITIS:180089 | ITIS:943773 | ITIS:943778 | ITIS:943782 | ITIS:180090 | ITIS:943805 | ITIS:180091 | ITIS:180092 Kingdom | Subkingdom | Infrakingdom | Phylum | Subphylum | Infraphylum | Superclass | Class | Subclass | Infraclass | Order | Suborder | Infraorder | Superfamily | Family | Subfamily | Genus | Species http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=180092
Homo sapients SIMILAR_TO NCBI:9606 Homo sapiens species | Eukaryota | Opisthokonta | Metazoa | Eumetazoa | Bilateria | Deuterostomia | Chordata | Craniata | Vertebrata | Gnathostomata | Teleostomi | Euteleostomi | Sarcopterygii | Dipnotetrapodomorpha | Tetrapoda | Amniota | Mammalia | Theria | Eutheria | Boreoeutheria | Euarchontoglires | Primates | Haplorrhini | Simiiformes | Catarrhini | Hominoidea | Hominidae | Homininae | Homo | Homo sapiens NCBI:131567 | NCBI:2759 | NCBI:33154 | NCBI:33208 | NCBI:6072 | NCBI:33213 | NCBI:33511 | NCBI:7711 | NCBI:89593 | NCBI:7742 | NCBI:7776 | NCBI:117570 | NCBI:117571 | NCBI:8287 | NCBI:1338369 | NCBI:32523 | NCBI:32524 | NCBI:40674 | NCBI:32525 | NCBI:9347 | NCBI:1437010 | NCBI:314146 | NCBI:9443 | NCBI:376913 | NCBI:314293 | NCBI:9526 | NCBI:314295 | NCBI:9604 | NCBI:207598 | NCBI:9605 | NCBI:9606 | superkingdom | clade | kingdom | clade | clade | clade | phylum | subphylum | clade | clade | clade | clade | superclass | clade | clade | clade | class | clade | clade | clade | superorder | order | suborder | infraorder | parvorder | superfamily | family | subfamily | genus | specieshttps://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606
Homo sapients SIMILAR_TO IRMNG:10857762 Homo sapiens species Animalia | Chordata | Mammalia | Primates | Hominidae | Homo | Homo sapiens IRMNG:11 | IRMNG:148 | IRMNG:1310 | IRMNG:11338 | IRMNG:104701 | IRMNG:1035772 | IRMNG:10857762 kingdom | phylum | class | order | family | genus | species https://www.irmng.org/aphia.php?p=taxdetails&id=10857762
Homo sapients SIMILAR_TO WORMS:1455977 Homo sapiens Species Biota | Animalia | Chordata | Vertebrata | Gnathostomata | Tetrapoda | Mammalia | Primates | Hominidae | Homo | Homo sapiens WORMS:1 | WORMS:2 | WORMS:1821 | WORMS:146419 | WORMS:1828 | WORMS:1831 | WORMS:1837 | WORMS:1455974 | WORMS:1455975 | WORMS:1455976 | WORMS:1455977 | Kingdom | Phylum | Subphylum | Infraphylum | Superclass | Class | Order | Family | Genus | Species https://www.marinespecies.org/aphia.php?p=taxdetails&id=1455977
Homo sapients SIMILAR_TO GBIF:2436436 Homo sapiens species Animalia | Chordata | Mammalia | Primates | Hominidae | Homo | Homo sapiens GBIF:1 | GBIF:44 | GBIF:359 | GBIF:798 | GBIF:5483 | GBIF:2436435 | GBIF:2436436 kingdom | phylum | class | order | family | genus | species http://www.gbif.org/species/2436436
Homo sapients SIMILAR_TO OTT:770315 Homo sapiens species | | Eukaryota | Opisthokonta | Holozoa | Metazoa | Eumetazoa | Bilateria | Deuterostomia | Chordata | Craniata | Vertebrata | Gnathostomata | Teleostomi | Euteleostomi | Sarcopterygii | Dipnotetrapodomorpha | Tetrapoda | Amniota | Mammalia | Theria | Eutheria | Boreoeutheria | Euarchontoglires | Primates | Haplorrhini | Simiiformes | Catarrhini | Hominoidea | Hominidae | Homininae | Homo | Homo sapiens OTT:805080 | OTT:93302 | OTT:304358 | OTT:332573 | OTT:5246131 | OTT:691846 | OTT:641038 | OTT:117569 | OTT:147604 | OTT:125642 | OTT:947318 | OTT:801601 | OTT:278114 | OTT:114656 | OTT:114654 | OTT:458402 | OTT:4940726 | OTT:229562 | OTT:229560 | OTT:244265 | OTT:229558 | OTT:683263 | OTT:5334778 | OTT:392222 | OTT:913935 | OTT:702152 | OTT:386195 | OTT:842867 | OTT:386191 | OTT:770311 | OTT:312031 | OTT:770309 | OTT:770315 no rank | no rank | domain | no rank | no rank | kingdom | no rank | no rank | no rank | phylum | subphylum | subphylum | superclass | no rank | no rank | superclass | no rank | superclass | no rank | class | subclass | no rank | no rank | superorder | order | suborder | infraorder | parvorder | superfamily | family | subfamily | genus | specieshttps://tree.opentreeoflife.org/opentree/ottol@770315
And, as you probably suspect, Nomer's globalnames matcher relies on the globalnames resolver's web API.
Offline-enabled matchers like itis, ncbi, etc. currently do exact matches only.
And I've heard many others ask for fuzzy matching.
How do you imagine using a fuzzy matching functionality? Which fuzzy matching algorithms would you imagine using? How would you want to quantify the relation between the fuzzy match and the provided terms?
Curious to hear your thoughts! And, a detailed example of what you had in mind helps me better understand your desires.
Thanks for the quick reply, I'm glad you're interested in this topic.
How do you imagine using a fuzzy matching functionality?
(see example below)
Which fuzzy matching algorithms would you imagine using?
Something that is designed to work with scientific names, like gnames. It would be great if it could account for the typically lower variation in species name (Genus + specific epithet) as opposed to the much greater variation in taxonomic author name. For example, if the accepted name is Foogenus barspecies (Foo B.) Barbar, typical variations might include
- Foogenus barspecies Barbar (large variation in author name in terms of number of characters)
- Foogenus barspecis (Foo B.) Barbar (small variation in species name)
But not
- Foogenus bawhathetheckhappened (Foo B.) Barbar (large variation in species name, identical author)
How would you want to quantify the relation between the fuzzy match and the provided terms?
Some sort of string distance metric, eg, methods here? I haven't thought about that too much, TBH.
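To make "string distance metric" concrete, here is a minimal sketch of Levenshtein edit distance and a length-normalized similarity. It is only one possible metric and is not taken from Nomer or gnames; the function names are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to 0..1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("Homo sapients", "Homo sapiens"))
print(levenshtein("barspecies", "barspecis"))
print(levenshtein("barspecies", "bawhathetheckhappened"))
```

A cutoff on the normalized score (rather than on the raw distance) keeps short epithets from matching too eagerly, but any cutoff is a judgment call.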
I want to be able to query a set of names against a custom, local (cached) reference taxonomy that adheres to Darwin Core standards. I want to be able to match the query fuzzily, especially because of variation in taxonomic author names.
I am developing an R package to do this (taxastand), but rather than (re)write all of the fuzzy matching code in R, I am calling outside programs because others have already figured out how to do that. Currently, I am using taxon-tools, but I am looking for other options as well (hence looking into gnames).
Pseudocode (ref_taxonomy.csv would be a CSV file of reference taxonomic names following Darwin Core format, e.g., this):
match --input "/tGonocormus minutum Bosch" --ref "ref_taxonomy.csv"
query | matched_name | matched_status | match_type |
---|---|---|---|
Gonocormus minutum | Gonocormus minutus (Bl.) Bosch | synonym | fuzzy |
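As a rough illustration of the pseudocode above, here is a sketch in Python using only the standard library. The in-memory CSV is a toy stand-in for ref_taxonomy.csv, the `match` function and its cutoff are hypothetical, and `difflib` is a generic string matcher, not the algorithm used by gnames or taxon-tools. The column names `scientificName` and `taxonomicStatus` are Darwin Core terms.

```python
import csv
import difflib
import io

# Toy stand-in for ref_taxonomy.csv (assumption: Darwin Core column names).
REF_CSV = """taxonID,scientificName,taxonomicStatus
1,Gonocormus minutus (Bl.) Bosch,synonym
2,Trichomanes bifidum C. Presl,synonym
"""


def load_reference(text):
    """Read the reference taxonomy into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def match(query, ref_rows, cutoff=0.6):
    """Return the best exact or fuzzy match for query, as a dict
    with the same columns as the table above."""
    by_name = {row["scientificName"]: row for row in ref_rows}
    if query in by_name:
        hit, match_type = query, "exact"
    else:
        close = difflib.get_close_matches(query, list(by_name), n=1, cutoff=cutoff)
        if not close:
            return {"query": query, "matched_name": None,
                    "matched_status": None, "match_type": "none"}
        hit, match_type = close[0], "fuzzy"
    return {"query": query, "matched_name": hit,
            "matched_status": by_name[hit]["taxonomicStatus"],
            "match_type": match_type}


rows = load_reference(REF_CSV)
print(match("Gonocormus minutum Bosch", rows))
```

The cutoff of 0.6 is simply the `difflib` default; a real tool would need to tune it and, as discussed in this thread, treat author strings differently from the name itself.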
I hope that clarifies things somewhat!
Thanks for the detailed example.
I associate fuzzy matching with linking typos to known names... and I can see how taxonomic name parsing can be included in that.
Nomer incorporates an older version of @dimus' gnames parser. This is used in the globi-correct matcher.
E.g.,
echo -e "\tFoogenus barspecies Barbar, 1974" | nomer append globi-correct
...
Foogenus barspecies Barbar, 1974 SAME_AS Foogenus barspecies
and
echo -e "\tFoogenus barspecis (Foo B.) Barbar" | nomer append globi-correct
...
Foogenus barspecis (Foo B.) Barbar SAME_AS Foogenus barspecis
and
$ echo -e '\tFoogenus bawhathetheckhappened (Foo B.) Barbar' | nomer append globi-correct
Foogenus bawhathetheckhappened (Foo B.) Barbar SAME_AS Foogenus bawhathetheckhappened
After this attempt to extract the canonical name from some string, comparing becomes a little easier.
Note, however, that the relation "SAME_AS" should really be "HAS_CANONICAL_NAME" or similar. Also, note that there is no similarity metric based on overlapping elements in the name (same author, same genus, different epithet).
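One way such a per-element similarity metric could look is sketched below. This is not something Nomer does today: the whitespace-based `split_name` is a deliberately naive stand-in for a real name parser such as gnparser, and both function names are hypothetical.

```python
from difflib import SequenceMatcher


def split_name(name):
    """Naively split a binomial into (genus, epithet, authorship).
    Assumption: the first two whitespace-separated tokens are genus and epithet."""
    parts = name.split(maxsplit=2)
    parts += [""] * (3 - len(parts))  # pad when epithet or authorship is missing
    return tuple(parts)


def compare_parts(a, b):
    """Similarity ratio (0..1) for each name element separately."""
    labels = ("genus", "epithet", "authorship")
    return {label: round(SequenceMatcher(None, x, y).ratio(), 2)
            for label, x, y in zip(labels, split_name(a), split_name(b))}


accepted = "Foogenus barspecies (Foo B.) Barbar"
print(compare_parts(accepted, "Foogenus barspecis (Foo B.) Barbar"))
print(compare_parts(accepted, "Foogenus bawhathetheckhappened (Foo B.) Barbar"))
print(compare_parts(accepted, "Foogenus barspecies Barbar"))
```

Reporting the three scores separately leaves it to the caller to decide, for example, that a low authorship score is tolerable while a low epithet score is not.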
I'm wondering how you'd imagine extending the functionality to include the comparisons you hint at.
It looks like globi-correct is parsing the name (using gnparser, probably?) and returning the canonical form. As you say, that would be involved in fuzzy matching that takes into account scientific name parts, but parsing alone doesn’t solve what I’m trying to do.
Another way to say it is that I’m happy with the fuzzy matching capabilities of gnames. What I want to do is decouple the fuzzy matching from the database selection. Currently in gnames you are limited to choosing from existing online databases. I want to be able to provide a local, custom database (and assume the easiest way to do so is if it is DWC format). Sorry if that wasn’t clear from the example.
From my trial and error, fuzzy matching stops being useful at some point, generating too many false positives.
So my current approach to fuzzy matching is:
In general I use stemmed versions for all matching, and calculate edit distance after a fuzzy match is done.
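To illustrate what matching on stemmed versions can mean, here is a toy sketch. The suffix list is a small illustrative assumption and is not the stemmer gnames uses; the function names are hypothetical as well. The edit distance mentioned above would be computed afterwards, on the unstemmed strings.

```python
# Illustrative Latin-like endings only; NOT the stemmer used by gnames.
SUFFIXES = ("orum", "arum", "us", "um", "is", "ae", "a", "e", "i")


def stem(epithet):
    """Strip the longest listed suffix, keeping at least 3 characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if epithet.endswith(suffix) and len(epithet) - len(suffix) >= 3:
            return epithet[:-len(suffix)]
    return epithet


def stem_key(canonical):
    """Keep the genus as-is and stem every following epithet."""
    parts = canonical.split()
    if not parts:
        return ""
    return " ".join([parts[0]] + [stem(p) for p in parts[1:]])


print(stem_key("Gonocormus minutum"))
print(stem_key("Gonocormus minutus"))
```

Because both spellings reduce to the same key, a gender-ending mismatch like minutum/minutus can be caught by an exact lookup on stems before any fuzzy matching is attempted.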
Fuzzy matching is useful, because errors in names happen quite often during OCR or data input.
Nomer incorporates an older version of @dimus' gnames parser.
The Go version of gnparser has moved forward quite a bit compared to the old Scala version.
@jhpoelen, would you be adventurous enough to use the C library bindings of the Go parser to make a binding for Java via FFI, and release a Java version of it, like the existing Ruby and JavaScript versions?
I agree with @dimus that fuzzy matching will almost always end up with some amount of false positives, if the query set is large enough. Whatever cutoff you set will not work for everything. So it is important to check the fuzzily matched results.
To expand a bit more about why fuzzy matching is useful, here are some examples "in the wild". As I alluded to above, I think fuzzy matching is most important for matching author names, which can have a lot of variation. Author names are needed because the same Genus + specific epithet may refer to different entities according to different authors. For example,
Trichomanes bifidum:
- Trichomanes bifidum C. Presl is a synonym for Trichomanes idoneum C. V. Morton
- Trichomanes bifidum Willd is a synonym for Trichomanes rigidum Sw
So we need author names to correctly resolve taxa.
Author names can have a lot of variation, even in "standard" taxonomic databases. For example, the same name appears as:
So we need fuzzy matching (and/or some sort of scientific name-aware parsing) to match these.
If this is all outside the scope of nomer, that's fine. It wasn't entirely clear to me if fuzzy matching is something nomer is trying to do.
@joelnitta I wonder if https://apidoc.globalnames.org/gnmatcher might help you with matching 'local' datasets. See also https://github.com/gnames/gnmatcher/issues/39
@joelnitta, I would say authors are beyond the scope of fuzzy matching. The variety in authors is such that fuzzy matching would create a huge amount of false positives if we tried to tackle it on this level. I use modifications of an algorithm developed by @pleary for uBio to score results by author similarity and sort results by that score.
Also, the freshly developed faceted search https://apidoc.globalnames.org/gnames-beta might help in some hard situations with authors.
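The general idea of scoring candidates by author similarity and sorting on that score can be sketched as follows. This is not the uBio or gnames algorithm, whose details are not given in this thread; the tokenization, the prefix rule for abbreviations, and all function names are assumptions made for illustration.

```python
import re


def author_tokens(authorship):
    """Lowercased word tokens, dropping one-letter initials and 'ex'/'et'."""
    words = re.findall(r"[^\W\d_]+", authorship.lower())
    return [w for w in words if len(w) > 1 and w not in {"ex", "et"}]


def author_score(query_auth, ref_auth):
    """Share of author tokens that agree, treating an abbreviation
    as a match when one token is a prefix of the other (Bl. ~ Blume)."""
    q, r = author_tokens(query_auth), author_tokens(ref_auth)
    if not q or not r:
        return 0.0
    hits = sum(any(a.startswith(b) or b.startswith(a) for b in r) for a in q)
    return hits / max(len(q), len(r))


def rank_by_author(query_auth, candidates):
    """Sort candidate authorship strings, best score first."""
    return sorted(candidates, key=lambda c: author_score(query_auth, c), reverse=True)


print(author_score("(Blume) Bosch", "(Bl.) Bosch"))
print(author_score("Bosch", "(Bl.) Bosch"))
print(rank_by_author("C. Presl", ["Willd", "C. Presl"]))
```

Here the name itself is assumed to have been matched already; the author score is only used to order the candidates that share that name.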
@dimus gnames/gnmatcher#39 is essentially the same thing I want to do. Thanks for linking that issue. Sounds like it is a distant possibility, but nothing I should expect soon. I had a look at https://apidoc.globalnames.org/gnmatcher and https://apidoc.globalnames.org/gnames-beta but as they both apparently require use of online reference databases, I don't think they will help achieve what I'm trying to do.
RE: fuzzy matching of author names. I think the approach taken by taxon-tools, which applies "taxonomic logic" besides just pure string distances, can help get at this. So for the example I gave above, if Cephalomanes auriculatum (Blume) Bosch is the query and Cephalomanes auriculatum Bosch is the reference, it would match by auto_basio+ (identical after removing basionym author from query).
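A minimal sketch of that kind of rule, assuming the basionym author is whatever sits in parentheses: the code below is not taken from taxon-tools, and the labels it returns are hypothetical rather than taxon-tools match codes.

```python
import re


def strip_basionym_author(name):
    """Drop a parenthesized basionym author, e.g. ' (Blume)'."""
    return re.sub(r"\s*\([^)]*\)", "", name).strip()


def match_type(query, reference):
    """Classify the match with hypothetical labels."""
    if query == reference:
        return "exact"
    if strip_basionym_author(query) == reference:
        return "basionym_removed"
    return "no_match"


query = "Cephalomanes auriculatum (Blume) Bosch"
reference = "Cephalomanes auriculatum Bosch"
print(strip_basionym_author(query))
print(match_type(query, reference))
```

Note that a plain regular expression would also strip a subgenus written in parentheses, which is one reason a proper name parser is preferable to this shortcut.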
@joelnitta, maybe @camwebb's taxon-tools would cover your and @abubelinha's use case?
Another possible solution is to extract the matching algorithms of gnmatcher and gnames into some separate offline project called gndiff, for example :)
Something like gndiff names1.txt names2.txt > diff.txt with an optional online name verification
maybe @camwebb's taxon-tools would cover your and @abubelinha's use case?
Yes, that's why I'm using it, as I mentioned in gnames issue 22 :)
The reason I'm interested in gnames for this is that taxon-tools is designed based on the rules of botanical nomenclature, so it may not work as well as a general tool. Also, I think gnames might be faster (but I have not conducted any benchmarks). I would like to be able to offer the matching capabilities of either gnames or taxon-tools in my taxastand R package.
That said, the gndiff idea you propose sounds great. I don't understand enough about how gnames works, but if the fuzzy matching algorithm could be easily split off into an offline CLI, that would be perfect.
(@jhpoelen apologies if this discussion is drifting completely out of the scope of nomer... I hope it's useful for you though!)
As the conversation is mostly about gnames #22, I'll continue this part of the discussion there.
duplicate of related #78
@joelnitta thanks for being patient as we are figuring this out. @zedomel also requested a similar feature in #78 .
Sorry if this is documented somewhere and I'm missing it, but does nomer do fuzzy matching (like gnames)?
(for context, I was directed to nomer from this issue on gnames by @dimus)