globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0

fuzzy matching? #68

Closed joelnitta closed 2 years ago

joelnitta commented 2 years ago

Sorry if this is documented somewhere and I'm missing it, but does nomer do fuzzy matching (like gnames)?

(for context, I was directed to nomer from this issue on gnames by @dimus)

jhpoelen commented 2 years ago

@joelnitta Thanks for opening this issue on fuzzy matching features in Nomer.

Currently, Nomer supports fuzzy matching through a web API integration with globalnames:

$ echo -e "\tHomo sapients" | nomer append globalnames
[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [globi-globalnames]
    Homo sapients   SIMILAR_TO  ITIS:180092 Homo sapiens    Species     Animalia | Bilateria | Deuterostomia | Chordata | Vertebrata | Gnathostomata | Tetrapoda | Mammalia | Theria | Eutheria | Primates | Haplorrhini | Simiiformes | Hominoidea | Hominidae | Homininae | Homo | Homo sapiens   ITIS:202423 | ITIS:914154 | ITIS:914156 | ITIS:158852 | ITIS:331030 | ITIS:914179 | ITIS:914181 | ITIS:179913 | ITIS:179916 | ITIS:179925 | ITIS:180089 | ITIS:943773 | ITIS:943778 | ITIS:943782 | ITIS:180090 | ITIS:943805 | ITIS:180091 | ITIS:180092   Kingdom | Subkingdom | Infrakingdom | Phylum | Subphylum | Infraphylum | Superclass | Class | Subclass | Infraclass | Order | Suborder | Infraorder | Superfamily | Family | Subfamily | Genus | Species    http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=180092    
    Homo sapients   SIMILAR_TO  NCBI:9606   Homo sapiens    species     | Eukaryota | Opisthokonta | Metazoa | Eumetazoa | Bilateria | Deuterostomia | Chordata | Craniata | Vertebrata | Gnathostomata | Teleostomi | Euteleostomi | Sarcopterygii | Dipnotetrapodomorpha | Tetrapoda | Amniota | Mammalia | Theria | Eutheria | Boreoeutheria | Euarchontoglires | Primates | Haplorrhini | Simiiformes | Catarrhini | Hominoidea | Hominidae | Homininae | Homo | Homo sapiens   NCBI:131567 | NCBI:2759 | NCBI:33154 | NCBI:33208 | NCBI:6072 | NCBI:33213 | NCBI:33511 | NCBI:7711 | NCBI:89593 | NCBI:7742 | NCBI:7776 | NCBI:117570 | NCBI:117571 | NCBI:8287 | NCBI:1338369 | NCBI:32523 | NCBI:32524 | NCBI:40674 | NCBI:32525 | NCBI:9347 | NCBI:1437010 | NCBI:314146 | NCBI:9443 | NCBI:376913 | NCBI:314293 | NCBI:9526 | NCBI:314295 | NCBI:9604 | NCBI:207598 | NCBI:9605 | NCBI:9606    | superkingdom | clade | kingdom | clade | clade | clade | phylum | subphylum | clade | clade | clade | clade | superclass | clade | clade | clade | class | clade | clade | clade | superorder | order | suborder | infraorder | parvorder | superfamily | family | subfamily | genus | specieshttps://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606    
    Homo sapients   SIMILAR_TO  IRMNG:10857762  Homo sapiens    species     Animalia | Chordata | Mammalia | Primates | Hominidae | Homo | Homo sapiens IRMNG:11 | IRMNG:148 | IRMNG:1310 | IRMNG:11338 | IRMNG:104701 | IRMNG:1035772 | IRMNG:10857762 kingdom | phylum | class | order | family | genus | species https://www.irmng.org/aphia.php?p=taxdetails&id=10857762    
    Homo sapients   SIMILAR_TO  WORMS:1455977   Homo sapiens    Species     Biota | Animalia | Chordata | Vertebrata | Gnathostomata | Tetrapoda | Mammalia | Primates | Hominidae | Homo | Homo sapiens    WORMS:1 | WORMS:2 | WORMS:1821 | WORMS:146419 | WORMS:1828 | WORMS:1831 | WORMS:1837 | WORMS:1455974 | WORMS:1455975 | WORMS:1455976 | WORMS:1455977    | Kingdom | Phylum | Subphylum | Infraphylum | Superclass | Class | Order | Family | Genus | Species    https://www.marinespecies.org/aphia.php?p=taxdetails&id=1455977 
    Homo sapients   SIMILAR_TO  GBIF:2436436    Homo sapiens    species     Animalia | Chordata | Mammalia | Primates | Hominidae | Homo | Homo sapiens GBIF:1 | GBIF:44 | GBIF:359 | GBIF:798 | GBIF:5483 | GBIF:2436435 | GBIF:2436436    kingdom | phylum | class | order | family | genus | species http://www.gbif.org/species/2436436 
    Homo sapients   SIMILAR_TO  OTT:770315  Homo sapiens    species     |  | Eukaryota | Opisthokonta | Holozoa | Metazoa | Eumetazoa | Bilateria | Deuterostomia | Chordata | Craniata | Vertebrata | Gnathostomata | Teleostomi | Euteleostomi | Sarcopterygii | Dipnotetrapodomorpha | Tetrapoda | Amniota | Mammalia | Theria | Eutheria | Boreoeutheria | Euarchontoglires | Primates | Haplorrhini | Simiiformes | Catarrhini | Hominoidea | Hominidae | Homininae | Homo | Homo sapiens  OTT:805080 | OTT:93302 | OTT:304358 | OTT:332573 | OTT:5246131 | OTT:691846 | OTT:641038 | OTT:117569 | OTT:147604 | OTT:125642 | OTT:947318 | OTT:801601 | OTT:278114 | OTT:114656 | OTT:114654 | OTT:458402 | OTT:4940726 | OTT:229562 | OTT:229560 | OTT:244265 | OTT:229558 | OTT:683263 | OTT:5334778 | OTT:392222 | OTT:913935 | OTT:702152 | OTT:386195 | OTT:842867 | OTT:386191 | OTT:770311 | OTT:312031 | OTT:770309 | OTT:770315    no rank | no rank | domain | no rank | no rank | kingdom | no rank | no rank | no rank | phylum | subphylum | subphylum | superclass | no rank | no rank | superclass | no rank | superclass | no rank | class | subclass | no rank | no rank | superorder | order | suborder | infraorder | parvorder | superfamily | family | subfamily | genus | specieshttps://tree.opentreeoflife.org/opentree/ottol@770315    

And, as you probably suspect, Nomer's globalnames matcher relies on the globalnames resolver's web API.

Offline-enabled matchers such as itis and ncbi currently do exact matching only.

I've also heard many others ask for fuzzy matching.

How do you imagine using a fuzzy matching functionality? Which fuzzy matching algorithms would you imagine using? How would you want to quantify the relation between the fuzzy match and the provided terms?

Curious to hear your thoughts! A detailed example of what you have in mind would help me better understand what you need.

joelnitta commented 2 years ago

Thanks for the quick reply, I'm glad you're interested in this topic.

How do you imagine using a fuzzy matching functionality?

(see example below)

Which fuzzy matching algorithms would you imagine using?

Something that is designed to work with scientific names, like gnames. It would be great if it could account for the typically lower variation in the species name (genus + specific epithet) as opposed to the much greater variation in the taxonomic author name. For example, if the accepted name is Foogenus barspecies (Foo B.) Barbar, typical variations might include

How would you want to quantify the relation between the fuzzy match and the provided terms?

Some sort of string distance metric, e.g., methods here? I haven't thought about that too much, TBH.
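One common string distance metric is Levenshtein edit distance. The following is a minimal sketch of that metric only, not the method gnames or nomer uses; the function names `levenshtein` and `similarity` are chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to the range 0..1, where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("Homo sapients", "Homo sapiens"))
    print(round(similarity("Gonocormus minutum", "Gonocormus minutus"), 3))
```

A normalized score like this is one way to quantify the relation between a provided term and its fuzzy match, although it treats author strings and epithets identically.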

Detailed example

I want to be able to query a set of names against a custom, local (cached) reference taxonomy that adheres to Darwin Core standards. I want to be able to match the query fuzzily, especially because of variation in taxonomic author names.

I am developing an R package to do this (taxastand), but rather than (re)write all of the fuzzy matching code in R, I am calling outside programs because others have already figured out how to do that. Currently, I am using taxon-tools, but I am looking for other options as well (hence looking into gnames).

Pseudocode

match --input "\tGonocormus minutum Bosch" --ref "ref_taxonomy.csv"
query               matched_name                    matched_status  match_type
Gonocormus minutum  Gonocormus minutus (Bl.) Bosch  synonym         fuzzy
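The pseudocode above could be approximated in Python roughly as follows. This is only a sketch under assumptions: the reference is a CSV with the Darwin Core terms `scientificName` and `taxonomicStatus` as headers, the inline `REF_CSV` data is a made-up stand-in for `ref_taxonomy.csv`, and the generic `difflib` similarity ratio stands in for a scientific-name-aware matcher such as gnames or taxon-tools.

```python
import csv
import difflib
import io

# Made-up stand-in for ref_taxonomy.csv (Darwin Core term names as headers).
REF_CSV = """scientificName,taxonomicStatus
Gonocormus minutus (Bl.) Bosch,synonym
Foogenus barspecies (Foo B.) Barbar,accepted
"""


def load_reference(handle):
    """Read the reference taxonomy into a list of row dictionaries."""
    return list(csv.DictReader(handle))


def match(query, reference, cutoff=0.6):
    """Return the best reference row for query, trying exact matching first."""
    by_name = {row["scientificName"]: row for row in reference}
    if query in by_name:
        hit, match_type = query, "exact"
    else:
        # difflib ranks candidates by a generic similarity ratio (0..1).
        close = difflib.get_close_matches(query, list(by_name), n=1, cutoff=cutoff)
        if not close:
            return {"query": query, "matched_name": None,
                    "matched_status": None, "match_type": "none"}
        hit, match_type = close[0], "fuzzy"
    return {"query": query, "matched_name": hit,
            "matched_status": by_name[hit]["taxonomicStatus"],
            "match_type": match_type}


REFERENCE = load_reference(io.StringIO(REF_CSV))

if __name__ == "__main__":
    print(match("Gonocormus minutum Bosch", REFERENCE))
```

The cutoff of 0.6 is arbitrary; whatever value is chosen, the fuzzily matched rows would still need to be checked by hand.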

I hope that clarifies things somewhat!

jhpoelen commented 2 years ago

Thanks for the detailed example.

I associate fuzzy matching with mapping typos to known names... and I can see how taxonomic name parsing could be part of that.

Nomer incorporates an older version of @dimus's gnames parser. It is used in the globi-correct matcher.

E.g.,

$ echo -e "\tFoogenus barspecies Barbar, 1974" | nomer append globi-correct
...
    Foogenus barspecies Barbar, 1974    SAME_AS     Foogenus barspecies             

and

$ echo -e "\tFoogenus barspecis (Foo B.) Barbar" | nomer append globi-correct
...
    Foogenus barspecis (Foo B.) Barbar  SAME_AS     Foogenus barspecis              

and

$ echo -e '\tFoogenus bawhathetheckhappened (Foo B.) Barbar' | nomer append globi-correct
    Foogenus bawhathetheckhappened (Foo B.) Barbar  SAME_AS     Foogenus bawhathetheckhappened      

After this attempt to extract the canonical name from a string, comparing names becomes a little easier.

Note, however, that the relation "SAME_AS" should really be "HAS_CANONICAL_NAME" or similar. Also, note that there is no similarity metric based on overlapping elements in the name (same author, same genus, different epithet).

I'm wondering how you'd imagine extending the functionality to include the comparisons you hint at.
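To make the idea of a metric based on overlapping name elements concrete, here is a rough sketch. It is not something nomer does today. The `split_name` helper is a deliberately naive assumption (first word is the genus, second the epithet, the rest the authorship); a real implementation would rely on a proper name parser instead.

```python
from typing import NamedTuple


class NameParts(NamedTuple):
    genus: str
    epithet: str
    authorship: str


def split_name(name: str) -> NameParts:
    """Naive split: first word = genus, second = epithet, rest = authorship."""
    tokens = name.split()
    genus = tokens[0] if tokens else ""
    epithet = tokens[1] if len(tokens) > 1 else ""
    return NameParts(genus, epithet, " ".join(tokens[2:]))


def overlap(a: str, b: str) -> dict:
    """Report which name elements agree, plus the fraction that do."""
    pa, pb = split_name(a), split_name(b)
    shared = {field: getattr(pa, field) == getattr(pb, field)
              for field in NameParts._fields}
    shared["score"] = sum(shared.values()) / len(NameParts._fields)
    return shared


if __name__ == "__main__":
    print(overlap("Foogenus barspecies (Foo B.) Barbar",
                  "Foogenus barspecis (Foo B.) Barbar"))
```

For the two Foogenus examples, this reports the same genus and the same author but a different epithet, which is the kind of relation that "SAME_AS" alone cannot express.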

joelnitta commented 2 years ago

It looks like globi-correct is parsing the name (using gnparser, probably?) and returning the canonical form. As you say, that would be involved in fuzzy matching that takes into account scientific name parts, but parsing alone doesn’t solve what I’m trying to do.

Another way to say it is that I’m happy with the fuzzy matching capabilities of gnames. What I want to do is decouple the fuzzy matching from the database selection. Currently in gnames you are limited to choosing from existing online databases. I want to be able to provide a local, custom database (and assume the easiest way to do so is if it is in Darwin Core (DwC) format). Sorry if that wasn’t clear from the example.

dimus commented 2 years ago

From my trial and error, fuzzy matching stops being useful at some point, generating too many false positives.

So my current approach to fuzzy matching is:

  1. No fuzzy matching for uninomial alone
  2. No fuzzy matching for words less than 5 characters
  3. Allow edit distance 1 on a stemmed part of a name.

In general, I use stemmed versions for all matching, and calculate the edit distance after a fuzzy match is made.
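The three rules above can be sketched as follows. This is an illustration only, not the gnames implementation: the `stem` function is a crude suffix stripper invented here, and reading rule 3 as "at most one edit per word" is an assumption.

```python
# Hypothetical, very rough Latin suffix list; a real stemmer is more careful.
SUFFIXES = ("orum", "arum", "us", "um", "ae", "is", "ii", "a", "e", "i")


def stem(word: str) -> str:
    """Strip one common ending, keeping at least three characters."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[:-len(suffix)]
    return lower


def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by at most one insertion, deletion, or substitution."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substitution at position i
    return a[i:] == b[i + 1:]          # one extra character in b at position i


def fuzzy_match(query: str, candidate: str) -> bool:
    """Compare two canonical names (no authors) under the three rules."""
    q_words, c_words = query.split(), candidate.split()
    if len(q_words) != len(c_words):
        return False
    q_stems = [stem(w) for w in q_words]
    c_stems = [stem(w) for w in c_words]
    if q_stems == c_stems:
        return True   # identical after stemming, no fuzziness needed
    if len(q_words) == 1:
        return False  # rule 1: no fuzzy matching for a uninomial alone
    for word, qs, cs in zip(q_words, q_stems, c_stems):
        if qs == cs:
            continue
        if len(word) < 5:
            return False  # rule 2: short words must match exactly
        if not within_one_edit(qs, cs):
            return False  # rule 3: at most edit distance 1 on the stem
    return True


if __name__ == "__main__":
    print(fuzzy_match("Homo sapients", "Homo sapiens"))
    print(fuzzy_match("Hamo sapiens", "Homo sapiens"))
```

Under these assumptions, a typo in a long epithet is tolerated, while a typo in a short genus such as Homo is not, which limits false positives.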

Fuzzy matching is useful, because errors in names happen quite often during OCR or data input.

dimus commented 2 years ago

Nomer incorporates an older version of @dimus's gnames parser.

The Go version of gnparser has moved forward quite a bit compared to the old Scala version.

@jhpoelen, would you be adventurous enough to use the C library bindings of the Go parser to make a binding for Java via FFI, and release a Java version of it, like the existing Ruby and JavaScript versions?

joelnitta commented 2 years ago

I agree with @dimus that fuzzy matching will almost always end up with some amount of false positives, if the query set is large enough. Whatever cutoff you set will not work for everything. So it is important to check the fuzzily matched results.

To expand a bit more about why fuzzy matching is useful, here are some examples "in the wild". As I alluded to above, I think fuzzy matching is most important for matching author names, which can have a lot of variation. Author names are needed because the same Genus + specific epithet may refer to different entities according to different authors. For example,

Trichomanes bifidum:

So we need author names to correctly resolve taxa.

Author names can vary a lot, even in "standard" taxonomic databases. For example, the same name appears as:

So we need fuzzy matching (and/or some sort of scientific name-aware parsing) to match these.

If this is all outside the scope of nomer, that's fine. It wasn't entirely clear to me if fuzzy matching is something nomer is trying to do.

dimus commented 2 years ago

@joelnitta I wonder if https://apidoc.globalnames.org/gnmatcher might help you with matching 'local' datasets. See also https://github.com/gnames/gnmatcher/issues/39

dimus commented 2 years ago

@joelnitta, I would say authors are beyond the scope of fuzzy matching. The variety in author strings is such that fuzzy matching would create a huge number of false positives if we tried to tackle it at this level. I use a modification of the algorithm developed by @pleary for uBio to score results by author similarity and sort results by that score.

Also, the freshly developed faceted search https://apidoc.globalnames.org/gnames-beta might help in some hard situations with authors.

joelnitta commented 2 years ago

@dimus gnames/gnmatcher#39 is essentially the same thing I want to do. Thanks for linking that issue. Sounds like it is a distant possibility, but nothing I should expect soon. I had a look at https://apidoc.globalnames.org/gnmatcher and https://apidoc.globalnames.org/gnames-beta but as they both apparently require use of online reference databases, I don't think they will help achieve what I'm trying to do.

RE: fuzzy matching of author names. I think the approach taken by taxon-tools, which applies "taxonomic logic" besides just pure string distances, can help get at this. So for the example I gave above, if Cephalomanes auriculatum (Blume) Bosch is the query and Cephalomanes auriculatum Bosch is the reference, it would match by auto_basio+ (identical after removing basionym author from query).
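The "identical after removing basionym author" check could look something like the sketch below. This is only an illustration of the idea, not how taxon-tools implements it; it assumes the basionym author is whatever appears in parentheses, and the function names are made up.

```python
import re


def drop_basionym_author(name: str) -> str:
    """Remove any parenthesized author and normalize whitespace."""
    without_parens = re.sub(r"\([^)]*\)", " ", name)
    return " ".join(without_parens.split())


def same_after_dropping_basionym(query: str, reference: str) -> bool:
    """True if the names are identical once basionym authors are removed."""
    return drop_basionym_author(query) == drop_basionym_author(reference)


if __name__ == "__main__":
    print(same_after_dropping_basionym(
        "Cephalomanes auriculatum (Blume) Bosch",
        "Cephalomanes auriculatum Bosch",
    ))
```

A rule like this encodes nomenclatural knowledge directly, so it does not depend on choosing a string distance cutoff.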

dimus commented 2 years ago

@joelnitta, maybe @camwebb's taxon-tools would cover your and @abubelinha's use case?

Another possible solution is to extract matching algorithms of gnmatcher and gnames into some separate offline project called gndiff for example :)

Something like gndiff names1.txt names2.txt > diff.txt with an optional online name verification

joelnitta commented 2 years ago

maybe @camwebb's taxon-tools would cover your and @abubelinha's use case?

Yes, that's why I'm using it, as I mentioned in gnames issue 22 :)

The reason I'm interested in gnames for this is that taxon-tools is designed based on the rules of botanical nomenclature, so it may not work as well as a general tool. Also, I think gnames might be faster (but have not conducted any benchmarks). I would like to be able to offer the matching capabilities of either gnames or taxon-tools in my taxastand R package.

That said, the gndiff idea you propose sounds great. I don't understand enough about how gnames works, but if the fuzzy matching algorithm could be easily split off into an offline CLI, that would be perfect.

(@jhpoelen apologies if this discussion is drifting completely out of the scope of nomer... I hope it's useful for you though!)

dimus commented 2 years ago

As the conversation is mostly about gnames #22, I'll continue this part of the discussion there.

jhpoelen commented 2 years ago

duplicate of related #78

jhpoelen commented 2 years ago

@joelnitta thanks for being patient as we are figuring this out. @zedomel also requested a similar feature in #78.