globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0

Support for misspellings in Nomer? #105

Open jtmiller28 opened 2 years ago

jtmiller28 commented 2 years ago

I was wondering whether Nomer has any support for misspellings. Some of the resolution services I have been working with offer a max-distance or fuzzy-name-matching field that lets the user specify how many misspellings are acceptable when rounding a taxonomic name to a known one. The same is true for author names with those services.

For example:

$ echo -e "\tEchinocereus engelmannii subsp. engelmannii" | nomer append wfo

returns:

    Echinocereus engelmannii subsp. engelmannii HAS_ACCEPTED_NAME WFO:0001430711 Echinocereus engelmannii subsp. engelmannii subspecies Angiosperms | Caryophyllales | Cactaceae | Echinocereus | Echinocereus engelmannii | Echinocereus engelmannii subsp. engelmannii WFO:9949999999 | WFO:9000000088 | WFO:7000000098 | WFO:4000012914 | WFO:0000661245 | WFO:0001430711 phylum | order | family | genus | species | subspecies http://www.worldfloraonline.org/taxon/wfo-0001430711

However, removing one of the trailing i's from the subspecies epithet (a plausibly common data-entry mistake) results in a failure to resolve:

$ echo -e "\tEchinocereus engelmannii subsp. engelmanni" | nomer append wfo

returns:

    Echinocereus engelmannii subsp. engelmanni NONE Echinocereus engelmannii subsp. engelmanni

Here, setting a max distance of 1 would allow a fuzzy match to return Echinocereus engelmannii subsp. engelmannii.
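To make the "max distance" idea concrete, here is a minimal Python sketch. It assumes the distance meant is plain Levenshtein edit distance, which may differ from what any particular resolution service uses. The function names `edit_distance` and `fuzzy_candidates` and the tiny in-memory name list are illustrative only; they are not part of Nomer.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_candidates(query, known_names, max_distance=1):
    """Return the known names within max_distance edits of query, closest first."""
    scored = [(edit_distance(query, name), name) for name in known_names]
    return [name for distance, name in sorted(scored) if distance <= max_distance]


# Hypothetical stand-in for a name catalog.
known = [
    "Echinocereus engelmannii subsp. engelmannii",
    "Echinocereus engelmannii",
]
print(fuzzy_candidates("Echinocereus engelmannii subsp. engelmanni", known))
```

With a max distance of 1, only the subspecies name is returned, because the missing trailing "i" is a single insertion.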

jhpoelen commented 2 years ago

@jtmiller28 great idea!

I believe that @zedomel and @joelnitta also describe something similar in #68 and #78. Can you confirm?

I have anecdotal evidence to suggest that fuzzy matching creates more issues than it solves (see @dimus' comments in #68) . . . but I understand that having a similar_to match can be more useful than none at all.

An alternative using existing functionality would be to:

  1. first do an exact match
  2. then do a parsed match (see #104 )
  3. then use a (slower) service that supports fuzzy matching
  4. rematch fuzzy matches from 3. using 1.
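The four steps above can be sketched as a small cascade. This is a Python illustration under stated assumptions, not Nomer's implementation: the `exact`, `parsed`, and `fuzzy` callables are hypothetical stand-ins for whichever matchers are used, and the returned labels are only illustrative.

```python
def cascade(name, exact, parsed, fuzzy):
    """Try exact, then parsed, then fuzzy matching.

    A fuzzy suggestion is only accepted if the exact matcher confirms it
    (step 4). Returns (match_type, matched_name), or ("NONE", None).
    """
    hit = exact(name)
    if hit is not None:
        return ("EXACT", hit)

    hit = parsed(name)
    if hit is not None:
        return ("PARSED", hit)

    suggestion = fuzzy(name)
    if suggestion is not None:
        confirmed = exact(suggestion)  # rematch the fuzzy result exactly
        if confirmed is not None:
            return ("SIMILAR_TO", confirmed)

    return ("NONE", None)


# Toy stand-ins (assumptions): a one-name catalog and a canned fuzzy answer.
catalog = {"Echinocereus engelmannii subsp. engelmannii"}
suggestions = {
    "Echinocereus engelmannii subsp. engelmanni":
        "Echinocereus engelmannii subsp. engelmannii",
}

def exact(name):
    return name if name in catalog else None

def parsed(name):
    return None  # parsing is not modelled in this sketch

def fuzzy(name):
    return suggestions.get(name)

print(cascade("Echinocereus engelmannii subsp. engelmanni", exact, parsed, fuzzy))
```

Ordering the matchers this way keeps the slower fuzzy service off the common path, and the final rematch means every accepted name still comes from the exact matcher.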

Examples of matchers that support fuzzy matching include "globalnames":

$ echo -e '\tEchinocereus engelmannii subsp. engelmanni' | nomer append globalnames
[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [globi-globalnames]
    Echinocereus engelmannii subsp. engelmanni  SIMILAR_TO  ITIS:527819 Echinocereus engelmannii engelmannii    Variety     Plantae | Viridiplantae | Streptophyta | Embryophyta | Tracheophyta | Spermatophytina | Magnoliopsida | Caryophyllanae | Caryophyllales | Cactaceae | Echinocereus | Echinocereus engelmannii | Echinocereus engelmannii engelmannii    ITIS:202422 | ITIS:954898 | ITIS:846494 | ITIS:954900 | ITIS:846496 | ITIS:846504 | ITIS:18063 | ITIS:846539 | ITIS:19520 | ITIS:19685 | ITIS:19803 | ITIS:19806 | ITIS:527819  Kingdom | Subkingdom | Infrakingdom | Superdivision | Division | Subdivision | Class | Superorder | Order | Family | Genus | Species | Variety  http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=527819    
    Echinocereus engelmannii subsp. engelmanni  SIMILAR_TO  ITIS:19806  Echinocereus engelmannii    Species     Plantae | Viridiplantae | Streptophyta | Embryophyta | Tracheophyta | Spermatophytina | Magnoliopsida | Caryophyllanae | Caryophyllales | Cactaceae | Echinocereus | Echinocereus engelmannii   ITIS:202422 | ITIS:954898 | ITIS:846494 | ITIS:954900 | ITIS:846496 | ITIS:846504 | ITIS:18063 | ITIS:846539 | ITIS:19520 | ITIS:19685 | ITIS:19803 | ITIS:19806    Kingdom | Subkingdom | Infrakingdom | Superdivision | Division | Subdivision | Class | Superorder | Order | Family | Genus | Species    http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=19806 
    Echinocereus engelmannii subsp. engelmanni  SIMILAR_TO  GBIF:7283894    Echinocereus engelmannii engelmannii    subspecies      Plantae | Tracheophyta | Magnoliopsida | Caryophyllales | Cactaceae | Echinocereus | Echinocereus engelmannii ex | Echinocereus engelmannii engelmannii GBIF:6 | GBIF:7707728 | GBIF:220 | GBIF:422 | GBIF:2519 | GBIF:5384005 | GBIF:8219081 | GBIF:7283894    kingdom | phylum | class | order | family | genus | species | subspecieshttp://www.gbif.org/species/7283894 
    Echinocereus engelmannii subsp. engelmanni  SIMILAR_TO  GBIF:7283894    Echinocereus engelmannii engelmannii    subspecies      Plantae | Tracheophyta | Magnoliopsida | Caryophyllales | Cactaceae | Echinocereus | Echinocereus engelmannii ex | Echinocereus engelmannii engelmannii GBIF:6 | GBIF:7707728 | GBIF:220 | GBIF:422 | GBIF:2519 | GBIF:5384005 | GBIF:8219081 | GBIF:7283894    kingdom | phylum | class | order | family | genus | species | subspecieshttp://www.gbif.org/species/7283894

jhpoelen commented 2 years ago

Another option is to create an explicit list of common typos (I believe some taxonomic name authorities have them already), and apply those corrections. In this way, you have more control over exactly what name you'd like to re-interpret and who made the claim that X was a misspelling of Y.

jhpoelen commented 2 years ago

I created something like the typo list via https://github.com/globalbioticinteractions/globi-taxon-names/blob/main/taxon-name-mapping.csv .

This (or a similar) mapping can be applied using something like:

$ echo -e "\tAblabesymia" | nomer append translate-names
...
    Ablabesymia SAME_AS     Ablabesmyia

jhpoelen commented 2 years ago

where SAME_AS should probably be changed to something like REPLACED_WITH or TRANSLATED_TO or INTERPRETED_AS
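As a rough illustration of a curated typo list that records who made each claim, here is a minimal Python sketch. The column names (`provided_name`, `corrected_name`, `claimed_by`) and the `example-curator` value are assumptions made for this example; they are not the layout of taxon-name-mapping.csv. The TRANSLATED_TO label is one of the candidate names suggested above.

```python
import csv
import io

# Hypothetical typo list; the columns are assumptions for this sketch.
TYPO_CSV = """provided_name,corrected_name,claimed_by
Ablabesymia,Ablabesmyia,example-curator
"""


def load_corrections(text):
    """Index the typo list by the misspelled (provided) name."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["provided_name"]: row for row in reader}


def translate(name, corrections, relation="TRANSLATED_TO"):
    """Return (provided, relation, resolved, claimed_by) for one name.

    Names that are not on the list come back unchanged with relation NONE.
    """
    row = corrections.get(name)
    if row is None:
        return (name, "NONE", name, None)
    return (name, relation, row["corrected_name"], row["claimed_by"])


corrections = load_corrections(TYPO_CSV)
print(translate("Ablabesymia", corrections))
print(translate("Ablabesmyia", corrections))
```

Keeping the claimant next to each correction is what makes this approach auditable: a disputed reinterpretation can be traced and removed without touching the rest of the list.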

dimus commented 2 years ago

For cases where you are searching more than one data source, an exact match can be deceiving because it ignores alternative spellings of a name. For example, if some databases have 'Aus bus' while others have 'Aus bus L. 1777' or 'Aus bus Linn.', some results might be missing. So you would have to run both canonical and exact matches. I think parsing has reached the point where it produces the canonical form correctly in the vast majority of cases, so I have stopped doing exact matches these days.
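To show why canonical matching finds records that a verbatim exact match misses, here is a deliberately naive Python sketch. It is not how https://parser.globalnames.org works: the `naive_canonical` function and its small rank-marker set are assumptions for illustration, and real names (lowercase author particles, hybrids, and so on) need a proper parser.

```python
import re

# Assumed, incomplete set of rank markers to skip.
RANK_MARKERS = {"subsp.", "var."}


def naive_canonical(name: str) -> str:
    """Keep the genus plus the following lowercase epithets.

    Stops at the first token that does not look like an epithet, which
    drops trailing authorship such as 'L. 1777' or 'Linn.'.
    """
    tokens = name.split()
    if not tokens:
        return ""
    canonical = [tokens[0]]
    for token in tokens[1:]:
        if token in RANK_MARKERS:
            continue  # skip the rank marker, keep reading epithets
        if re.fullmatch(r"[a-z-]+", token):
            canonical.append(token)
        else:
            break  # first non-epithet token: treat the rest as authorship
    return " ".join(canonical)


for verbatim in ["Aus bus", "Aus bus L. 1777", "Aus bus Linn."]:
    print(repr(verbatim), "->", repr(naive_canonical(verbatim)))
```

All three spellings reduce to the same canonical form, so a canonical match would find them in every database, while a verbatim exact match would find each in only some.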

The way https://verifier.globalnames.org does the search goes like this:

  1. Parse the input with https://parser.globalnames.org. If the input name is not a virus, continue; if it is a virus, perform a virus match.
  2. Exact canonical match. Stop if there is a result.
  3. Fuzzy canonical match (using the stemmed canonical form). Stop if there is a result.
  4. Exact match after removing the middle or last epithet, converting 'Aus bus cus' to both 'Aus bus' and 'Aus cus'. Stop if there is a result.
  5. Fuzzy match for the same removal of the middle and last epithets.
  6. Exact match of the genus part.
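The epithet-removal and genus fallbacks in the list above can be sketched as a function that produces the progressively shorter queries to try. This is a Python illustration under assumptions, not the verifier's code: `fallback_queries` is a hypothetical name, and only the trinomial case given in the example is handled.

```python
def fallback_queries(canonical):
    """Return shorter queries to try after the full canonical form fails.

    For a trinomial such as 'Aus bus cus' this yields the two binomials
    obtained by dropping the last or the middle epithet, then the genus.
    """
    parts = canonical.split()
    queries = []
    if len(parts) == 3:
        genus, middle, last = parts
        queries.append(f"{genus} {middle}")  # last epithet removed
        queries.append(f"{genus} {last}")    # middle epithet removed
    if len(parts) >= 2:
        queries.append(parts[0])             # genus part only
    return queries


print(fallback_queries("Aus bus cus"))
print(fallback_queries("Aus bus"))
```

Each query would be tried exactly first and then fuzzily, in the order of the list above, stopping at the first step that returns a result.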

More details are at https://verifier.globalnames.org/about

Then it is important to sort the matched results, because not all matches are equal. For example, the input and output can have different authorships, or one name can be valid while another is a synonym.

dimus commented 2 years ago

Another option is to create an explicit list of common typos (I believe some taxonomic name authorities have them already), and apply those corrections. In this way, you have more control over exactly what name you'd like to re-interpret and who made the claim that X was a misspelling of Y.

@jhpoelen, I think this approach does have value when comparing names that were copied and pasted from one database to another. Sadly, that is not always the case, especially when names come from OCR or were typed manually.

jtmiller28 commented 2 years ago

Yep, it seems I am rehashing issues #68 and #78. I think a similar_to match case as you illustrated would be a great asset for large datasets of heterogeneous origin. Methodology-wise it does seem problematic, as pointed out by dimus in issue #68. Personally, I like the idea of listed misspellings as the primary implementation, since that seems the most conservative. Would it be possible to enable multiple fuzzy matching schemes on the command line, where you could elect to use a known-misspellings query (fuzzy_match1) and/or a character-distance query (fuzzy_match2)? This way the user could choose which matching type to enable. From a data analysis point of view, I feel like the known misspellings could cut out a lot of problematic names, while the fuzzy match through character distance may just increase them.
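A rough Python sketch of what user-selectable schemes could look like follows. Everything here is hypothetical: the scheme names, the KNOWN_MISSPELLING_OF label, the in-memory catalog, and the rule that a distance match is accepted only when exactly one candidate is in range are design assumptions for illustration, not existing Nomer options.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (a transposition counts as two edits)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def match(name, catalog, known_misspellings,
          schemes=("known_misspellings",), max_distance=1):
    """Resolve name with an exact lookup plus the opted-in fuzzy schemes."""
    if name in catalog:
        return ("SAME_AS", name)
    if "known_misspellings" in schemes:
        corrected = known_misspellings.get(name)
        if corrected in catalog:
            return ("KNOWN_MISSPELLING_OF", corrected)
    if "edit_distance" in schemes:
        close = sorted(c for c in catalog
                       if edit_distance(name, c) <= max_distance)
        if len(close) == 1:  # refuse to guess between several candidates
            return ("SIMILAR_TO", close[0])
    return ("NONE", name)


catalog = {"Ablabesmyia", "Echinocereus engelmannii subsp. engelmannii"}
known_misspellings = {"Ablabesymia": "Ablabesmyia"}
typo = "Echinocereus engelmannii subsp. engelmanni"

print(match("Ablabesymia", catalog, known_misspellings))
print(match(typo, catalog, known_misspellings))
print(match(typo, catalog, known_misspellings,
            schemes=("known_misspellings", "edit_distance")))
```

Note that the swapped letters in "Ablabesymia" are two edits under plain Levenshtein distance, so a max distance of 1 alone would not catch that typo, whereas the curated list does.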