Open jtmiller28 opened 2 years ago
@jtmiller28 great idea!
I believe that @zedomel and @joelnitta also described something similar in #68 and #78. Can you confirm?
I have anecdotal evidence to suggest that fuzzy matching creates more issues than it solves (see @dimus' comments in #68) ... but I understand that having a similar_to match can be more useful than none at all.
An alternative using existing functionality would be to use a matcher that supports fuzzy matching. Examples of such matchers include "globalnames":
$ echo -e '\tEchinocereus engelmannii subsp. engelmanni' | nomer append globalnames
[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [globi-globalnames]
Echinocereus engelmannii subsp. engelmanni SIMILAR_TO ITIS:527819 Echinocereus engelmannii engelmannii Variety Plantae | Viridiplantae | Streptophyta | Embryophyta | Tracheophyta | Spermatophytina | Magnoliopsida | Caryophyllanae | Caryophyllales | Cactaceae | Echinocereus | Echinocereus engelmannii | Echinocereus engelmannii engelmannii ITIS:202422 | ITIS:954898 | ITIS:846494 | ITIS:954900 | ITIS:846496 | ITIS:846504 | ITIS:18063 | ITIS:846539 | ITIS:19520 | ITIS:19685 | ITIS:19803 | ITIS:19806 | ITIS:527819 Kingdom | Subkingdom | Infrakingdom | Superdivision | Division | Subdivision | Class | Superorder | Order | Family | Genus | Species | Variety http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=527819
Echinocereus engelmannii subsp. engelmanni SIMILAR_TO ITIS:19806 Echinocereus engelmannii Species Plantae | Viridiplantae | Streptophyta | Embryophyta | Tracheophyta | Spermatophytina | Magnoliopsida | Caryophyllanae | Caryophyllales | Cactaceae | Echinocereus | Echinocereus engelmannii ITIS:202422 | ITIS:954898 | ITIS:846494 | ITIS:954900 | ITIS:846496 | ITIS:846504 | ITIS:18063 | ITIS:846539 | ITIS:19520 | ITIS:19685 | ITIS:19803 | ITIS:19806 Kingdom | Subkingdom | Infrakingdom | Superdivision | Division | Subdivision | Class | Superorder | Order | Family | Genus | Species http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=19806
Echinocereus engelmannii subsp. engelmanni SIMILAR_TO GBIF:7283894 Echinocereus engelmannii engelmannii subspecies Plantae | Tracheophyta | Magnoliopsida | Caryophyllales | Cactaceae | Echinocereus | Echinocereus engelmannii ex | Echinocereus engelmannii engelmannii GBIF:6 | GBIF:7707728 | GBIF:220 | GBIF:422 | GBIF:2519 | GBIF:5384005 | GBIF:8219081 | GBIF:7283894 kingdom | phylum | class | order | family | genus | species | subspecieshttp://www.gbif.org/species/7283894
Echinocereus engelmannii subsp. engelmanni SIMILAR_TO GBIF:7283894 Echinocereus engelmannii engelmannii subspecies Plantae | Tracheophyta | Magnoliopsida | Caryophyllales | Cactaceae | Echinocereus | Echinocereus engelmannii ex | Echinocereus engelmannii engelmannii GBIF:6 | GBIF:7707728 | GBIF:220 | GBIF:422 | GBIF:2519 | GBIF:5384005 | GBIF:8219081 | GBIF:7283894 kingdom | phylum | class | order | family | genus | species | subspecieshttp://www.gbif.org/species/7283894
Another option is to create an explicit list of common typos (I believe some taxonomic name authorities have them already) and apply those corrections. In this way, you have more control over exactly which name you'd like to re-interpret and over who made the claim that X was a misspelling of Y.
I created something like such a typo list at https://github.com/globalbioticinteractions/globi-taxon-names/blob/main/taxon-name-mapping.csv .
This (or a similar) mapping can be applied using something like:
$ echo -e "\tAblabesymia" | nomer append translate-names
...
Ablabesymia SAME_AS Ablabesmyia
where SAME_AS should probably be changed to something like REPLACED_WITH, TRANSLATED_TO or INTERPRETED_AS.
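The typo-list idea can be sketched in a few lines of Python. This is only an illustration, not how nomer's translate-names matcher is implemented: the column names (`provided_name`, `resolved_name`, `asserted_by`), the inline CSV content, the `example-curator` value and the `INTERPRETED_AS` label are all assumptions made up for this example.

```python
import csv
import io

# Hypothetical typo list; the columns and the "asserted_by" value are
# invented for this sketch and do not mirror any real mapping file.
TYPO_LIST_CSV = """provided_name,resolved_name,asserted_by
Ablabesymia,Ablabesmyia,example-curator
"""


def load_typo_list(text):
    """Return {misspelling: (correction, who asserted it)}."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["provided_name"]: (row["resolved_name"], row["asserted_by"])
        for row in reader
    }


def translate(name, typo_list):
    """Map a name through the typo list; unlisted names yield NONE."""
    if name in typo_list:
        corrected, source = typo_list[name]
        return (name, "INTERPRETED_AS", corrected, source)
    return (name, "NONE", name, None)


typos = load_typo_list(TYPO_LIST_CSV)
print(translate("Ablabesymia", typos))
print(translate("Aus bus", typos))
```

Keeping the source of each correction next to the mapping is what preserves the "who made the claim" part: only names that someone explicitly listed are ever rewritten.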
For cases where you are searching more than one data source, exact matching can be deceiving because it ignores alternative spellings of a name. For example, if some databases have 'Aus bus' while others have 'Aus bus L. 1777' or 'Aus bus Linn.', some results might be missing. So you would have to run both canonical and exact matches. I think parsing has reached the point where it produces the canonical form correctly in the vast majority of cases, so I have stopped doing exact matches these days.
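To illustrate why canonical matching catches what exact matching misses, here is a deliberately naive Python sketch. The `naive_canonical` helper is hypothetical and far cruder than a real scientific name parser: it keeps the genus plus any following lowercase epithets or rank markers and drops everything from the first token that looks like authorship or a year. Lowercase author particles (for example "de") would fool it.

```python
import re

# Tokens kept after the genus: lowercase epithets (letters and hyphens)
# and rank markers such as "subsp." or "var.".
EPITHET_OR_RANK = re.compile(r"^[a-z][a-z-]*\.?$")


def naive_canonical(name):
    """Crude canonical form: genus plus the leading lowercase tokens."""
    tokens = name.split()
    if not tokens:
        return ""
    kept = [tokens[0]]
    for token in tokens[1:]:
        if not EPITHET_OR_RANK.match(token):
            break  # capitalised token or year: assume authorship starts here
        kept.append(token)
    return " ".join(kept)


names = ["Aus bus", "Aus bus L. 1777", "Aus bus Linn."]
print({name: naive_canonical(name) for name in names})
```

All three spellings reduce to the same string here, so a lookup keyed on the canonical form would find them together, whereas an exact string comparison would treat them as three different names.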
The way http://verifier.globalnames.org does the search goes like this:
More details are at https://verifier.globalnames.org/about
Then it is important to sort the matched results, because not all matches are equal. For example, input and output can have different authorships, or one name can be valid while another is a synonym, etc.
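One way to picture the sorting step is a sort key over a few properties of each match. The fields, labels and priority order below are assumptions chosen for illustration; they are not the verifier's actual scoring, which is described on its about page.

```python
from dataclasses import dataclass


@dataclass
class Match:
    matched_name: str
    match_type: str          # "exact" or "fuzzy" (hypothetical labels)
    is_accepted: bool        # False when the matched name is a synonym
    authorship_agrees: bool  # True when input and output authorship agree


def sort_key(match):
    """Smaller tuples sort first: exact, then accepted, then same authorship."""
    return (
        match.match_type != "exact",
        not match.is_accepted,
        not match.authorship_agrees,
    )


candidates = [
    Match("Aus bus Smith", "fuzzy", True, False),
    Match("Aus bus L. 1777", "exact", False, True),
    Match("Aus bus L. 1777", "exact", True, True),
]
ranked = sorted(candidates, key=sort_key)
for match in ranked:
    print(match)
```

A tuple key keeps the priorities explicit and easy to reorder, for instance if accepted status should outrank match type for a given use case.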
Another option is to create an explicit list of common typos (I believe some taxonomic name authorities have them already) and apply those corrections. In this way, you have more control over exactly which name you'd like to re-interpret and over who made the claim that X was a misspelling of Y.
@jhpoelen, I think this approach does have value when comparing names that were copied and pasted from one database to another. Sadly, that is not always the case, especially when names come from OCR or were typed manually.
Yep, it seems I am rehashing issues #68 and #78. I think a similar_to match case as you illustrated would be a great asset for large datasets of heterogeneous origin. Methodology-wise, it does seem problematic, as pointed out by dimus in issue #68. Personally, I like the idea of listed misspellings as the primary implementation, since that seems the most conservative. Would it be possible to enable multiple fuzzy matching schemes from the command line, where you could elect to use a known-misspellings query (fuzzy_match1) and/or a character-distance query (fuzzy_match2)? This way the user could choose which matching type to enable. From a data analysis point of view, I feel like the known misspellings could cut out a lot of problematic names, while the fuzzy match through character distance may just increase them.
I was wondering if there is any support for misspellings in Nomer. Some of the resolution services I have been working with offer a max distance / fuzzy name matching field that lets the user specify how many misspelled characters are acceptable when rounding to a taxonomic name. The same is true for author names with those services.
For example:
$ echo -e "\tEchinocereus engelmannii subsp. engelmannii" | nomer append wfo
returns:
Echinocereus engelmannii subsp. engelmannii HAS_ACCEPTED_NAME WFO:0001430711 Echinocereus engelmannii subsp. engelmannii subspecies Angiosperms | Caryophyllales | Cactaceae | Echinocereus | Echinocereus engelmannii | Echinocereus engelmannii subsp. engelmannii WFO:9949999999 | WFO:9000000088 | WFO:7000000098 | WFO:4000012914 | WFO:0000661245 | WFO:0001430711 phylum | order | family | genus | species | subspecies http://www.worldfloraonline.org/taxon/wfo-0001430711
However, removing one of the trailing i's from the subspecies epithet (a possibly common data entry mistake) results in a failure to resolve:
$ echo -e "\tEchinocereus engelmannii subsp. engelmanni" | nomer append wfo
returns:
Echinocereus engelmannii subsp. engelmanni NONE Echinocereus engelmannii subsp. engelmanni
Here, you could specify a max distance of 1, allowing a fuzzy match to return Echinocereus engelmannii subsp. engelmannii.
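As a sketch of what such a max distance setting could mean, the Python below computes the Levenshtein (edit) distance and accepts candidates within a threshold. `max_distance` and `fuzzy_match` are hypothetical names for this illustration, and the one-element list of known names is made up; nothing here is meant to describe an existing nomer option.

```python
def levenshtein(a, b):
    """Edit distance counting single-character inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_match(name, known_names, max_distance=1):
    """Return the known names within max_distance edits of name."""
    return [k for k in known_names if levenshtein(name, k) <= max_distance]


known = ["Echinocereus engelmannii subsp. engelmannii"]
query = "Echinocereus engelmannii subsp. engelmanni"
print(fuzzy_match(query, known, max_distance=1))
print(fuzzy_match(query, known, max_distance=0))
```

With a threshold of 1 the missing trailing "i" is tolerated, while a threshold of 0 reduces to exact matching. Larger thresholds admit more candidates and, with them, more false positives on short or similar names.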