globalbioticinteractions / name-alignment-template

align names with known taxonomic resources
https://big-bee-network.github.io/name-alignment-workshop
Creative Commons Zero v1.0 Universal

Author Naming Requirements #4

Closed jtmiller28 closed 1 year ago

jtmiller28 commented 2 years ago

Errors in the name strings, such as an uncapitalized author name, seem to be overlooked. For example, if a name is "Andrena imitatrix cresson" rather than "Andrena imitatrix Cresson", the tool will assume that cresson is part of the name and register NONE for the field that specifies whether it has an accepted_name or synonym located in the catalogue. If run like this, it will also store the full provided name "Andrena imitatrix cresson" in the resolvedName output rather than NA or NONE. Not sure if this is an issue or just an additional point to add to the readme in order to avoid accidental misinterpretation.

seltmann commented 2 years ago

@jtmiller28 did you test the result of Andrena imitatrix Cresson and did that make a difference?

jtmiller28 commented 2 years ago

Yes. Changing "Andrena imitatrix cresson", which gives the following output: [image], to "Andrena imitatrix Cresson" solves the issue: [image]

So, provided that the data entry is correct, there shouldn't be issues, but if there are errors it's necessary to manually look for outputs that give a resolved name but have NONE listed for the matches.
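
The manual check described above could be scripted. Here is a minimal Python sketch, assuming the Nomer output is tab-separated with the match relation in the third column and the resolved name in the fifth. That column layout is an assumption, as are the sample rows and the `find_unmatched` helper; adjust the indices to the actual output.

```python
import csv
import io

def find_unmatched(tsv_text, relation_col=2, resolved_col=4):
    """Return rows whose relation is NONE but that still carry a resolved name."""
    # QUOTE_NONE keeps names containing quote characters intact.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    suspicious = []
    for row in reader:
        if len(row) <= max(relation_col, resolved_col):
            continue  # skip short or malformed rows
        if row[relation_col] == "NONE" and row[resolved_col].strip():
            suspicious.append(row)
    return suspicious

# Hypothetical sample rows, made up for illustration (not real Nomer output).
sample = (
    "\tAndrena imitatrix cresson\tNONE\t\tAndrena imitatrix cresson\n"
    "\tAndrena imitatrix Cresson\tHAS_ACCEPTED_NAME\tITIS:0\tAndrena imitatrix\n"
)

for row in find_unmatched(sample):
    print(row[1])  # the provided name that needs a manual look
```

Rows flagged this way are candidates for a data entry error such as a lowercase author name.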

jhpoelen commented 2 years ago

@jtmiller28 thanks for sharing your observations.

The idea behind being a little stringent in specific processing steps (e.g., aligning a name with the ITIS taxonomy) is to have a separation of concerns:

Instead of clumping all logic into one operation, Nomer is designed to separate the processing steps. This way, you can design your own name alignment process for your particular source.

For example, one such process/method is the "translate-names" Nomer matcher.

I use the "translate-names" matcher specifically to identify known spelling errors in names / terms, or to relate common names to their likely scientific name counterpart. By default, this matcher uses https://github.com/globalbioticinteractions/globi-taxon-names/blob/main/taxon-name-mapping.csv to translate a provided name into the "resolved", or translated, name. You can also make your own list and configure Nomer to use it via the nomer append --properties my.properties command, where my.properties contains something like:

nomer.taxon.name.correction.url=https://example.org/my-taxon-name-mapping.txt
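
To illustrate what a two-column mapping list does, here is a minimal Python sketch of a lookup-based translation. It is not Nomer's implementation: the column order (provided name, then corrected name), the absence of a header row, the sample entry, and the `load_mapping` and `translate` helpers are all assumptions made for illustration.

```python
import csv
import io

def load_mapping(csv_text):
    """Read a two-column CSV (provided name, corrected name) into a dict."""
    mapping = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) >= 2:
            mapping[row[0].strip()] = row[1].strip()
    return mapping

def translate(name, mapping):
    """Return the mapped name if one is listed, else the name unchanged."""
    return mapping.get(name.strip(), name)

# Hypothetical mapping content; a real list would live in its own file.
mapping_csv = "Andrena imitatrix cresson,Andrena imitatrix Cresson\n"
mapping = load_mapping(mapping_csv)

print(translate("Andrena imitatrix cresson", mapping))
print(translate("Apis mellifera", mapping))
```

Names that are not listed pass through unchanged, so a list like this only needs to hold the known problem cases.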

In the existing name alignment template the following matchers are used in sequence:

gbif-parse

and some matcher related to a taxonomic name resolver (e.g., itis, gbif, col).

So, depending on your situation, I could add the functionality for you to implement your own workflow using Nomer, or using your own program.

For instance, I imagine that you might want to do something like:

first capitalize the author name

then parse the name

then align the name with ITIS
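
The first of these steps could be sketched in Python as below. This is only an illustration under stated assumptions: the `capitalize_author` helper and the `KNOWN_AUTHORS` set are hypothetical, and the trailing token is only capitalized when it appears in that set, because a lowercase third token can also be a subspecies epithet. The parse and align steps would still be handled by the matchers and are not shown.

```python
def capitalize_author(name, known_authors):
    """Capitalize the trailing token if it is a known author surname in lowercase."""
    tokens = name.split()
    if (
        len(tokens) >= 3
        and tokens[-1].islower()
        and tokens[-1] in known_authors
    ):
        tokens[-1] = tokens[-1].capitalize()
    return " ".join(tokens)

# Hypothetical author list; a real one would come from your own data.
KNOWN_AUTHORS = {"cresson"}

print(capitalize_author("Andrena imitatrix cresson", KNOWN_AUTHORS))
print(capitalize_author("Apis mellifera ligustica", KNOWN_AUTHORS))
```

Restricting the change to a known list trades coverage for safety: unknown trailing tokens are left alone rather than guessed at.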

Curious to hear your thoughts on how you think the current (crude) name alignment template can be re-used / altered to make the tool a little easier to understand, use, or configure.

jhpoelen commented 2 years ago

btw - here's a list of supported matchers https://github.com/globalbioticinteractions/nomer#show-supported-matchers .

you can also generate the list by running:

$ nomer matchers --verbose
ala Lookup taxon in Atlas of Living Australia by name or by id using ALATaxon:* prefix.
bold-web    Use BOLD webservice to lookup taxa by bin/taxon id using BOLD:* and BOLDTaxon:* prefixes.
col Lookup Catalogue of Life taxon by name or COL:* prefixed ids using offline-enabled database dump
crossref-doi    uses api.crossref.org to resolve doi associated with human readable citation
discoverlife    Lookup DiscoverLife taxa by name, synonym using offline-enabled database dump
envo    Lookup envo terms by name or by id using ENVO:* prefix.
eol Lookup EOL pages by id with EOL:* prefix using offline-enabled database dump
gbif    Lookup GBIF taxa by name, synonym or id using offline-enabled database dump
gbif-parse  Attempts extract canonical taxonomic name from name string using https://github.com/gbif/name-parser .
gbif-web    Web-based taxon id/name lookup using GBIF backbone API and GBIF:* prefix.
globalnames Uses https://resolver.globalnames.org to match taxon names. Searches by name only (not id).
globi   Uses GloBI's Taxon Graph to lookup terms by id or name across many taxonomies / ontologies. Caches a copy locally on first use to allow for subsequent offline usage. Use properties [nomer.term.cache.url] and [nomer.term.map.url] to override default cache and map locations. See https://doi.org/10.5281/zenodo.755513 for more information.
globi-correct   Scrubs names using GloBI's (taxonomic) name scrubber. Scrubbing includes removing of stopwords (e.g., undefined), correcting common typos using a "crappy" names list, parse to canonical name using gnparser (see https://github.com/GlobalNamesArchitecture/gnparser), and more.
globi-enrich    Uses GloBI's taxon enricher to find first term match by id or name. Uses various web apis like Encyclopedia of Life, World Registry of Marine Species (WoRMS), Integrated Taxonomic Information System (ITIS), National Biodiversity Network (NBN) and more.
globi-rank  Finds taxonomic rank identifiers by rank commons name (e.g., species, order, soort). Uses Wikidata taxon rank items. Caches a copy locally on first usage to allow for subsequent offline usage.
gn-parse    Attempts extract canonical taxonomic name from name string using https://github.com/GlobalNamesArchitecture/gnparser .
gulfbase    Look up taxa of https://gulfbase.org by name or id with BioGoMx:* prefix.
inaturalist-id  Lookup taxon in iNaturalist by id with INAT_TAXON:* prefix.
indexfungorum   Lookup Index Fungorum taxon by name or id using offline-enabled database dump
itis    Lookup ITIS taxon by name or id using offline-enabled database dump
itis-web    Use itis webservice to lookup taxa by id using ITIS:* prefix.
nbn Lookup taxon of National Biodiversity Network by id with NBN:* prefix.
ncbi    Lookup NCBI taxa by name, synonym or id using offline-enabled database dump
ncbi-web    Lookup NCBI taxon by id with NCBI:* prefix using web apis.
nodc    Lookup taxon in the Taxonomic Code of the National Oceanographic Data Center (NODC) by id with prefix NODC: . Maps to ITIS terms if possible.
openbiodiv  uses openbiodiv sparql endpoint to resolve openbiodiv terms
orcid-web   Lookup ORCID by id with ORCID:* prefix.
ott Lookup Open Tree of Life taxon by name or (OTT|GBIF|WORMS|IF|NCBI|IRMNG)* prefixed ids using offline-enabled database dump
plazi   Lookup Plazi taxon treatment by name or id using offline-enabled database dump
pmid-doi    resolves pubmed id to doi using https://www.ncbi.nlm.nih.gov/pmc/pmctopmid/
remove-stop-words   Removes stop words (e.g., undefined) using a stop word list specified by property [nomer.taxon.name.stopword.url] .
translate-names Translates incoming names using a two column csv file specified by property [nomer.taxon.name.correction.url] .
uksi-current-name   Use UK Species Inventory to find current taxonomic name.
wikidata-web    uses wikidata to cross-walk taxon id across taxonomies
worms   Lookup taxon in WoRMS by name or by id with WORMS:* prefix.

jtmiller28 commented 2 years ago

I did not see that you can modify your own configuration; that is probably ideal rather than creating a step to capitalize part of the string. So far I find the tool to be useful and intuitive. I am having issues using it on a large-scale example: my file appears to run into issues with the scientificName column not being found (not sure if it's a .csv issue or something with my table). I'll recheck some of my configuration and update once I have a better handle on this.
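
One way to narrow down a "column not found" problem is to inspect the header row directly. The Python sketch below is a generic check, not part of the template: it assumes the likely culprits are a byte order mark, stray whitespace, or an unexpected delimiter, and the `find_column` helper is hypothetical.

```python
import csv
import io

def find_column(text, wanted="scientificName"):
    """Return (delimiter, index) of the wanted header column, or None if absent."""
    lines = text.lstrip("\ufeff").splitlines()  # drop a leading byte order mark
    if not lines:
        return None
    for delimiter in (",", "\t", ";"):
        header = next(csv.reader(io.StringIO(lines[0]), delimiter=delimiter))
        cleaned = [h.strip() for h in header]
        if wanted in cleaned:
            return delimiter, cleaned.index(wanted)
    return None

# Hypothetical file content with a byte order mark in front of the header.
print(find_column("\ufeffscientificName,family\nAndrena imitatrix,Andrenidae\n"))
```

A result of `None` suggests the header is spelled differently or the file uses a delimiter other than the three tried here.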

jhpoelen commented 2 years ago

@jtmiller28 thanks for taking the time to share your feedback. Can you please provide specific examples of cases where the scientificName column is not being found?

jhpoelen commented 1 year ago

Closing stale issue. @jtmiller28, please feel free to re-open or comment if you have additional questions.