Closed jtmiller28 closed 2 years ago
@jtmiller28 thanks for pointing out that the authorship strings are not preserved on name parsing when using @dimus's Global Names parser and @mdoering's GBIF name parser.
I made a first pass at preserving the authorship strings as formatted by the name parsers.
Using my.properties:
nomer.append.schema.output.example.taxon.rank.order=[{"column":0,"type":"path.order.id"},{"column": 1,"type":"path.order.name"},{"column": 2,"type":"path.order"}]
nomer.append.schema.output=[{"column":0,"type":"externalId"},{"column": 1,"type":"name"},{"column": 2,"type":"authorship"},{"column":3,"type":"rank"}]
nomer.schema.input=[{"column":0,"type":"externalId"},{"column": 1,"type":"name"},{"column": 2,"type":"authorship"},{"column": 3, "type":"rank"}]
I was able to produce:
$ echo -e "\tHomo sapiens Linneaus 1758" | nomer append gbif-parse --include-header --properties my.properties
[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [gbif-parse]
providedExternalId providedName relationName resolvedExternalId resolvedName resolvedAuthorship resolvedRank
Homo sapiens Linneaus 1758 SAME_AS Homo sapiens Linneaus, 1758
and
$ echo -e "\tHomo sapiens Linneaus 1758" | nomer append gn-parse --include-header --properties my.properties
[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [gn-parse]
providedExternalId providedName relationName resolvedExternalId resolvedName resolvedAuthorship resolvedRank
Homo sapiens Linneaus 1758 SAME_AS Homo sapiens Linneaus 1758
Note that the GBIF parser adds a comma between the author name and the year, whereas the GN parser does not insert the comma.
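If the two parser outputs need to be compared downstream, one option is to normalize the authorship string first. A minimal sketch, assuming the only difference is the optional comma before a four-digit year; `normalize_authorship` is a hypothetical helper and not part of nomer:

```python
import re


def normalize_authorship(authorship: str) -> str:
    """Collapse whitespace and drop the optional comma before a four-digit year."""
    collapsed = " ".join(authorship.split())
    # "Linneaus, 1758" and "Linneaus 1758" both become "Linneaus 1758".
    return re.sub(r",\s*(\d{4})\b", r" \1", collapsed)


gbif_style = "Linneaus, 1758"  # as formatted by gbif-parse above
gn_style = "Linneaus 1758"     # as formatted by gn-parse above
print(normalize_authorship(gbif_style) == normalize_authorship(gn_style))
```

This only produces a comparison key; the strings returned by the parsers themselves are left untouched.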
I hope to include this functionality in an upcoming nomer release.
Thanks jorrit! I'll add this to my workflow and provide feedback once name resolution occurs.
https://github.com/globalbioticinteractions/nomer/releases/tag/0.3.1 is now available with a fix for this issue. @jtmiller28 please review.
Hey jorrit, this is a great addition to nomer for keeping track of intended authorship. I've tried it out and it works as intended.
echo -e "\tHelianthemum scoparium\tNutt. ex Torr. & A.Gray" | nomer append wfo --include-header --properties file://$PWD/my.properties2
providedExternalId providedName providedAuthorship relationName resolvedExternalId resolvedName resolvedAuthorship resolvedRank
[main] INFO org.globalbioticinteractions.nomer.match.WorldOfFloraOnlineTaxonService - [WORLD_OF_FLORA_ONLINE] taxonomy already indexed at [/home/jt-miller/.cache/nomer/world_of_flora_online/world_of_flora_online], no need to import.
Helianthemum scoparium Nutt. ex Torr. & A.Gray HAS_ACCEPTED_NAME WFO:0001295598 Helianthemum scoparium Nutt. ex Torr. & A.Gray species
There is a bit of a problem when including authorship on real-world data, however: it seems pretty common that authors are not written with the exact character string that the catalogue has listed. This is probably due to ambiguity in how an author should be notated, for example abbreviation, order of names listed, etc.
The result is that most mappings end up listed as NONE simply because the author field is filled in and there is at least one character mismatch in the author string. A workaround I am planning to use, workflow-wise: run my list through nomer using only scientificName > find all names for which multiple mappings are generated (indicating that authorship matters for determining the correct name mapping according to the catalogue) > rerun only those multiple-mapping names, this time including authorship.
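The middle step of that workflow (finding names that received multiple mappings) could be sketched roughly as below. This assumes the nomer output was produced with --include-header, is tab-separated, and contains the providedName, relationName and resolvedExternalId columns shown earlier; the function name and the sample rows are illustrative, with identifiers taken from examples in this thread.

```python
import csv
import io
from collections import defaultdict


def names_with_multiple_mappings(tsv_text: str) -> list[str]:
    """Return provided names that resolved to more than one distinct identifier."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    resolved_ids = defaultdict(set)
    for row in reader:
        # Rows with relation NONE are non-matches and do not count as mappings.
        if row["relationName"] != "NONE":
            resolved_ids[row["providedName"]].add(row["resolvedExternalId"])
    return sorted(name for name, ids in resolved_ids.items() if len(ids) > 1)


# Illustrative sample, reduced to the three columns the function needs.
HEADER = ["providedName", "relationName", "resolvedExternalId"]
ROWS = [
    ["×Aegilotriticum requienii", "SYNONYM_OF", "WFO:0000841768"],
    ["×Aegilotriticum requienii", "HAS_ACCEPTED_NAME", "WFO:0001356228"],
    ["Helianthemum scoparium", "HAS_ACCEPTED_NAME", "WFO:0001295598"],
]
SAMPLE_TSV = "\n".join("\t".join(cells) for cells in [HEADER, *ROWS])

print(names_with_multiple_mappings(SAMPLE_TSV))
```

Only the names returned here would then be rerun with authorship included.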
I think we can both agree that fuzzy matching is a bit precarious, so not the best solution here, but it might be interesting to enable nomer to fall back to using only the provided scientificName field when it fails to match a name with authorship provided. As an example from my data:
echo -e "\tHelianthemum scoparium\tNutt. ex Torr. & A. Gray" | nomer append wfo --include-header --properties file://$PWD/my.properties2
returns:
Helianthemum scoparium Nutt. ex Torr. & A. Gray NONE Helianthemum scoparium Nutt. ex Torr. & A. Gray
After looking at the WFO catalogue, it becomes apparent that the only difference from the name pulled from GBIF is that A. Gray should be A.Gray, without the white space:
echo -e "\tHelianthemum scoparium\tNutt. ex Torr. & A.Gray" | nomer append wfo --include-header --properties file://$PWD/my.properties2
returns:
Helianthemum scoparium Nutt. ex Torr. & A.Gray HAS_ACCEPTED_NAME WFO:0001295598 Helianthemum scoparium Nutt. ex Torr. & A.Gray species
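For this particular kind of mismatch, a pre-processing step could build a whitespace-insensitive comparison key. A minimal sketch, assuming the only variation is a space after a single-letter initial; `authorship_key` is a hypothetical helper, not something nomer or WFO provides:

```python
import re


def authorship_key(authorship: str) -> str:
    """Collapse whitespace and remove the space after single-letter initials."""
    collapsed = " ".join(authorship.split())
    # "A. Gray" -> "A.Gray"; longer abbreviations such as "Nutt." are untouched.
    return re.sub(r"\b([A-Z]\.)\s+(?=[A-Z])", r"\1", collapsed)


from_gbif = "Nutt. ex Torr. & A. Gray"
from_wfo = "Nutt. ex Torr. & A.Gray"
print(authorship_key(from_gbif) == authorship_key(from_wfo))
```

The key is meant to be applied to both sides of a comparison, since other catalogues may well prefer the spaced form.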
This is easy to spot in a simple example like the one above, but it becomes rather daunting when presented with tens of thousands of such erroneous mappings caused only by authorship. If instead there were some field denoting that authorship matching failed, so nomer retried with scientificName only and was successful, this could alleviate many false NONE matches that are due only to ambiguity in writing authorship to the catalogue's exact standard. It would also provide a way to track where these authorships failed, if one were interested in tracking down author names that appear erroneous in their datasets.
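To make the proposal concrete, here is a rough sketch of the fallback with a tracking flag. Everything in it is hypothetical: `CATALOGUE` is a toy stand-in for a catalogue index, `match_with_fallback` is not a nomer function, and the relation is hard-coded for brevity. It also assumes the fallback should only fire when the name alone is unambiguous.

```python
from typing import NamedTuple, Optional


class Match(NamedTuple):
    relation: str
    resolved_id: Optional[str]
    authorship_matched: bool  # False when the name-only fallback was used


# Toy catalogue index: (name, authorship) -> identifier.
CATALOGUE = {
    ("Helianthemum scoparium", "Nutt. ex Torr. & A.Gray"): "WFO:0001295598",
}


def match_with_fallback(name: str, authorship: str, catalogue=CATALOGUE) -> Match:
    """Try name plus authorship first, then retry with the name only."""
    exact = catalogue.get((name, authorship))
    if exact is not None:
        return Match("HAS_ACCEPTED_NAME", exact, True)
    # Fallback: ignore authorship, but only accept an unambiguous result.
    by_name = sorted({rid for (n, _), rid in catalogue.items() if n == name})
    if len(by_name) == 1:
        return Match("HAS_ACCEPTED_NAME", by_name[0], False)
    return Match("NONE", None, False)


print(match_with_fallback("Helianthemum scoparium", "Nutt. ex Torr. & A. Gray"))
```

Rows with `authorship_matched` set to False would be the ones to review when tracking down authorship strings that differ from the catalogue.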
Hope that makes sense, thanks for all of your attention to this discussion about enhancement!
Note that authors are sometimes abbreviated, omitted, or ordered in a different way. Because of that, direct comparison of authors does not work all the time, so gnresolver uses this algorithm for matching: https://github.com/gnames/gnames/blob/master/ent/score/auth.go
Note that v0.4.1 of Nomer now includes authorship in the default input/output schema:
echo -e "\tHelianthemum scoparium\tNutt. ex Torr. & A.Gray" | nomer append col --include-header
providedExternalId | providedName | providedAuthorship | relationName | resolvedExternalId | resolvedName | resolvedAuthorship | resolvedRank | resolvedCommonNames | resolvedPath | resolvedPathIds | resolvedPathNames | resolvedPathAuthorships | resolvedExternalUrl |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Helianthemum scoparium | Nutt. ex Torr. & A.Gray | NONE | | Helianthemum scoparium | Nutt. ex Torr. & A.Gray | | | | | | | |
And with WFO
echo -e "\tHelianthemum scoparium\tNutt. ex Torr. & A.Gray" | nomer append wfo --include-header
providedExternalId | providedName | providedAuthorship | relationName | resolvedExternalId | resolvedName | resolvedAuthorship | resolvedRank | resolvedCommonNames | resolvedPath | resolvedPathIds | resolvedPathNames | resolvedPathAuthorships | resolvedExternalUrl |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Helianthemum scoparium | Nutt. ex Torr. & A.Gray | HAS_ACCEPTED_NAME | WFO:0001295598 | Helianthemum scoparium | Nutt. ex Torr. & A.Gray | species | Angiosperms | Malvales | Cistaceae | Helianthemum | Helianthemum scoparium | WFO:9949999999 | WFO:9000000312 | WFO:7000000136 | WFO:4000017201 | WFO:0001295598 | phylum | order | family | genus | species | http://www.worldfloraonline.org/taxon/wfo-0001295598 |
my.properties.gz
In my ideal world, if nomer detects multiple taxonomically valid names for a supplied name, it would then query a second tabular input, containing the authorship if provided, to select the intended resolution. If the authorship is not provided, it would continue generating multiple possible mappings.
Example: echo -e "\t×Aegilotriticum requienii\t(Ces., Pass. & Gibelli) P.Fourn." | nomer append gn-parse | nomer append wfo would provide ×Aegilotriticum requienii (Ces., Pass. & Gibelli) P.Fourn. SAME_AS × Aegilotriticum requienii SYNONYM_OF WFO:0000841768 ×Aegilotriticum triticoides species ×Aegilotriticum triticoides WFO:0000841768 species http://www.worldfloraonline.org/taxon/wfo-0000841768
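The selection step could look roughly like the sketch below. `Candidate` and `select_by_authorship` are hypothetical names, and the pairing of authorship strings with WFO identifiers is taken from the illustrative rows in this post rather than from actual nomer output.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    relation: str
    resolved_id: str
    authorship: str


def select_by_authorship(
    provided_authorship: Optional[str], candidates: list[Candidate]
) -> list[Candidate]:
    """Keep candidates whose authorship equals the provided one, else keep all."""
    if not provided_authorship:
        # No authorship supplied: keep generating every possible mapping.
        return list(candidates)
    wanted = " ".join(provided_authorship.split())
    matching = [c for c in candidates if " ".join(c.authorship.split()) == wanted]
    # If nothing matches exactly, fall back to returning all candidates.
    return matching or list(candidates)


CANDIDATES = [
    Candidate("SYNONYM_OF", "WFO:0000841768", "(Ces., Pass. & Gibelli) P.Fourn."),
    Candidate("HAS_ACCEPTED_NAME", "WFO:0001356228", "(Ces., Pass. & Gibelli) Veldkamp"),
]

print(select_by_authorship("(Ces., Pass. & Gibelli) P.Fourn.", CANDIDATES))
```

Returning all candidates when nothing matches is one possible choice; reporting NONE instead would be the stricter alternative.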
Alternatively, passing the authorship of the original name through nomer could ease the workflow for tracing taxonomic names that received multiple mappings. Something like: echo -e "\t×Aegilotriticum requienii\t(Ces., Pass. & Gibelli) P.Fourn." | nomer append gn-parse | nomer append wfo
×Aegilotriticum requienii (Ces., Pass. & Gibelli) P.Fourn. SAME_AS × Aegilotriticum requienii INTENDED_AUTHOR (Ces., Pass. & Gibelli) P.Fourn. SYNONYM_OF WFO:0000841768 ×Aegilotriticum triticoides species ×Aegilotriticum triticoides
×Aegilotriticum requienii (Ces., Pass. & Gibelli) P.Fourn. SAME_AS × Aegilotriticum requienii INTENDED_AUTHOR (Ces., Pass. & Gibelli) Veldkamp HAS_ACCEPTED_NAME WFO:0001356228 ×Aegilotriticum requienii species ×Aegilotriticum requienii WFO:0001356228 species http://www.worldfloraonline.org/taxon/wfo-0001356228
Hope that's clear; let me know if anything in my example is too ambiguous.
Originally posted by @jtmiller28 in https://github.com/globalbioticinteractions/nomer/issues/104#issuecomment-1262872493