Suggest to include authorship past name parsing gn-parse, gbif-parse

jtmiller28 commented 2 years ago

    Ah yes, my bad:

my.properties.gz

In my ideal world if nomer detects multiple taxonomically valid names for a supplied name it would then query a second tabular input, containing the authorship if provided, to select the intended resolution. If the authorship is not provided, then continue generating multiple possible mappings.

Example:

echo -e "\t×Aegilotriticum requienii\t(Ces., Pass. & Gibelli) P.Fourn." | nomer append gn-parse | nomer append wfo

would provide:

×Aegilotriticum requienii (Ces., Pass. & Gibelli) P.Fourn. SAME_AS × Aegilotriticum requienii SYNONYM_OF WFO:0000841768 ×Aegilotriticum triticoides species ×Aegilotriticum triticoides WFO:0000841768 species http://www.worldfloraonline.org/taxon/wfo-0000841768

Alternatively, passing the authorship of the original name through nomer could alleviate the work of tracing taxonomic names that received multiple mappings. Something like:

echo -e "\t×Aegilotriticum requienii\t(Ces., Pass. & Gibelli) P.Fourn." | nomer append gn-parse | nomer append wfo

×Aegilotriticum requienii (Ces., Pass. & Gibelli) P.Fourn. SAME_AS × Aegilotriticum requienii INTENDED_AUTHOR (Ces., Pass. & Gibelli) P.Fourn. SYNONYM_OF WFO:0000841768 ×Aegilotriticum triticoides species ×Aegilotriticum triticoides WFO:0000841768 species http://www.worldfloraonline.org/taxon/wfo-0000841768

×Aegilotriticum requienii (Ces., Pass. & Gibelli) P.Fourn. SAME_AS × Aegilotriticum requienii INTENDED_AUTHOR (Ces., Pass. & Gibelli) Veldkamp HAS_ACCEPTED_NAME WFO:0001356228 ×Aegilotriticum requienii species ×Aegilotriticum requienii WFO:0001356228 species http://www.worldfloraonline.org/taxon/wfo-0001356228
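
One way to read the first proposal (use a provided authorship to pick among several valid mappings, otherwise keep them all) is sketched below in Python. This is only an illustration and not how nomer is implemented: `Candidate` and `select_by_authorship` are hypothetical names, and the comparison is a plain exact string match.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    """Hypothetical record for one mapping returned for a provided name."""
    relation: str
    external_id: str
    name: str
    authorship: str


def select_by_authorship(candidates: list[Candidate],
                         provided_authorship: Optional[str]) -> list[Candidate]:
    """Narrow multiple mappings to those whose authorship matches exactly.

    If no authorship was provided, or nothing matches, all candidates are
    returned so that the multiple possible mappings remain visible.
    """
    if len(candidates) <= 1 or not provided_authorship:
        return candidates
    matching = [c for c in candidates if c.authorship == provided_authorship]
    return matching or candidates
```

With the two ×Aegilotriticum requienii mappings above as candidates, passing "(Ces., Pass. & Gibelli) P.Fourn." would keep only the WFO:0000841768 mapping, while passing no authorship would keep both.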

Hope that's clear; let me know if anything in my example is too ambiguous.

Originally posted by @jtmiller28 in https://github.com/globalbioticinteractions/nomer/issues/104#issuecomment-1262872493

jhpoelen commented 2 years ago

@jtmiller28 thanks for pointing out that the authorship strings are not being preserved on name parsing using @dimus 's Global Names parser and @mdoering 's GBIF name parser.

I made a first pass at preserving the authorship strings as formatted by the name parsers.

using my.properties

nomer.append.schema.output.example.taxon.rank.order=[{"column":0,"type":"path.order.id"},{"column": 1,"type":"path.order.name"},{"column": 2,"type":"path.order"}]
nomer.append.schema.output=[{"column":0,"type":"externalId"},{"column": 1,"type":"name"},{"column": 2,"type":"authorship"},{"column":3,"type":"rank"}]
nomer.schema.input=[{"column":0,"type":"externalId"},{"column": 1,"type":"name"},{"column": 2,"type":"authorship"},{"column": 3, "type":"rank"}]

I was able to produce:

$ echo -e "\tHomo sapiens Linneaus 1758" | nomer append gbif-parse --include-header --properties my.properties 
[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [gbif-parse]
providedExternalId  providedName    relationName    resolvedExternalId  resolvedName    resolvedAuthorship  resolvedRank
    Homo sapiens Linneaus 1758  SAME_AS     Homo sapiens    Linneaus, 1758

and

$ echo -e "\tHomo sapiens Linneaus 1758" | nomer append gn-parse --include-header --properties my.properties 
[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [gn-parse]
providedExternalId  providedName    relationName    resolvedExternalId  resolvedName    resolvedAuthorship  resolvedRank
    Homo sapiens Linneaus 1758  SAME_AS     Homo sapiens    Linneaus 1758

Note that the GBIF parser adds a comma between the author name and the year, whereas the GN parser does not insert the comma.
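
If the two parser outputs ever need to be compared with each other, the comma difference could be smoothed over before comparison. A minimal Python sketch, assuming the only difference is a comma directly before a four-digit year; `normalize_author_year` is a hypothetical helper, not part of nomer or either parser:

```python
import re


def normalize_author_year(authorship: str) -> str:
    """Drop a comma that directly precedes a four-digit year and collapse spaces."""
    without_comma = re.sub(r",\s*(?=\d{4}\b)", " ", authorship)
    return re.sub(r"\s+", " ", without_comma).strip()


# Both parser outputs shown above reduce to the same string.
print(normalize_author_year("Linneaus, 1758") == normalize_author_year("Linneaus 1758"))
```

Commas between author names, as in "(Ces., Pass. & Gibelli) P.Fourn.", are left untouched because they are not followed by a year.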

Hoping to include this functionality in an upcoming nomer release.

jtmiller28 commented 2 years ago

Thanks jorrit! I'll add this to my workflow and provide feedback once name resolution occurs.

jhpoelen commented 2 years ago

https://github.com/globalbioticinteractions/nomer/releases/tag/0.3.1 is now available with fix for this issue. @jtmiller28 please review.

jtmiller28 commented 2 years ago

Hey jorrit, this is a great addition to nomer for keeping track of intended authorship. I've tried it out and it works as intended.

echo -e "\tHelianthemum scoparium\tNutt. ex Torr. & A.Gray" | nomer append wfo --include-header --properties file://$PWD/my.properties2

providedExternalId providedName providedAuthorship relationName resolvedExternalId resolvedName resolvedAuthorship resolvedRank
[main] INFO org.globalbioticinteractions.nomer.match.WorldOfFloraOnlineTaxonService - [WORLD_OF_FLORA_ONLINE] taxonomy already indexed at [/home/jt-miller/.cache/nomer/world_of_flora_online/world_of_flora_online], no need to import.
Helianthemum scoparium Nutt. ex Torr. & A.Gray HAS_ACCEPTED_NAME WFO:0001295598 Helianthemum scoparium Nutt. ex Torr. & A.Gray species

There is a bit of a problem when including authorship with real-world data, however: it seems quite common for authors not to be written with the exact character string that the catalogue lists. This is probably due to ambiguity in how an author should be notated, for example abbreviation, order of names listed, etc.

The result is that most mappings end up listed as NONE simply because the author field is filled in and there is at least one character of mismatch in the author string. A workaround I am planning to use: run my list through nomer using scientificName only > find all instances where multiple mappings are generated (indicating that authorship matters for determining the correct name mapping according to the catalogue) > rerun only the names with multiple mappings, this time including authorship.
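
The middle step of that workaround (finding the names that received multiple mappings) could look roughly like this in Python. The sketch assumes the name-only run was saved as tab-separated text with `--include-header`, and that it contains the `providedName`, `relationName` and `resolvedExternalId` columns shown in the headers earlier in this thread; `names_with_multiple_mappings` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict


def names_with_multiple_mappings(tsv_text: str) -> list[str]:
    """Return provided names that resolved to more than one distinct ID."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    ids_by_name = defaultdict(set)
    for row in reader:
        resolved_id = (row.get("resolvedExternalId") or "").strip()
        # Rows with relation NONE carry no usable mapping, so skip them.
        if row.get("relationName") != "NONE" and resolved_id:
            ids_by_name[row["providedName"]].add(resolved_id)
    return sorted(name for name, ids in ids_by_name.items() if len(ids) > 1)
```

Only the names returned here would then need to be rerun with their authorship column filled in.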

I think we can both agree that fuzzy matching is a bit precarious and so not the best solution here, but it might be interesting to enable nomer to fall back to using only the provided scientificName field when it fails to match a name with authorship provided. As an example from my data:

echo -e "\tHelianthemum scoparium\tNutt. ex Torr. & A. Gray" | nomer append wfo --include-header --properties file://$PWD/my.properties2

returns:

Helianthemum scoparium Nutt. ex Torr. & A. Gray NONE Helianthemum scoparium Nutt. ex Torr. & A. Gray

After looking at the WFO catalogue, it becomes apparent that the only difference from the name pulled from GBIF is that A. Gray should be A.Gray, without the white space:

echo -e "\tHelianthemum scoparium\tNutt. ex Torr. & A.Gray" | nomer append wfo --include-header --properties file://$PWD/my.properties2

returns:

Helianthemum scoparium Nutt. ex Torr. & A.Gray HAS_ACCEPTED_NAME WFO:0001295598 Helianthemum scoparium Nutt. ex Torr. & A.Gray species
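
For whitespace-only differences like this one, authorship strings could be compared on a whitespace-free key rather than character for character. A small Python sketch; `authorship_key` is a hypothetical helper, and dropping all whitespace is an assumption that may be too lenient for some catalogues:

```python
import re


def authorship_key(authorship: str) -> str:
    """Comparison key that ignores all whitespace in an authorship string."""
    return re.sub(r"\s+", "", authorship)


# "A. Gray" and "A.Gray" now compare as equal.
print(authorship_key("Nutt. ex Torr. & A. Gray") == authorship_key("Nutt. ex Torr. & A.Gray"))
```

The key is meant for comparison only; the original strings should still be what gets reported.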

This is easy to find in a simple example like the one above, but it becomes rather daunting when presented with tens of thousands of such erroneous mappings caused only by authorship. If instead there were a field denoting that authorship matching failed, and that nomer then retried with scientificName only and succeeded, this could alleviate many false NONE matches that are due only to ambiguity in writing authorship to the catalogue's exact standard. It would also provide a way to track where these authorships failed, if one were interested in tracking down author names that appear erroneous in their datasets.
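
The retry-and-flag behaviour described here might be sketched as follows in Python. Nothing in it is an existing nomer feature: `lookup` stands in for whatever catalogue query is available (given a name and an optional authorship, it returns a relation and ID, or None), and the `authorship_ignored` field is a hypothetical name for the proposed flag.

```python
from typing import Callable, NamedTuple, Optional

# Hypothetical catalogue query: (name, authorship or None) -> (relation, id) or None.
Lookup = Callable[[str, Optional[str]], Optional[tuple[str, str]]]


class Result(NamedTuple):
    relation: str
    resolved_id: Optional[str]
    authorship_ignored: bool  # True if the match only succeeded without authorship


def resolve_with_fallback(name: str, authorship: Optional[str],
                          lookup: Lookup) -> Result:
    """Try name plus authorship first, then retry with the name alone."""
    if authorship:
        hit = lookup(name, authorship)
        if hit:
            return Result(hit[0], hit[1], False)
    hit = lookup(name, None)
    if hit:
        # Flag the row only when an authorship was provided and had to be dropped.
        return Result(hit[0], hit[1], bool(authorship))
    return Result("NONE", None, False)
```

Filtering on the flag afterwards would list exactly the records whose authorship strings deserve a closer look.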

Hope that makes sense, thanks for all of your attention to this discussion about enhancement!

dimus commented 2 years ago

Note that authors are sometimes abbreviated, omitted, or ordered in a different way. Because of that, direct comparison of authors does not work all the time. So gnresolver uses this algorithm for matching: https://github.com/gnames/gnames/blob/master/ent/score/auth.go
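
To make the three cases concrete (abbreviated, omitted, reordered), here is a deliberately naive Python sketch of a tolerant comparison. It is not the gnames algorithm linked above and does not attempt to reproduce its scoring; `author_tokens` and `authors_compatible` are hypothetical helpers, and years and single-letter initials are simply ignored.

```python
import re

CONNECTORS = {"ex", "et", "and", "in"}


def author_tokens(authorship: str) -> list[str]:
    """Lower-cased alphabetic tokens, minus connectors and one-letter initials."""
    tokens = re.findall(r"[^\W\d_]+", authorship.lower())
    return [t for t in tokens if t not in CONNECTORS and len(t) > 1]


def authors_compatible(a: str, b: str) -> bool:
    """True if every author in the shorter list abbreviates, or is abbreviated
    by, some author in the longer list, regardless of order."""
    ta, tb = author_tokens(a), author_tokens(b)
    if not ta or not tb:
        return False
    shorter, longer = sorted((ta, tb), key=len)
    return all(any(x.startswith(y) or y.startswith(x) for y in longer)
               for x in shorter)
```

Under these assumptions "Nutt. ex Torr. & A.Gray" and "Torrey & Gray ex Nuttall" count as compatible, while "(Ces., Pass. & Gibelli) P.Fourn." and "(Ces., Pass. & Gibelli) Veldkamp" do not. Prefix matching like this can produce false positives, so it is better suited to flagging rows for review than to deciding matches.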

jhpoelen commented 2 years ago

note that v0.4.1 of Nomer now includes authorship in the default input/output schema

echo -e "\tHelianthemum scoparium\tNutt. ex Torr. & A.Gray" | nomer append col --include-header

providedExternalId	providedName	providedAuthorship	relationName	resolvedExternalId	resolvedName	resolvedAuthorship	resolvedRank	resolvedCommonNames	resolvedPath	resolvedPathIds	resolvedPathNames	resolvedPathAuthorships	resolvedExternalUrl
	Helianthemum scoparium	Nutt. ex Torr. & A.Gray	NONE		Helianthemum scoparium	Nutt. ex Torr. & A.Gray

jhpoelen commented 2 years ago

And with WFO

echo -e "\tHelianthemum scoparium\tNutt. ex Torr. & A.Gray" |  nomer append wfo --include-header

providedExternalId	providedName	providedAuthorship	relationName	resolvedExternalId	resolvedName	resolvedAuthorship	resolvedRank	resolvedCommonNames	resolvedPath	resolvedPathIds	resolvedPathNames	resolvedPathAuthorships	resolvedExternalUrl
	Helianthemum scoparium	Nutt. ex Torr. & A.Gray	HAS_ACCEPTED_NAME	WFO:0001295598	Helianthemum scoparium	Nutt. ex Torr. & A.Gray	species		Angiosperms | Malvales | Cistaceae | Helianthemum | Helianthemum scoparium	WFO:9949999999 | WFO:9000000312 | WFO:7000000136 | WFO:4000017201 | WFO:0001295598	phylum | order | family | genus | species		http://www.worldfloraonline.org/taxon/wfo-0001295598

globalbioticinteractions / nomer

Suggest to include authorship past name parsing gn-parse, gbif-parse #112