globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0

Inclusion of Authorship effect on name resolution? #104

Closed jtmiller28 closed 2 years ago

jtmiller28 commented 2 years ago

Why does the name relation register as NONE for most of wfo's outputs, even when a resolution is available and seems to occur?

echo -e "\tLysichiton americanus Hultén & H.St.John" | nomer append wfo

Produces the following output:

Lysichiton americanus Hultén & H.St.John NONE Lysichiton americanus Hultén & H.St.John

Name is accepted and should be integrated into wfo: http://www.worldfloraonline.org/taxon/wfo-0000231603

See also issue #76: I agree with changing the output to NONE_FOUND to make the resolution status clearer.

jtmiller28 commented 2 years ago

Update: it appears that including the authorship is what interferes with resolving the name.

echo -e "\tLysichiton americanus" | nomer append wfo

returns: Lysichiton americanus HAS_ACCEPTED_NAME WFO:0000231603 Lysichiton americanus species Angiosperms | Alismatales | Araceae | Lysichiton | Lysichiton americanus WFO:9949999999 | WFO:9000000013 | WFO:7000000042 | WFO:4000022556 | WFO:0000231603 phylum | order | family | genus | species http://www.worldfloraonline.org/taxon/wfo-0000231603

Any recommendations for including authorship in a nomer query? gnparse is a tool to break up names, but can we include the author in the query? https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1663-3

jhpoelen commented 2 years ago

@jtmiller28 thanks for sharing these specific examples!

If you'd like to use current functionality, you may want to check out the gnparse and/or gbif-parse integrations that Nomer provides:

gbif-parse -

$ echo -e "\tLysichiton americanus Hultén & H.St.John"  | nomer append gbif-parse
[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [gbif-parse]
    Lysichiton americanus Hultén & H.St.John    SAME_AS     Lysichiton americanus                           

and gn-parse -

$ echo -e "\tLysichiton americanus Hultén & H.St.John"  | nomer append gn-parse
[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [gn-parse]
    Lysichiton americanus Hultén & H.St.John    SAME_AS     Lysichiton americanus       

With this, you can create a processing pipeline like:

echo ... \
 | nomer replace gn-parse \
 | nomer append wfo

If you'd like to provide a parsed version of the authorship, we may want to consider something like:

echo -e "\tLysichiton americanus\tHultén & H.St.John" | nomer append wfo

where we'd make sure that nomer understands that the third column (e.g., Hultén & H.St.John) should be interpreted as authorship. Then, the authorship can be matched along with the name against corresponding fields in wfo.

Curious to hear your thoughts on this.

jtmiller28 commented 2 years ago

The consideration came up when dealing with subspecies and varieties. For a streamlined process, it seems better to provide the authorship so that the name maps accurately to the accepted name without a manual decision. I am having trouble, however, when there is an accepted variety but I can only obtain a synonym resolution back to the species level.

Ex.

echo -e "\tYucca brevifolia" | nomer append wfo

Returns:

Yucca brevifolia HAS_ACCEPTED_NAME WFO:0000752275 Yucca brevifolia species Angiosperms | Asparagales | Asparagaceae | Yucca | Yucca brevifolia WFO:9949999999 | WFO:9000000036 | WFO:7000000050 | WFO:4000041098 | WFO:0000752275 phylum | order | family | genus | species http://www.worldfloraonline.org/taxon/wfo-0000752275
Yucca brevifolia SYNONYM_OF WFO:0000753634 Yucca baccata var. brevifolia variety Angiosperms | Asparagales | Asparagaceae | Yucca | Yucca baccata | Yucca baccata var. brevifolia WFO:9949999999 | WFO:9000000036 | WFO:7000000050 | WFO:4000041098 | WFO:0000752057 | WFO:0000753634 phylum | order | family | genus | species | variety http://www.worldfloraonline.org/taxon/wfo-0000753634

This suggests that the accepted name should be Yucca baccata and that the variety is a synonym. However, if you follow that wfo link, you can see that the variety is itself an accepted name and can stand on its own. Ideally, adding authorship would resolve the name to the accepted name for the variety.

When authorship is provided, however:

echo -e "\tYucca brevifolia\tL.D.Benson & Darrow" | nomer append wfo

Yucca brevifolia L.D.Benson & Darrow HAS_ACCEPTED_NAME WFO:0000752275 Yucca brevifolia species Angiosperms | Asparagales | Asparagaceae | Yucca | Yucca brevifolia WFO:9949999999 | WFO:9000000036 | WFO:7000000050 | WFO:4000041098 | WFO:0000752275 phylum | order | family | genus | species http://www.worldfloraonline.org/taxon/wfo-0000752275
Yucca brevifolia L.D.Benson & Darrow SYNONYM_OF WFO:0000753634 Yucca baccata var. brevifolia variety Angiosperms | Asparagales | Asparagaceae | Yucca | Yucca baccata | Yucca baccata var. brevifolia WFO:9949999999 | WFO:9000000036 | WFO:7000000050 | WFO:4000041098 | WFO:0000752057 | WFO:0000753634 phylum | order | family | genus | species | variety http://www.worldfloraonline.org/taxon/wfo-0000753634

the same mapping occurs, even when the variety is an accepted name. http://www.worldfloraonline.org/taxon/wfo-0000753634

Any suggestions?

jtmiller28 commented 2 years ago

A bit of an update on authorship affecting name resolution. It appears that the authorship recorded in WFO's catalogue can matter for taxonomic name resolution. To generate the dataset I used:

nomer list --properties /path/to/myproperties wfo > wfo_names_w_authorship

After reading this list into R, I grouped it by the original scientific name, kept only the names that occur more than once, and then reduced the resolvedScientificName mappings to unique resolutions.

I built a .csv to summarize these instances of authorship-dependent mapping (~32,000 name-mapping instances). This relationship seems to trip nomer into printing two possible (taxonomically valid) mappings.
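For anyone who prefers to stay in the shell rather than R, the grouping step described above could be approximated with awk. This is only a sketch and not the script used to build the .csv: `ambiguous_names` is a hypothetical helper (not part of nomer), and the column positions of the provided name and the resolved identifier are passed as arguments because they depend on the schema in the properties file.

```shell
# Hypothetical helper, not part of nomer.
# Usage: ambiguous_names NAME_COL ID_COL < mappings.tsv
# Reads tab-separated rows and prints every name that maps to more than
# one distinct resolved id, followed by the number of distinct ids.
ambiguous_names() {
  awk -v n="$1" -v i="$2" 'BEGIN { FS = OFS = "\t" }
  # count each (name, id) pair only once
  !seen[$n, $i]++ { count[$n]++ }
  END {
    for (name in count)
      if (count[name] > 1) print name, count[name]
  }' | sort
}
```

For example, `ambiguous_names 2 4 < mappings.tsv` would treat column 2 as the provided name and column 4 as the resolved identifier (`mappings.tsv` is a placeholder file name); both numbers need adjusting to match your own output.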

As an example:

echo -e "\t×Aegilotriticum requienii\t(Ces., Pass. & Gibelli) P.Fourn." | nomer append gn-parse | nomer append wfo

Provides:

×Aegilotriticum requienii (Ces., Pass. & Gibelli) P.Fourn. SAME_AS × Aegilotriticum requienii SYNONYM_OF WFO:0000841768 ×Aegilotriticum triticoides species ×Aegilotriticum triticoides WFO:0000841768 species http://www.worldfloraonline.org/taxon/wfo-0000841768

×Aegilotriticum requienii (Ces., Pass. & Gibelli) P.Fourn. SAME_AS × Aegilotriticum requienii HAS_ACCEPTED_NAME WFO:0001356228 ×Aegilotriticum requienii species ×Aegilotriticum requienii WFO:0001356228 species http://www.worldfloraonline.org/taxon/wfo-0001356228

This is a bit undesirable because, as far as I can tell, nomer currently removes authorship from its output (I'm unsure whether there's an option to change this), which makes it difficult to re-check names with multiple mappings against the intended authorship. It could also require quite a bit of manual resolution if a large number of the 32,000 names appear in one's taxonomic list. I'll attach a .csv of the names that can match multiple entries: multiple_mappings.csv

jhpoelen commented 2 years ago

Thanks for sharing your detailed notes @jtmiller28 !

Would you be able to list the properties that you used, so I can try and reproduce your results?

Also, what would your ideal/desired behavior look like for the specific example you used?

Thanks for being patient with me . . .

jtmiller28 commented 2 years ago

Ah yes, my bad: my.properties.gz

In my ideal world, if nomer detects multiple taxonomically valid names for a supplied name, it would then consult a second input column containing the authorship, if provided, to select the intended resolution. If the authorship is not provided, it would continue to generate multiple possible mappings.

Example:

echo -e "\t×Aegilotriticum requienii\t(Ces., Pass. & Gibelli) P.Fourn." | nomer append gn-parse | nomer append wfo

would provide:

×Aegilotriticum requienii (Ces., Pass. & Gibelli) P.Fourn. SAME_AS × Aegilotriticum requienii SYNONYM_OF WFO:0000841768 ×Aegilotriticum triticoides species ×Aegilotriticum triticoides WFO:0000841768 species http://www.worldfloraonline.org/taxon/wfo-0000841768

Alternatively, passing the authorship of the original name through nomer could alleviate the work of tracing taxonomic names that received multiple mappings. Something like:

echo -e "\t×Aegilotriticum requienii\t(Ces., Pass. & Gibelli) P.Fourn." | nomer append gn-parse | nomer append wfo

×Aegilotriticum requienii (Ces., Pass. & Gibelli) P.Fourn. SAME_AS × Aegilotriticum requienii INTENDED_AUTHOR (Ces., Pass. & Gibelli) P.Fourn. SYNONYM_OF WFO:0000841768 ×Aegilotriticum triticoides species ×Aegilotriticum triticoides

×Aegilotriticum requienii (Ces., Pass. & Gibelli) P.Fourn. SAME_AS × Aegilotriticum requienii INTENDED_AUTHOR (Ces., Pass. & Gibelli) Veldkamp HAS_ACCEPTED_NAME WFO:0001356228 ×Aegilotriticum requienii species ×Aegilotriticum requienii WFO:0001356228 species http://www.worldfloraonline.org/taxon/wfo-0001356228

Hope that's clear; let me know if anything in my example is too ambiguous.

jtmiller28 commented 2 years ago

Ah, and for clarity in my ideal solution, if for example:

echo -e "\t×Aegilotriticum requienii" | nomer append gn-parse | nomer append wfo

then provide:

×Aegilotriticum requienii SAME_AS × Aegilotriticum requienii SYNONYM_OF WFO:0000841768 ×Aegilotriticum triticoides species ×Aegilotriticum triticoides WFO:0000841768 species http://www.worldfloraonline.org/taxon/wfo-0000841768

×Aegilotriticum requienii SAME_AS × Aegilotriticum requienii HAS_ACCEPTED_NAME WFO:0001356228 ×Aegilotriticum requienii species ×Aegilotriticum requienii WFO:0001356228 species http://www.worldfloraonline.org/taxon/wfo-0001356228

So that if the user does not provide an authorship it maps to multiple possible names by default (as is the normal result of nomer right now).

Also, probably as a default: if the provided authorship does not match the one recorded in the catalogue, then return multiple mappings as well (as nomer does by default right now).
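To make the proposed behaviour concrete, here is a rough sketch of it as a post-processing filter in the shell. It is not how nomer works; `prefer_author_match` is a hypothetical helper, and it assumes tab-separated candidate rows in which column 2 is the provided name and the columns holding the provided and the resolved authorship are given as arguments.

```shell
# Hypothetical post-filter, not part of nomer.
# Usage: prefer_author_match PROVIDED_AUTHOR_COL RESOLVED_AUTHOR_COL < candidates.tsv
# For each provided name and authorship, keeps only the rows whose resolved
# authorship is identical to the provided one. If the provided authorship
# is empty, or no row matches, all rows for that name are kept.
prefer_author_match() {
  awk -v p="$1" -v r="$2" 'BEGIN { FS = "\t" }
  {
    line[NR] = $0
    key[NR] = $2 FS $p
    # force a string comparison; an empty provided authorship never matches
    ok[NR] = (($p "") != "" && ($p "") == ($r ""))
    if (ok[NR]) matched[key[NR]] = 1
  }
  END {
    for (k = 1; k <= NR; k++)
      if (!(key[k] in matched) || ok[k]) print line[k]
  }'
}
```

The comparison is deliberately exact, so any difference in spacing or abbreviation falls back to returning every candidate, which mirrors the default described above.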

jhpoelen commented 2 years ago

@jtmiller28 thanks again for elaborating and sharing your desired results

from

curl -L "https://github.com/globalbioticinteractions/nomer/files/9679319/my.properties.gz" | gunzip

your properties are:

nomer.append.schema.output.example.taxon.rank.order=[{"column":0,"type":"path.order.id"},{"column": 1,"type":"path.order.name"},{"column": 2,"type":"path.order"}]
nomer.append.schema.output=[{"column":0,"type":"externalId"},{"column": 1,"type":"name"},{"column": 2,"type":"authorship"},{"column":3,"type":"rank"}]
nomer.schema.input=[{"column":0,"type":"externalId"},{"column": 1,"type":"name"},{"column": 2,"type":"authorship"},{"column": 3, "type":"rank"}]

which shows that the authorship field is included in both the input and output schemas.

jhpoelen commented 2 years ago

with recent enhancements, authorship is now taken into account when matching, provided that:

  1. the authorship is a non-empty value
  2. the authorship column is defined in input schema

Examples:

echo -e "\tYucca baccata var. brevifolia\tL.D.Benson & Darrox" | nomer append wfo --properties my.properties
    Yucca baccata var. brevifolia   L.D.Benson & Darrox NONE        Yucca baccata var. brevifolia   L.D.Benson & Darrox 

Note that L.D.Benson & Darrox is not the authorship according to WFO.

However, when the correct authorship, L.D.Benson & Darrow, is used (note the last character is w, not x):

$ echo -e "\tYucca baccata var. brevifolia\tL.D.Benson & Darrow" | nomer append wfo --properties file://$PWD/my.properties
    Yucca baccata var. brevifolia   L.D.Benson & Darrow HAS_ACCEPTED_NAME   WFO:0000753634  Yucca baccata var. brevifolia   L.D.Benson & Darrow variety

the variety is matched, just like when the authorship is omitted:

$ echo -e "\tYucca baccata var. brevifolia\t" | nomer append wfo --properties my.properties
    Yucca baccata var. brevifolia       HAS_ACCEPTED_NAME   WFO:0000753634  Yucca baccata var. brevifolia   L.D.Benson & Darrow variety

Note that exact matches are applied, so if anything mismatches (extra whitespace, upper/lowercase, abbreviations), the match will not succeed. Pre- or post-processing of the authorship strings may be needed to normalize, expand, or otherwise transform them.
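As an illustration of the kind of pre-processing that may be needed, the sketch below tidies whitespace in the authorship column before the rows reach nomer. `normalize_authorship` is a hypothetical helper, not a nomer feature, and it assumes authorship is the third tab-separated column, as in the input schema above. It only handles plain spaces; differences in case or abbreviation would need their own rules.

```shell
# Hypothetical helper, not part of nomer.
# Reads tab-separated rows on stdin and cleans up spaces in column 3
# (assumed to be the authorship column); other columns are left untouched.
normalize_authorship() {
  awk 'BEGIN { FS = OFS = "\t" }
  NF >= 3 {
    gsub(/ +/, " ", $3)     # collapse runs of spaces into one
    gsub(/^ | $/, "", $3)   # drop a leading or trailing space
  }
  { print }'
}
```

It could sit in front of the matcher, e.g. `... | normalize_authorship | nomer append wfo --properties my.properties`.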

jhpoelen commented 2 years ago

@jtmiller28 hoping to publish a new nomer version shortly, so that you can verify whether your desired changes have actually been implemented.

Thanks again for being patient and for providing such detailed requests / proposals.

jtmiller28 commented 2 years ago

Perfect fix, thanks!