globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0

Some names (e.g., Ancylandrena atoposoma) are currently on DiscoverLife's active site but not matched in Nomer via discoverlife #149

Closed jtmiller28 closed 6 months ago

jtmiller28 commented 1 year ago

A newly published paper, "Completeness analysis for over 3000 United States bee species identifies persistent data gap" (Chesshire et al 2023), reveals some names that are on DL's website but currently unavailable via Nomer.

Example:

echo -e "\tAncylandrena atoposoma" | nomer append discoverlife

yields:

    Ancylandrena atoposoma  NONE        Ancylandrena atoposoma

This is a known name according to the DL website: https://www.discoverlife.org/mp/20q

Also, to confirm that I am running the newest version of Nomer: nomer version yields 0.4.9

My question is whether another pull from the DL name list is necessary to align current names?

As a side note: the authors of Chesshire et al 2023 went through the names of all United States bees and corrected them via expert designations. They provide a file, chesshires-name-list.xlsx, that shows all original names pulled from their occurrence data aggregators (GBIF and SCAN), their corrections via name alignment and ending corrections, and the final names after correction. With their permission, this might be a great list to add for United States bees. It might also be a great way to tackle fuzzy names without implementing a character replacement algorithm in Nomer, as they provide names with known incorrect spellings from aggregators like GBIF, together with their final resolution mapping.
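The lookup-table idea can be sketched in a few lines of shell. This is only an illustration: the helper name correct_names and the two-column, tab-separated layout (verbatim name, corrected name) are assumptions, not the actual layout of the Chesshire et al. file.

```shell
# Hypothetical helper: replace known misspellings using a two-column TSV
# (verbatim name <TAB> corrected name). Names without an entry pass through.
# Usage: correct_names corrections.tsv < names.txt
correct_names() {
  awk -v map="$1" '
    BEGIN {
      # load the corrections table into memory
      while ((getline line < map) > 0) {
        split(line, col, "\t")
        fix[col[1]] = col[2]
      }
    }
    # print the correction when one is known, else the name unchanged
    { print (($0 in fix) ? fix[$0] : $0) }
  '
}
```

The corrected names could then be tab-prefixed and piped into nomer append discoverlife as in the example above. An exact-match table like this only covers misspellings that someone has already recorded; it does not catch new ones.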

jhpoelen commented 1 year ago

@jtmiller28 thanks for sharing this specific example.

I was able to independently reproduce:

$ echo -e "\tAncylandrena atoposoma" | nomer append discoverlife
    Ancylandrena atoposoma  NONE        Ancylandrena atoposoma      

also, I was able to find the name on the https://www.discoverlife.org website as you mentioned. See screenshot below.

image

And, I much like your idea to re-use Chesshire et al 2023 to help complement the existing resources.

Am also curious to hear from @seltmann on the topic.

Next step for me is to figure out why Ancylandrena atoposoma is not picked up by Nomer.

jhpoelen commented 1 year ago

@jtmiller28 would you happen to have the full list of active DiscoverLife names that appear to not match via Nomer's support for DiscoverLife taxa?

jtmiller28 commented 1 year ago

I thought I did initially. However, after closer inspection I noticed that these were more nuanced: some are corrections made by those experts for new designations that are not yet reflected in the DL database. I was hoping it was just an update issue, but I'll pull the initial names from Chesshire, run Nomer on them, and see where that leads. Hopefully more soon.

jhpoelen commented 1 year ago

@jtmiller28 I took a look at your unexpected mismatches of nomer against discoverlife.

It turned out that our DiscoverLife friends upgraded their lists in 2022, and Nomer was still using the older DiscoverLife copy. Also see #80.

After upgrading to the "new" discoverlife, I was able to produce the results below

echo -e "\tAncylandrena atoposoma"\
 | nomer append discoverlife

yielded:

[main] INFO org.globalbioticinteractions.nomer.match.DiscoverLifeTaxonService - DiscoverLife name indexing started...
[main] INFO org.globalbioticinteractions.nomer.match.DiscoverLifeTaxonService - [51348] DiscoverLife names were indexed in 14s (@ 3667 names/s)
    Ancylandrena atoposoma  HAS_ACCEPTED_NAME   https://www.discoverlife.org/mp/20q?search=Ancylandrena+atoposoma   Ancylandrena atoposoma  (Cockerell, 1934)   species     Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Ancylandrena atoposoma https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Andrenidae | https://www.discoverlife.org/mp/20q?search=Ancylandrena+atoposoma   kingdom | phylum | class | order | family | species     https://www.discoverlife.org/mp/20q?search=Ancylandrena+atoposoma
jtmiller28 commented 1 year ago

Gotcha, that makes sense! Is there a possibility in the future for automation in updating DL with each edition?

I have some code to run that'll build the known list of names that are in DL for US bees according to that paper, so once Nomer is fully updated I can run it against it and check if there are any differences.

jhpoelen commented 1 year ago

yep, working on it, see #80

jhpoelen commented 1 year ago

@jtmiller28 please verify that Nomer v0.4.10 is matching the names as expected.

Thanks again for pointing this out - these are the kinds of things that make Nomer (and other open source software) tick.

jhpoelen commented 1 year ago

@jtmiller28 did you get a chance to confirm that the newer discoverlife is now used in the recent version of Nomer and the associated name alignment template tool?

jtmiller28 commented 1 year ago

@jhpoelen Apologies for the delayed response. I was getting rather enigmatic outputs, so I assumed some error was occurring on my end. I still get them now, however, so I'll share what I found.

Starting from the beginning: Chesshire has an outlined csv that denotes verbatim names (after parsing), 2 decision steps (one through automated name alignment, 2nd reviewed by taxonomist John Ascher), and 1 field containing notes on alignment decisions. This data can be found here: chesshires-namelist.csv

To test the names that are noted as present in Discover Life, I grepped for anything with "DiscoverLife" or "DL" in the field that records alignment decisions, obtaining the following file (with one instance removed, where the notes suggested that white space caused the alignment failure). I then ran these verbatimNames through nomer, obtaining the following: nomer-test-output.txt
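For anyone wanting to repeat this filtering step, a rough sketch follows. The helper name dl_names, the tab-separated input, and the column positions are assumptions for illustration; the real file is a csv whose exact layout is not shown in this thread, and quoted fields are not handled here.

```shell
# Hypothetical helper: select names whose notes column mentions DL or
# DiscoverLife, and emit them in the tab-prefixed form used with nomer
# in this thread. Assumes a tab-separated file without quoted fields.
# Usage: dl_names FILE NAME_COLUMN NOTES_COLUMN
dl_names() {
  awk -F '\t' -v n="$2" -v c="$3" '
    # keep rows whose notes mention DL; print "<TAB>name"
    $c ~ /DiscoverLife|DL/ { print "\t" $n }
  ' "$1"
}
# e.g. dl_names names.tsv 1 4 | nomer append discoverlife
```

The pattern is case-sensitive and matches "DL" anywhere in the notes, so it mirrors a plain grep and may over-select.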

This is where things get a bit tricky: there are still 385 names that fail to align. The following strings are given as the reason for the name change:

"accepted synonym when entered into DL website, still give to John"
"collapsed subspecies/switched based on DL - still give to John"
"DL list indicates that this is the accepted synonym"
"Accepted Name in the DL list"
"valid on ITIS and DL, Keep but do run by John"
"On DL website and ITIS, Keep but do run by John"
"DL list indicates that this is the accepted synonym, pass by John"

The problem is that I can't replicate their resolution across the name list by using DL. In some instances the name is definitely on discoverlife but not recognized by Nomer; in some, searching the site only leads to the genus (and the name is not in Nomer); some are dead links on DL; and in others the stated reason for the mapping is completely off from what they suggest.

First Case: Name is on DL and not currently seen by Nomer.

echo -e "\tPseudopanurgus parvus" | nomer append discoverlife

yields:

    Pseudopanurgus parvus  NONE        Pseudopanurgus parvus

Reason for name alignment given by Chesshire: "Accepted Name in the DL list". A manual search on https://www.discoverlife.org/ shows that the name is present on DL.

Second Case: The website search tool gives a mismatched mapping, but the name shows up as a synonym when the final name is searched.

echo -e "\tLasioglossum nymphaerum" | nomer append discoverlife

yields:

    Lasioglossum nymphaerum  NONE        Lasioglossum nymphaerum

The alignment reason attached to this particular name is "accepted synonym when entered into DL website, still give to John". Searching https://www.discoverlife.org/ with Lasioglossum nymphaerum yields just the Lasioglossum genus. Backtracking from their final decision, however, we can see that Lasioglossum nymphaerum is synonymous with Lasioglossum oceanicum. A similar scenario seems present for Andrena californica, though note that the one that is actually searchable on DL is Andrena californica wickhami.

Third Case: Mapping fails, dead link on the DL site.

echo -e "\tHeterosarus bakeri" | nomer append discoverlife

yields:

    Heterosarus bakeri  NONE        Heterosarus bakeri

Reason noted for alignment: "DL list indicates that this is the accepted synonym". Searching for the name through DL yields a dead link? The authors' suggested name is Pseudopanurgus bakeri, which does not list any synonyms...

Fourth Case: Presumed erroneous reason field for alignment in their name table, e.g.

echo -e "\tHeterosarus helianthi" | nomer append discoverlife

yields:

    Heterosarus helianthi  NONE        Heterosarus helianthi

Reason for alignment: "DL list indicates that this is the accepted synonym, pass by John". When Discover Life is searched for this name, you arrive at a moth in Lepidoptera: Hellinsia helianthi (Walsingham, 1880). The name they suggested is Pseudopanurgus helianthi, preserving the specificEpithet but opting for a bee genus. See Ashmeadiella washingtonensis for another case of this, where the specificEpithet is preserved but the genus is dropped for unexplained reasons. It seems to be a purposeful decision, but it is odd that they didn't note that decision correctly. This probably is not something to fix on the Nomer end, as there shouldn't be moths/fungi mapping to bees by default, but I figured I'd make it apparent what's happening with some names.

To confirm the version: nomer version yields 0.4.10

Apologies for the rather lengthy response, but I was a bit perplexed while going through it all...

Thanks! JT

jhpoelen commented 1 year ago

@jtmiller28 thanks for your specific examples and for being patient with me. Hoping to have a look sooner rather than later.

jtmiller28 commented 1 year ago

Thanks jorrit for your constant attention to these issues!

jhpoelen commented 1 year ago

I was able to reproduce your four examples of "NONE" matches via

https://github.com/jhpoelen/chesshires/actions/runs/4395176631

with abbreviated alignment report including:

| providedName | alignRelation | alignedCatalogName | alignedExternalId | alignedName | alignedAuthority |
| --- | --- | --- | --- | --- | --- |
| Heterosarus bakeri | NONE | discoverlife | | Heterosarus bakeri | |
| Heterosarus bakeri | SYNONYM_OF | itis | http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=753406 | Pseudopanurgus bakeri | (Cockerell, 1896) |
| Heterosarus bakeri | SYNONYM_OF | ncbi | https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=625948 | Pseudopanurgus bakeri | |
| Heterosarus helianthi | NONE | discoverlife | | Heterosarus helianthi | |
| Heterosarus helianthi | SYNONYM_OF | itis | http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=753448 | Pseudopanurgus helianthi | Mitchell, 1960 |
| Heterosarus helianthi | NONE | ncbi | | Heterosarus helianthi | |
| Lasioglossum nymphaerum | NONE | discoverlife | | Lasioglossum nymphaerum | |
| Lasioglossum nymphaerum | NONE | itis | | Lasioglossum nymphaerum | |
| Lasioglossum nymphaerum | NONE | ncbi | | Lasioglossum nymphaerum | |
| Pseudopanurgus parvus | NONE | discoverlife | | Pseudopanurgus parvus | |
| Pseudopanurgus parvus | HAS_ACCEPTED_NAME | itis | http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=753479 | Pseudopanurgus parvus | (Robertson, 1892) |
| Pseudopanurgus parvus | NONE | ncbi | | Pseudopanurgus parvus | |

jhpoelen commented 1 year ago

In trying to verify your claim:

First Case: Name is on DL and not currently seen by Nomer. echo -e "\tPseudopanurgus parvus" | nomer append discoverlife yields Pseudopanurgus parvus NONE Pseudopanurgus parvus

Reason for name alignment given by Chesshire: "Accepted Name in the DL list" https://www.discoverlife.org/ shows it is present on DL through manual search options

I think I found the associated species page at https://www.discoverlife.org/mp/20q?search=Protandrena+parva (see screenshot below)

image

Your detailed info was helpful to narrow down the suspicious (non) matches. If you can, please include evidence from DiscoverLife (e.g., link + screenshot), that would save me some time, assuming that you already had found the DL Url.

Am hoping to work through your examples, attempt to fix them one by one, and see whether there's a pattern. Thanks for being patient.

jhpoelen commented 1 year ago

@jtmiller28 thanks to your detailed notes, I was able to find the root cause:

Nomer took the names as is from DiscoverLife, so it expected provided names to include the subgenus whenever DiscoverLife used it. Your example shows that omitting the subgenus happens, and should be accounted for.

In other words,

echo -e "\tPseudopanurgus parvus" | nomer append discoverlife

and

echo -e "\tPseudopanurgus (Heterosarus) parvus " | nomer append discoverlife

should both appear as synonyms of

Protandrena parva (Robertson, 1892)
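To make the idea concrete, here is a minimal shell sketch of deriving the subgenus-free form of a name. It is not Nomer's actual code; the helper name strip_subgenus is hypothetical, and the pattern assumes the subgenus is a single capitalized word in parentheses directly after the genus.

```shell
# Hypothetical helper: drop a parenthesized subgenus that directly follows
# the genus, e.g. "Genus (Subgenus) epithet" becomes "Genus epithet".
# Names without a subgenus pass through unchanged. Leading whitespace
# (such as the tab used in nomer input) is preserved.
strip_subgenus() {
  sed -E 's/^([[:space:]]*[A-Z][a-z]+) \([A-Z][a-z]+\) /\1 /'
}
```

Indexing each DiscoverLife name under both its original and its stripped form would let a provided name match either way, which is the behavior described above.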

jhpoelen commented 1 year ago

After updating nomer to include matches excluding the subgenus, I was able to generate the following results:

echo -e "\tPseudopanurgus parvus"\
 | nomer append --include-header discoverlife\
 | mlr --itsv --omd cat
providedExternalId providedName relationName resolvedExternalId resolvedName resolvedAuthorship resolvedRank resolvedCommonNames resolvedPath resolvedPathIds resolvedPathNames resolvedPathAuthorships resolvedExternalUrl
Pseudopanurgus parvus SYNONYM_OF https://www.discoverlife.org/mp/20q?search=Protandrena+parva Protandrena parva (Robertson, 1892) species Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Protandrena parva https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Andrenidae | https://www.discoverlife.org/mp/20q?search=Protandrena+parva kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Protandrena+parva

and

echo -e "\tPseudopanurgus (Heterosarus) parvus"\
 | nomer append --include-header discoverlife\
 | mlr --itsv --omd cat
providedExternalId providedName relationName resolvedExternalId resolvedName resolvedAuthorship resolvedRank resolvedCommonNames resolvedPath resolvedPathIds resolvedPathNames resolvedPathAuthorships resolvedExternalUrl
Pseudopanurgus (Heterosarus) parvus SYNONYM_OF https://www.discoverlife.org/mp/20q?search=Protandrena+parva Protandrena parva (Robertson, 1892) species Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Protandrena parva https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Andrenidae | https://www.discoverlife.org/mp/20q?search=Protandrena+parva kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Protandrena+parva
jhpoelen commented 1 year ago

In addition to Pseudopanurgus parvus, the names Heterosarus helianthi and Heterosarus bakeri now also match their accepted names.

Not so for Lasioglossum nymphaerum though. Working on that next.

jhpoelen commented 1 year ago

I was able to find

Lasioglossum nymphale (Smith, 1853)

But not Lasioglossum nymphaerum

Also, no hits for nymphaerum .

Which is consistent with @jtmiller28's observation that

Searching https://www.discoverlife.org/ with Lasioglossum nymphaerum yields just the Lasioglossum genus. Backtracking from their final decision made name we can see however that Lasioglossum nymphaerum is synonymous with Lasioglossum oceanicum. A similar scenario to this seems present for Andrena californica, though note the one that is actually searchable on DL is Andrena californica wickhami.

@jtmiller28 Can you please provide some evidence to suggest that Lasioglossum nymphaerum is documented somewhere in DiscoverLife? If not, is it possible that they have yet to add the name to the checklist?

(see screenshots below)

image

Screenshot from 2023-03-11 20-58-28

jhpoelen commented 1 year ago

It appears that https://www.discoverlife.org/mp/20q?search=Lasioglossum+oceanicum contains

Lasioglossum (Dialictus) nymphaearum (Robertson, 1895),

But not Lasioglossum nymphaerum

So now the question is - is this a typo, and if so, who made the typo?

image

jhpoelen commented 1 year ago

I've just released v0.4.11 with the aspiring fix. Please verify.

jhpoelen commented 1 year ago

By the way, after re-running the name alignment for https://github.com/jhpoelen/chesshires/actions/runs/4395278727 with v0.4.11 , the following result is found:

| providedName | alignRelation | alignedCatalogName | alignedExternalId | alignedName |
| --- | --- | --- | --- | --- |
| Heterosarus bakeri | SYNONYM_OF | discoverlife | https://www.discoverlife.org/mp/20q?search=Protandrena+bakeri | Protandrena bakeri |
| Heterosarus bakeri | SYNONYM_OF | itis | http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=753406 | Pseudopanurgus bakeri |
| Heterosarus bakeri | SYNONYM_OF | ncbi | https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=625948 | Pseudopanurgus bakeri |
| Heterosarus helianthi | SYNONYM_OF | discoverlife | https://www.discoverlife.org/mp/20q?search=Protandrena+helianthi | Protandrena helianthi |
| Heterosarus helianthi | SYNONYM_OF | itis | http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=753448 | Pseudopanurgus helianthi |
| Heterosarus helianthi | NONE | ncbi | | Heterosarus helianthi |
| Lasioglossum nymphaerum | NONE | discoverlife | | Lasioglossum nymphaerum |
| Lasioglossum nymphaerum | NONE | itis | | Lasioglossum nymphaerum |
| Lasioglossum nymphaerum | NONE | ncbi | | Lasioglossum nymphaerum |
| Pseudopanurgus (Heterosarus) parvus | SYNONYM_OF | discoverlife | https://www.discoverlife.org/mp/20q?search=Protandrena+parva | Protandrena parva |
| Pseudopanurgus (Heterosarus) parvus | HAS_ACCEPTED_NAME | itis | http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=753479 | Pseudopanurgus parvus |
| Pseudopanurgus (Heterosarus) parvus | NONE | ncbi | | Pseudopanurgus parvus |
| Pseudopanurgus parvus | SYNONYM_OF | discoverlife | https://www.discoverlife.org/mp/20q?search=Protandrena+parva | Protandrena parva |
| Pseudopanurgus parvus | HAS_ACCEPTED_NAME | itis | http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=753479 | Pseudopanurgus parvus |
| Pseudopanurgus parvus | NONE | ncbi | | Pseudopanurgus parvus |

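For larger name lists, the remaining non-matches per catalog can be tallied from a report like the one above. A rough sketch, assuming the report is saved as a tab-separated file with a header row and the column order shown above; the helper name none_by_catalog is hypothetical.

```shell
# Hypothetical helper: count NONE alignments per catalog in a tab-separated
# alignment report (column 2 = alignRelation, column 3 = alignedCatalogName).
# Usage: none_by_catalog report.tsv
none_by_catalog() {
  awk -F '\t' '
    NR > 1 && $2 == "NONE" { n[$3]++ }          # skip header, tally by catalog
    END { for (c in n) print c "\t" n[c] }
  ' "$1" | sort
}
```

Comparing these counts before and after an upgrade gives a quick check on whether a release actually reduced the number of unmatched names.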
jtmiller28 commented 1 year ago

Agreed, identifying URLs would be helpful. I have a problem with this, however: I can't seem to produce correct links when searching discoverlife. The URL always maps to https://www.discoverlife.org/mp/20q regardless of the page I'm on within the site. Is there a trick to this? I've tried it in both Firefox and Chrome. I will continue the issue thread with the 41 remaining mapping issues after the nomer 0.4.11 update (the update fixed >350 names!)

jhpoelen commented 1 year ago

@jtmiller28 thanks for your reply. Can you please provide explicit steps with explicit examples to reproduce the issue you describe in:

I cant seem to produce correct linkages when searching discoverlife, its always mapping to https://www.discoverlife.org/mp/20q regardless of the page im on within the site.

jtmiller28 commented 1 year ago

Sure thing. When trying to navigate DiscoverLife's site, I start on their landing page: https://www.discoverlife.org/

I then enter the name in question, e.g. Pseudopanurgus parvus:

image

Searching this name brings me to the page showing the species information; however, the URL found in the top search bar is not a link that I can copy to help others navigate to that page.

image

https://www.discoverlife.org/mp/20q <- is the link. Using that as your URL will bring you to the following page:

image

This is an uninformative page with respect to the actual address of what I was trying to share. The example was done in the Firefox browser.

jhpoelen commented 1 year ago

@jtmiller28 thanks for your specific example. I think I understand your desire and reported a separate issue at https://github.com/globalbioticinteractions/nomer/issues/150 . Can you please check whether the issue title makes sense?

Aside from this important navigation / page reference issue, please let me know if there are additional things that need attention as far as this issue (i.e., https://github.com/globalbioticinteractions/nomer/issues/149) goes. If not, please let me know and/or close this issue.

jtmiller28 commented 1 year ago

Yep that issue pretty much covers it.

Back to #149. Here are some other cases I found where they aligned names that Nomer did not. nomer version yields 0.4.11. A txt file of the names that failed to align, if of interest: nomer-v0.4.11-nonmatches.txt

  1. There are names that are unsearchable, but that have homonym status according to Discover Life if you look at Chesshire's final resolution data. Ex.

echo -e "\tAndrena illinoensis bicolor" | nomer append discoverlife

yields:

    Andrena illinoensis bicolor  NONE        Andrena illinoensis bicolor

Searching for the name manually on discoverlife brings you to the following page:

image

which doesn't yield a sufficient path for resolution. Manually going to the name that Chesshire notes as the final resolution, "Andrena nigrae", however, does show a suspected homonym of Andrena illinoensis bicolor.

image

This notably has some oddness in the linking: Andrena illinoensis form bicolor_homonym Robertson, 1898. Not sure how homonyms are dealt with in Nomer/DL indexing, but possibly "form" is messing with it?

  2. Some names are still accepted names according to DL, however Nomer is not indexing them. Ex.

echo -e "\tPseudopanurgus fraterculus" | nomer append discoverlife

yields:

    Pseudopanurgus fraterculus  NONE        Pseudopanurgus fraterculus

Not sure what to note here, but the page is rather sparse, so maybe something that Nomer needs for indexing is missing here?

image

  3. var. vs var causes failure in alignment (probably something to note, rather than change). I believe this may have come up in a previous issue, but basically when verbatim names have punctuation or unrecognized abbreviations it will trip alignment (subsp -> spp.). Ex.

echo -e "\tProsopis georgica var. leana" | nomer append discoverlife

yields:

    Prosopis georgica var. leana  NONE        Prosopis georgica var. leana

however,

echo -e "\tProsopis georgica var leana" | nomer append discoverlife

yields:

Prosopis georgica var leana SYNONYM_OF https://www.discoverlife.org/mp/20q?search=Hylaeus+georgicus Hylaeus georgicus (Cockerell, 1896) species Animalia | Arthropoda | Insecta | Hymenoptera | Colletidae | Hylaeus georgicus https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Colletidae | https://www.discoverlife.org/mp/20q?search=Hylaeus+georgicus kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Hylaeus+georgicus

  4. An infraspecific epithet in combination without the subgenus may cause issues in resolution. Ex.

echo -e "\tMelissodes atripes atrimitra" | nomer append discoverlife

yields:

    Melissodes atripes atrimitra  NONE        Melissodes atripes atrimitra

Looking at Chesshire's final resolution, Svastra atripes, we note that this name is a synonym but also has a subgenus in combination with the infraspecificEpithet:

image

echo -e "\tMelissodes (Epimelissodes) atripes atrimitra" | nomer append discoverlife

yields:

Melissodes (Epimelissodes) atripes atrimitra SYNONYM_OF https://www.discoverlife.org/mp/20q?search=Svastra+atripes Svastra atripes (Cresson, 1872) species Animalia | Arthropoda | Insecta | Hymenoptera | Apidae | Svastra atripes https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Apidae | https://www.discoverlife.org/mp/20q?search=Svastra+atripes kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Svastra+atripes

Andrena californica wickhami is another case of this.

  5. Somewhat related to (4): lacking both the infraspecific epithet and the subgenus will trip resolution. Unsure if this is intended, as some names may require the infraspecificEpithet to correctly arrive at a designation.

echo -e "\tPerdita texana" | nomer append discoverlife

yields:

    Perdita texana  NONE        Perdita texana

As noted in Chesshire:

image

However, just including the infraspecificEpithet is insufficient, presumably due to (4).

echo -e "\tPerdita texana ablusa" | nomer append discoverlife

yields:

    Perdita texana ablusa  NONE        Perdita texana ablusa

Finally,

echo -e "\tPerdita (Macrotera) texana ablusa" | nomer append discoverlife

yields:

Perdita (Macrotera) texana ablusa SYNONYM_OF https://www.discoverlife.org/mp/20q?search=Macrotera+texana Macrotera texana Cresson, 1878 species Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Macrotera texana https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Andrenidae | https://www.discoverlife.org/mp/20q?search=Macrotera+texana kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Macrotera+texana

  6. Related to both 4 & 5, some names lack a subgenus but need infraspecific epithets to be resolved. Ex.

echo -e "\tNeolarra congregata" | nomer append discoverlife

yields:

    Neolarra congregata  NONE        Neolarra congregata

The DL website search shows that the infraspecific epithet is necessary for alignment, but the name notably lacks a subgenus designation:

image

echo -e "\tNeolarra congregata helianthi" | nomer append discoverlife

yields:

Neolarra congregata helianthi SYNONYM_OF https://www.discoverlife.org/mp/20q?search=Neolarra+verbesinae Neolarra verbesinae (Cockerell, 1895) species Animalia | Arthropoda | Insecta | Hymenoptera | Apidae | Neolarra verbesinae https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Apidae | https://www.discoverlife.org/mp/20q?search=Neolarra+verbesinae kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Neolarra+verbesinae

Those are the cases I've found overall at the moment. Notably, 4, 5 and 6 seem to be related issues. It is hard to say if there is a "good" resolution for them: in my research experience, occurrence data is hit or miss on the inclusion of infraspecific epithets and subgenera. Technically, I believe they are not supposed to be included at all by taxonomic standards, but they have been worked in based on expert taxonomist decisions (a rather subjective area that, in my opinion, will lead to endless issues if we pursue it). This poses potential problems for nomer in my workflow, since the parsing steps used when aligning names remove the subgenus (). Infraspecific epithets are maintained, at least for the first round in my hierarchical use of Nomer alignment; however, the missing subgenus causes resolution to fail in 4 & 5.
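As a stopgap for case 3 on the user side, the rank-marker punctuation can be normalized before names reach nomer. A minimal sketch, assuming only the "var." form observed above; the helper name strip_rank_dot is hypothetical, and other markers (e.g. subsp.) are left untouched because their behavior was not tested in this thread.

```shell
# Hypothetical helper: rewrite "var." as "var" so that a name such as
# "Prosopis georgica var. leana" takes the form that matched above.
# Only the "var." marker is handled; everything else passes through.
strip_rank_dot() {
  sed -E 's/ var\. / var /'
}
# e.g. printf '\tProsopis georgica var. leana\n' | strip_rank_dot | nomer append discoverlife
```

Cases 4 to 6 cannot be handled by this kind of string cleanup, since a missing subgenus or infraspecific epithet has to come from the checklist itself.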

jhpoelen commented 1 year ago

@jtmiller28 thanks for preparing the list of examples related to the discoverlife name matches. As far as I can tell, the name mismatches stem from how taxonomic name structures are interpreted when parsing the discoverlife name lists. I wish there was a way you could help tweak these parsing rules and tune them appropriately. This way, you wouldn't have to wait for folks like me to help nail down these important details.

How do you propose to succeed?

jhpoelen commented 6 months ago

@jtmiller28 please feel free to comment on / re-open this issue related to Nomer's support for DiscoverLife. Note that the upcoming release is going to have some improvements, such as #161 and #167.