globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0

Hanging var names in discoverlife names #52

Closed seltmann closed 2 years ago

seltmann commented 2 years ago

A review of the output of nomer dump discoverlife shows that some of the accepted names need further data cleaning. A few names end in a trailing var, which indicates that the variety name after the var has been dropped.

For example:

Andrena (Lepidandrena) firuzaensis var should be changed to Andrena (Lepidandrena) firuzaensis var atra

Andrena (Trachandrena) tacitula var should be changed to Andrena (Trachandrena) tacitula var grossulariae

jhpoelen commented 2 years ago

I was able to reproduce the issue using

$ nomer dump discoverlife | grep "Andrena (Lepidandrena) firuzaensis var"
using matcher [discoverlife-taxon]
DiscoverLife name indexing started...
[50590] DiscoverLife names were indexed in 19s (@ 2662 names/s)
https://www.discoverlife.org/mp/20q?search=Andrena+(Lepidandrena)+firuzaensis+var   Andrena (Lepidandrena) firuzaensis var  SYNONYM_OF  https://www.discoverlife.org/mp/20q?search=Andrena+firuzaensis  Andrena firuzaensis species     Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Andrena firuzaensis    https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Andrenidae | https://www.discoverlife.org/mp/20q?search=Andrena+firuzaensis  kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Andrena+firuzaensis  

and found the related name record in the discover life bees source:

 <tr bgcolor="#f0f0f0">
            <td>
                 
              <i>
                <a href="/mp/20q?search=Andrena+firuzaensis" target="_self">
                  Andrena firuzaensis
                </a>
              </i>
              <font size="-1" face="sans-serif">
                Popov, 1940
              </font>
               -- 
              <i>
                Andrena (Lepidandrena) firuzaensis 
              </i>
              Popov, 1940; 
              <i>
                Andrena (Lepidandrena) firuzaensis var 
              </i>
              atra_homonym4 Popov, 1940; 
              <i>
                Andrena popovella 
              </i>
              Gusenleitner and Schwarz, 2002, replacement name
            </td>
          </tr>
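
The fragment above suggests how the variety epithet gets lost: the italic element ends right after "var", and the epithet sits in the plain text that follows. A minimal Python sketch of that situation, assuming a parser that treats only italic text as the name; `ItalicCollector` is a hypothetical helper for illustration and not nomer's actual parser:

```python
from html.parser import HTMLParser


class ItalicCollector(HTMLParser):
    """Collect [italic text, text that follows it] pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.in_italic = False
        self.pairs = []  # each entry: [italic_text, trailing_text]

    def handle_starttag(self, tag, attrs):
        if tag == "i":
            self.in_italic = True
            self.pairs.append(["", ""])

    def handle_endtag(self, tag):
        if tag == "i":
            self.in_italic = False

    def handle_data(self, data):
        if not self.pairs:
            return  # text before the first <i> is ignored in this sketch
        index = 0 if self.in_italic else 1
        self.pairs[-1][index] += data


# Simplified copy of the relevant part of the record shown above.
fragment = """
<i>Andrena (Lepidandrena) firuzaensis var </i>
atra_homonym4 Popov, 1940;
<i>Andrena popovella </i>
Gusenleitner and Schwarz, 2002, replacement name
"""

collector = ItalicCollector()
collector.feed(fragment)
collector.close()
pairs = [(name.strip(), rest.strip()) for name, rest in collector.pairs]
for name, rest in pairs:
    print(repr(name), "|", repr(rest))
```

In this sketch the first italic name ends in "var", and the epithet "atra" only appears at the start of the trailing text, fused with the "_homonym4" marker and the authorship.
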
jhpoelen commented 2 years ago

Looks like there are a little over 200 names with dangling vars:

$ nomer dump discoverlife | grep -P " var\t" | wc -l
using matcher [discoverlife-taxon]
214
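
The count above relies on the name column ending in " var" immediately before a tab. A small Python sketch of the same check, assuming tab-separated rows with the name in the second column; the rows and the example.org URLs are made up for illustration:

```python
import re

# Hypothetical sample rows in a tab-separated layout: id, name, relation.
rows = [
    "https://example.org/a\tAndrena (Lepidandrena) firuzaensis var\tSYNONYM_OF",
    "https://example.org/b\tAndrena firuzaensis\tNONE",
    "https://example.org/c\tCeratina speculifrons var A\tSYNONYM_OF",
]

# Same idea as grep -P " var\t": a name that ends in " var" right before a tab.
DANGLING_VAR = re.compile(r" var\t")

dangling = [row for row in rows if DANGLING_VAR.search(row)]
print(len(dangling))
```

Only the first row is counted here; "var A" is followed by a space rather than a tab, so a pattern like this one does not flag it.
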
jhpoelen commented 2 years ago

Here's a list of those chopped names:

$ cat /home/jorrit/proj/globi/nomer/nomer-taxon-resolver/src/main/resources/org/globalbioticinteractions/nomer/match/discoverlife/bees.xml.gz | gunzip | grep -P "var [^a-z]" | sort | uniq
                Ammobates (Ammobates) lativalvis var 
                Ammobates (Euphileremus) handlirschi var 
                Andrena (Andrena) nasonii var 
                Andrena (Lepidandrena) firuzaensis var 
                Andrena (Parandrena) andrenoides var 
                Andrena (Ptilandrena) supervirens var 
                Andrena (Scrapter) imitatrix var 
                Andrena (Trachandrena) tacitula var 
                Anthophora (Micranthophora) curta var 
                Augochlora (Augochloropsis) vesta var 
                Bombus (Agrobombus) helferanus var 
                Bombus (Agrobombus) muscorum var 
                Bombus (Alpigenobombus) tetrachromus var 
                Bombus (Chromobombus) muscorum var 
                Bombus (Chromobombus) variabilis_homonym var 
                Bombus (Diversobombus) wilemani var 
                Bombus (Hortobombus) consobrinus var 
                Bombus (Hortobombus) mimeticus var 
                Bombus (Lapidariobombus) oculatus var 
                Bombus (Lapidariobombus) rufofasciatus var 
                Bombus (Lapidariobombus) sicheli var 
                Bombus (Lapidariobombus) tenellus var 
                Bombus (Leucobombus) terrestris var 
                Bombus (Melanobombus) confusus var 
                Bombus (Orientalibombus) orientalis var 
                Bombus (Pratobombus) atrocinctus var 
                Bombus (Pratobombus) biroi var 
                Bombus (Pratobombus) hypnorum var 
                Bombus (Pratobombus) impatiens var 
                Bombus (Pratobombus) parthenius var 
                Bombus (Pratombombus) mearnsi var 
                Bombus (Rhodobombus) helleri var 
                Bombus (Rufipedibombus) eximius var 
                Bombus (Senexibombus) bicoloratus var 
                Bombus (Subterraneobombus) fragans var 
                Bombus (Terrestribombus) lucorum var 
                Bombus (Terrestribombus) terrestris var 
                Bremus (Alpigenobombus) dentatus var 
                Bremus (Alpigenobombus) grahami var 
                Bremus (Bremus) ignitus var 
                Bremus (Lapidariobombus) formosellus var 
                Bremus (Pratobombus) mearnsi var 
                Bremus (Rufipedibombus) rufipes var 
                Bremus (Senexibombus) senex var 
                Bremus (Sibiricobombus) oculatus var 
                Centris (Epicharis) conica var 
                Centris (Epicharis) dejeani var 
                Centris (Epicharis) maculata var 
                Centris (Epicharis) rustica var 
                Centris (Epicharis) umbraculata var 
                Centris (Hemisia) nitens var 
                Centris (Melanocentris) furcata var 
                Centris (Melanocentris) obsoleta var 
                Centris (Melanocentris) petreae var 
                Centris (Ptilotopus) denudans var 
                Ceratina (Ceratinidia) hieroglyphica var 
                Ceratina (Ceratinidia) lepida var 
                Ceratina speculifrons var 
                Chalicodoma (Chalicodoma) lefebvrei var 
                Euaspis (Parevaspis) basalis var 
                Euglossa (Eufriesea) magrettii var 
                Euglossa (Euglossa) cordata var 
                Euglossa (Euglossa) variabilis var 
                Euglossa (Eulaema) nigrita var 
                Euglossa (Eulema) dimidiata var 
                Euglossa (Eulema) mexicana var 
                Euglossa (Eulema) nigrifacies var 
                Euglossa (Eumorpha) combinata var 
                Euglossa (Eumorpha) magrettii var 
                Euglossa (Eumorpha) mariana var 
                Exomalopsis (Anthophorula) compactula var 
                Halictus (Chloralictus) pilosus var 
                Halictus (Corynura) corynogaster var 
                Halictus (Evylaeus) arcuatus var 
                Halictus (Thrichostoma) sjoestedti var 
                Hylaeus (Deranchylaeus) tenuis var 
                Megachile (Argyropile) parallela var 
                Megachile (Chalicodoma) lefebvrei var 
                Megachile (Chalicodoma) manicata var 
                Megachile (Chalicodoma) monstrifica var 
                Megachile (Chalicodoma) muraria var 
                Megachile (Chalicodoma) pyrenaica var 
                Megachile (Chelostomoides) exilis var 
                Megachile (Delomegachile) gemula var 
                Megachile (Delomegachile) melanophaea var 
                Megachile (Delomegachile) vidua var 
                Megachile (Eumegachile) bilobata var 
                Megachile (Eumegachile) sculpturalis var 
                Megachile (Litomegachile) brevis var 
                Megachile (Pseudocentron) pruina var 
                Megachile (Sayapis) frugalis var 
                Melissa (Epiclopus) gayi var 
                Melitoma (Ancyloscelis) chilensis var 
                Nomada (Holonomada) edwardsii var 
                Nomada (Micronomada) modesta var 
                Nomada (Nomadula) rhodosoma var 
                Nomada (Xanthidium) crotchii var 
                Nomada (Xanthidium) vallesina var 
                Nomia (Crocisaspidia) postscutellaris var 
                Nomia (Epinomia) bakeri var 
                Nomia (Hoplonomia) pulchribalteata var 
                Osmia (Melanosmia) nigrifrons var 
                Paratrigona (Paratrigona) ornaticeps var 
                Perdita (Perdita) eriastri var 
                Perdita (Perdita) macswaini var 
                Perdita (Pygoperdita) malacothricis var 
                Psaenythia (Psaenythia) bizonata var 
                Psaenythia (Psaenythia) rubripes var 
                Psithyrus (Allopsithyrus) barbutellus var 
                Psithyrus (Allopsithyrus) maxillosus var 
                Psithyrus (Ashtonipsithyrus) distinctus var 
                Psithyrus (Ashtonipsithyrus) vestalis var 
                Psithyrus (Fernaldaepsithyrus) flavidus var 
                Psithyrus (Fernaldaepsithyrus) norvegicus var 
                Psithyrus (Fernaldaepsithyrus) quadricolor var 
                Psithyrus (Fernaldaepsithyrus) sylvestris var 
                Psithyrus (Metapsithyrus) campestris var 
                Psithyrus (Metapsithyrus) pieli var 
                Psithyrus (Psithyrus) acutisquameus var 
                Sphecodes hispanicus subvar 
                Stenotritus elegans var 
                Trigona (Cephalotrigona) capitata var 
                Trigona (Geotrigona) acapulconis var 
                Trigona (Geotrigona) leucogastra var 
                Trigona (Hypotrigona) pendleburyi var 
                Trigona (Lepidotrigona) nitidiventris var 
                Trigona (Lepidotrigona) terminata var 
                Trigona (Lepidotrigona) ventralis var 
                Trigona (Lestrimelitta) limao var 
                Trigona (Nannotrigona) postica var 
                Trigona (Nannotrigona) testaceicornis var 
                Trigona (Oxytrigona) tataira var 
                Trigona (Parapartamona) zonata var 
                Trigona (Paratrigona) lineata var 
                Trigona (Paratrigona) opaca var 
                Trigona (Patera) testacea var 
                Trigona (Scaptotrigona) mexicana var 
                Trigona (Scaptotrigona) pectoralis var 
                Trigona (Tetragona) buchwaldi var 
                Trigona (Tetragona) dorsalis var 
                Trigona (Tetragona) fimbriata var 
                Trigona (Tetragona) fusco-balteata var 
                Trigona (Tetragona) fuscobalteata var 
                Trigona (Tetragona) heideri var 
                Trigona (Tetragona) jaty var 
                Trigona (Tetragona) nigra var 
                Trigona (Tetragona) sarawakensis var 
                Trigona (Tetragona) subgrisea var 
                Trigona (Trigona) dimidiata var 
                Trigona (Trigona) hypogea var 
                Trigona (Trigona) pallida var 
                Xylocopa (Afroxylocopa) caffra var 
                Xylocopa (Afroxylocopa) nigrita var 
                Xylocopa (Afroxylocopa) scioensis var 
                Xylocopa (Koptorthosoma) caerulea var 
                Xylocopa (Koptorthosoma) caeruleiformis var 
                Xylocopa (Koptortosoma) flavicollis var 
                Xylocopa (Xylocopa) rufipes var 
jhpoelen commented 2 years ago

@seltmann the root cause of the var name chopping appears to be a syntax error on the Discover Life side.

I've implemented a workaround; ideally, the authors of Discover Life would correct the entries in which the var names are lumped together with the authorship string.

Note that the example below involves a _homonym suffix in addition to a chopped var name:

$ nomer dump discoverlife | grep "Andrena (Lepidandrena) firuzaensis var"
using matcher [discoverlife-taxon]
DiscoverLife name indexing started...
[50590] DiscoverLife names were indexed in 19s (@ 2662 names/s)
https://www.discoverlife.org/mp/20q?search=Andrena+(Lepidandrena)+firuzaensis+var+atra  Andrena (Lepidandrena) firuzaensis var atra NONE    https://www.discoverlife.org/mp/20q?search=Andrena+firuzaensis  Andrena firuzaensis species     Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Andrena firuzaensis    https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Andrenidae | https://www.discoverlife.org/mp/20q?search=Andrena+firuzaensis  kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Andrena+firuzaensis  
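
One way such a workaround could look, sketched in Python under the assumption that the italic name and the text following it are available separately. The function `rejoin_var` and the suffix pattern are hypothetical and not taken from nomer's code:

```python
import re

# Hypothetical pattern for markers such as "_homonym" or "_homonym4".
HOMONYM_SUFFIX = re.compile(r"_homonym\d*$")


def rejoin_var(italic_name: str, trailing_text: str) -> str:
    """Append the first token of the trailing text to a name ending in ' var'."""
    name = italic_name.strip()
    if not name.endswith(" var"):
        return name  # nothing dangling, leave the name alone
    tokens = trailing_text.split()
    if not tokens:
        return name  # no candidate epithet available
    epithet = HOMONYM_SUFFIX.sub("", tokens[0])
    return f"{name} {epithet}"


repaired = rejoin_var(
    "Andrena (Lepidandrena) firuzaensis var ",
    "atra_homonym4 Popov, 1940;",
)
print(repaired)
```

The sketch takes the first whitespace-separated token as the epithet, which would need extra care for entries where that token is not a variety name.
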
jhpoelen commented 2 years ago

The entry for Megachile lefebvrei (https://www.discoverlife.org/mp/20q?search=Megachile+lefebvrei) is another example.

One var name appears to be entered correctly (i.e., Megachile lefeburei var albomaculata Friese, 1898). However, the entry also includes two suspicious var names: Megachile (Chalicodoma) lefebvrei var albida Pérez, 1897 and Megachile (Chalicodoma) muraria var variabilis Friese, 1920.

You can see the issue visually: the erroneous names are not fully italicized (the var part is not italic).

See attached screenshot.

Screenshot from 2021-11-01 15-45-00

jhpoelen commented 2 years ago

Another exception to the exception was found, in which a comma was added between the var and the dangling var name.

Example:

Ceratina laevifrons var , moricei Friese, 1899

See https://www.discoverlife.org/mp/20q?search=Ceratina+moricei and attached screenshot

Screenshot from 2021-11-01 15-58-25
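
A stray comma like the one in this example could be normalized before any further parsing. A minimal Python sketch; `drop_comma_after_var` is a hypothetical helper and not part of nomer:

```python
import re

# Matches "var" as a whole word followed by optional spaces and a comma.
STRAY_COMMA_AFTER_VAR = re.compile(r"\bvar\s*,\s*")


def drop_comma_after_var(name: str) -> str:
    """Remove a stray comma between 'var' and the variety epithet."""
    return STRAY_COMMA_AFTER_VAR.sub("var ", name)


cleaned = drop_comma_after_var("Ceratina laevifrons var , moricei Friese, 1899")
print(cleaned)
```

The comma inside the authorship ("Friese, 1899") is left untouched because the pattern only matches a comma that directly follows the word var.
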

jhpoelen commented 2 years ago

Also, please note that there's a dangling var with the name A.

Stenotritus elegans var A Cockerell, 1914

See https://www.discoverlife.org/mp/20q?search=Stenotritus+elegans and attached screenshot.

Screenshot from 2021-11-01 16-00-37

jhpoelen commented 2 years ago

After implementation of workarounds, the following two names remain, both of which are var A names.

$ nomer list discoverlife | grep -P "var [^a-z]" | sort | uniq 
using matcher [discoverlife-taxon]
DiscoverLife name indexing started...
[50590] DiscoverLife names were indexed in 19s (@ 2662 names/s)
https://www.discoverlife.org/mp/20q?search=Ceratina+speculifrons+var+A  Ceratina speculifrons var A SYNONYM_OF  https://www.discoverlife.org/mp/20q?search=Ceratina+speculifrons    Ceratina speculifrons   species     Animalia | Arthropoda | Insecta | Hymenoptera | Apidae | Ceratina speculifrons  https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Apidae | https://www.discoverlife.org/mp/20q?search=Ceratina+speculifrons    kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Ceratina+speculifrons    
https://www.discoverlife.org/mp/20q?search=Stenotritus+elegans+var+A    Stenotritus elegans var A   SYNONYM_OF  https://www.discoverlife.org/mp/20q?search=Stenotritus+elegans  Stenotritus elegans species     Animalia | Arthropoda | Insecta | Hymenoptera | Stenotritidae | Stenotritus elegans https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Stenotritidae | https://www.discoverlife.org/mp/20q?search=Stenotritus+elegans   kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Stenotritus+elegans  
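
The check above flags any "var " that is followed by something other than a lowercase letter, which is why the two "var A" names still show up. A short Python sketch of the same pattern applied to a few names from this thread:

```python
import re

# Same pattern as in the grep: "var " followed by a character that is not a-z.
SUSPICIOUS_VAR = re.compile(r"var [^a-z]")

names = [
    "Andrena (Lepidandrena) firuzaensis var atra",
    "Ceratina speculifrons var A",
    "Stenotritus elegans var A",
]

remaining = [name for name in names if SUSPICIOUS_VAR.search(name)]
print(remaining)
```

The repaired "var atra" name passes the check, while both "var A" names remain flagged.
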
seltmann commented 2 years ago

@jhpoelen the var is typically not italicized when writing scientific names, so a non-italic var by itself would not be seen as an error. So Megachile (Chalicodoma) muraria var variabilis Friese, 1920 with var in italics and variabilis in roman is incorrect; it should be written with Megachile (Chalicodoma) muraria and variabilis in italics and var in roman.

I will ask the authors to look into Stenotritus elegans var A and Ceratina speculifrons var A, to correct Ceratina laevifrons var , moricei Friese, 1899, and to be consistent regarding the italics of var.

jhpoelen commented 2 years ago

@seltmann thanks for sharing your thoughts on taxonomic "var" names formatting.

Because most of the var parsing issues have been addressed, I'll close this issue and open two new, narrower ones describing the remaining external data issues:

  1. consistent italics of var names https://github.com/globalbioticinteractions/nomer/issues/55
  2. correct Stenotritus elegans var A and Ceratina speculifrons var A https://github.com/globalbioticinteractions/nomer/issues/56

Please feel free to re-open this issue if you'd like to proceed in some other way.