globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0

Add DiscoverLife Bees to Nomer #42

Closed seltmann closed 2 years ago

seltmann commented 3 years ago

@jhpoelen Can we add the Discover Life bee checklist to Nomer? https://www.discoverlife.org/mp/20q?act=x_checklist&guide=Apoidea_species&flags=HAS:

Ascher, J. S. and J. Pickering. 2020. Discover Life bee species guide and world checklist (Hymenoptera: Apoidea: Anthophila). http://www.discoverlife.org/mp/20q?guide=Apoidea_species

jhpoelen commented 3 years ago

@seltmann I've implemented a first pass at DiscoverLife taxon matching in Nomer:

$ echo -e "\tApis mellifera" | nomer append discoverlife-taxon
using matcher [discoverlife-taxon]
lazy init of taxonomy index [DiscoverLifeService] started...
index directory at [/tmp/taxon1632771108502] created.
lazy init of taxonomy index [DiscoverLifeService] done.
    Apis mellifera  SAME_AS https://www.discoverlife.org/mp/20q?search=Apis+mellifera   Apis mellifera  species     Animalia | Arthropoda | Insecta | Hymenoptera | Apidae | Apis mellifera https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Apidae | https://www.discoverlife.org/mp/20q?search=Apis+mellifera   kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Apis+mellifera   

Note that name variations are not yet included.
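
The identifiers in the output above all share one shape: the Discover Life search URL followed by the name, with spaces encoded as `+`. Here is a minimal sketch of that pattern. It assumes the identifiers are plain URL-encoded names, which this thread does not confirm, and `discoverlife_search_url` is a hypothetical helper name, not part of Nomer.

```python
from urllib.parse import quote_plus

# Prefix copied from the matcher output shown in this thread.
SEARCH_PREFIX = "https://www.discoverlife.org/mp/20q?search="


def discoverlife_search_url(name):
    """Build a search URL shaped like the identifiers in the output (hypothetical helper)."""
    return SEARCH_PREFIX + quote_plus(name.strip())


print(discoverlife_search_url("Apis mellifera"))
```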

Can you help by showing some examples of synonyms etc. in the Discover Life notation? I am probably not as familiar with their notation as you are.

BTW - at first glance, the Discover Life list contains about 20k names. Does that sound about right?

jhpoelen commented 3 years ago

A quick performance test shows that the (un-optimized) Discover Life taxon matcher can match about 3k names/s or 3 names / ms.

seltmann commented 2 years ago

@jhpoelen 20K names is right. That is about the number of worldwide bee species.

Acamptopoeum vagans (Cockerell, 1926) -- Camptopoeum (Acamptopoeum) vagans Cockerell, 1926

Current valid name - Acamptopoeum vagans (Cockerell, 1926)
Scientific Name - Acamptopoeum vagans
Authorship - (Cockerell, 1926)
Other name - Camptopoeum (Acamptopoeum) vagans Cockerell, 1926

Other names are not defined by Discover Life except on a few occasions, but most are prior names for the current valid taxon.

seltmann commented 2 years ago

Andrena accepta Viereck, 1916 -- Andrena pulchella_homonym Robertson, 1891; Pterandrena pulchella (Robertson, 1891); Andrena accepta Viereck, 1916, replacement name

Current valid name - Andrena accepta Viereck, 1916
Scientific Name - Andrena accepta
Authorship - Viereck, 1916
Other names - Andrena pulchella Robertson, 1891; Pterandrena pulchella (Robertson, 1891); Andrena accepta Viereck, 1916

Andrena pulchella_homonym Robertson, 1891 - indicates that Andrena pulchella is a homonym
Andrena accepta Viereck, 1916 - not entirely sure what "replacement name" indicates, but it is some kind of synonym
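
The manual breakdown above can be sketched in code. The example below splits one plain-text checklist line into the current valid name and its other names, recording the `_homonym` marker and the trailing `replacement name` remark as notes. It is an illustration only: `parse_checklist_line` is a hypothetical helper, and the assumption that ` -- ` and `;` are the only separators comes from the two examples in this thread, not from a Discover Life specification.

```python
REPLACEMENT_SUFFIX = ", replacement name"


def parse_checklist_line(line):
    """Split 'valid name -- other; other' into the valid name and annotated other names."""
    accepted, _, others = line.partition(" -- ")
    records = []
    for raw in others.split(";"):
        raw = raw.strip()
        if not raw:
            continue
        notes = []
        # '_homonym' is glued to the epithet in the Discover Life notation.
        if "_homonym" in raw:
            raw = raw.replace("_homonym", "")
            notes.append("homonym")
        # A trailing remark follows the authorship after a comma.
        if raw.endswith(REPLACEMENT_SUFFIX):
            raw = raw[: -len(REPLACEMENT_SUFFIX)]
            notes.append("replacement name")
        records.append({"name": raw, "notes": notes})
    return {"accepted": accepted.strip(), "others": records}


line = (
    "Andrena accepta Viereck, 1916 -- Andrena pulchella_homonym Robertson, 1891; "
    "Pterandrena pulchella (Robertson, 1891); "
    "Andrena accepta Viereck, 1916, replacement name"
)
for record in parse_checklist_line(line)["others"]:
    print(record)
```

Keeping the markers as notes, rather than discarding them, leaves room to map them to relation types later once their meaning is settled.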

jhpoelen commented 2 years ago

@seltmann thanks for providing your manually parsed examples.

here's the markup associated with the examples:

<td>
&nbsp;&nbsp;&nbsp;
<i><a href="/mp/20q?search=Acamptopoeum+vagans" target="_self">Acamptopoeum vagans</a></i> 
<font size="-1" face="sans-serif">(Cockerell, 1926)</font> 
-- 
<i>Camptopoeum (Acamptopoeum) vagans </i>Cockerell, 1926
</td>

and

<td>&nbsp;&nbsp;&nbsp;
<i><a href="/mp/20q?search=Andrena+accepta" target="_self">Andrena accepta</a></i> 
<font size="-1" face="sans-serif">Viereck, 1916</font> 
-- 
<i>Andrena pulchella_homonym </i>Robertson, 1891; 
<i>Pterandrena pulchella </i>(Robertson, 1891); 
<i>Andrena accepta </i>Viereck, 1916, replacement name
</td>
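
The markup carries more structure than the plain text: the current valid name is the linked italic text, its authorship sits in the `<font>` element, and each other name is an unlinked italic run followed by plain-text authorship. Here is a rough sketch of reading one such cell with Python's standard `html.parser`. The class and function names are hypothetical, and this is not necessarily how Nomer parses the page.

```python
from html.parser import HTMLParser


class ChecklistCellParser(HTMLParser):
    """Collect the valid name and other names from one checklist <td> cell."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.accepted = {"name": "", "href": None, "authorship": ""}
        self.others = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        if tag == "a":
            self.accepted["href"] = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        if "a" in self.open_tags:  # linked text: the current valid name
            self.accepted["name"] += data
        elif "font" in self.open_tags:  # small font: authorship of the valid name
            self.accepted["authorship"] += data
        elif "i" in self.open_tags:  # unlinked italics: start of another name
            self.others.append({"name": data, "rest": ""})
        elif self.others:  # plain text following an italic name
            self.others[-1]["rest"] += data


def parse_cell(markup):
    """Return (accepted, others) with surrounding whitespace and ';' removed."""
    parser = ChecklistCellParser()
    parser.feed(markup)
    parser.close()
    accepted = {
        key: value.strip() if isinstance(value, str) else value
        for key, value in parser.accepted.items()
    }
    others = [
        {"name": o["name"].strip(), "rest": o["rest"].strip().rstrip(";").strip()}
        for o in parser.others
    ]
    return accepted, others


cell = """<td>
&nbsp;&nbsp;&nbsp;
<i><a href="/mp/20q?search=Acamptopoeum+vagans" target="_self">Acamptopoeum vagans</a></i>
<font size="-1" face="sans-serif">(Cockerell, 1926)</font>
--
<i>Camptopoeum (Acamptopoeum) vagans </i>Cockerell, 1926
</td>"""
print(parse_cell(cell))
```

The `rest` field deliberately keeps authorship and any trailing remark (such as `replacement name`) together, since the thread leaves the meaning of those remarks open.
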

seltmann commented 2 years ago

Example with subgenus

Andrena anisochlora Cockerell, 1936 -- Andrena (Micrandrena) dinognatha Timberlake, 1938

Current valid name - Andrena anisochlora Cockerell, 1936
Scientific Name - Andrena anisochlora
Authorship - Cockerell, 1936
Other names - Andrena (Micrandrena) dinognatha Timberlake, 1938

Where (Micrandrena) is a subgenus, which is always in parentheses after the genus.
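
The subgenus rule can be expressed as a small pattern. The sketch below splits a name string into genus, optional subgenus, epithet, and authorship. It assumes a single lowercase epithet with no hyphens or infraspecific ranks, and that the scientific name omits the subgenus, as in the manual breakdowns in this thread; `split_name` is a hypothetical helper.

```python
import re

NAME_PATTERN = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)"               # capitalized genus
    r"(?:\s+\((?P<subgenus>[A-Z][a-z]+)\))?"  # optional subgenus in parentheses
    r"\s+(?P<epithet>[a-z]+)"                 # lowercase species epithet
    r"\s*(?P<authorship>.*)$"                 # everything else is authorship
)


def split_name(text):
    """Return the name parts as a dict, or None if the text does not fit the pattern."""
    match = NAME_PATTERN.match(text.strip())
    if match is None:
        return None
    parts = match.groupdict()
    parts["scientific_name"] = f"{parts['genus']} {parts['epithet']}"
    return parts


print(split_name("Andrena (Micrandrena) dinognatha Timberlake, 1938"))
print(split_name("Acamptopoeum vagans (Cockerell, 1926)"))
```

A parenthesized authorship such as `(Cockerell, 1926)` is not mistaken for a subgenus, because the subgenus group only accepts a single capitalized word between the parentheses.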

jhpoelen commented 2 years ago

@seltmann Just added synonym support and multi-match support to Nomer's discoverlife taxon matcher:

$ echo -e "\tAcamptopoeum argentinum" | nomer append discoverlife-taxon
using matcher [discoverlife-taxon]
    Acamptopoeum argentinum SAME_AS https://www.discoverlife.org/mp/20q?search=Acamptopoeum+argentinum  Acamptopoeum argentinum species     Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Acamptopoeum argentinum    https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Andrenidae | https://www.discoverlife.org/mp/20q?search=Acamptopoeum+argentinum  kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Acamptopoeum+argentinum  
    Acamptopoeum argentinum SYNONYM_OF  https://www.discoverlife.org/mp/20q?search=Perdita+argentina    Perdita argentina               https://www.discoverlife.org/mp/20q?search=Perdita+argentina    

Notice the synonyms.
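
To work with output like the above downstream, the rows can be grouped by relation. The sketch below does this for tab-separated text. The meaning of the first five columns is inferred from the output in this thread, not taken from Nomer's documentation, and the sample rows are shortened reconstructions of the lines above, with tabs restored.

```python
from collections import defaultdict

# Assumed meaning of the first five tab-separated columns (inferred, not documented here).
FIELDS = ("provided_id", "provided_name", "relation", "resolved_id", "resolved_name")


def group_by_relation(tsv_text):
    """Map each relation (e.g. SYNONYM_OF) to the resolved names it points to."""
    grouped = defaultdict(list)
    for line in tsv_text.splitlines():
        cells = line.split("\t")
        if len(cells) < len(FIELDS):
            continue  # skip log lines such as "using matcher [...]"
        row = dict(zip(FIELDS, cells))
        grouped[row["relation"]].append(row["resolved_name"])
    return dict(grouped)


sample = (
    "using matcher [discoverlife-taxon]\n"
    "\tAcamptopoeum argentinum\tSAME_AS"
    "\thttps://www.discoverlife.org/mp/20q?search=Acamptopoeum+argentinum"
    "\tAcamptopoeum argentinum\tspecies\n"
    "\tAcamptopoeum argentinum\tSYNONYM_OF"
    "\thttps://www.discoverlife.org/mp/20q?search=Perdita+argentina"
    "\tPerdita argentina\n"
)
print(group_by_relation(sample))
```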

Also note that indexing all ~50k names takes about 30 s.

$ echo -e "\tAcamptopoeum argentinum" | nomer append discoverlife-taxon
using matcher [discoverlife-taxon]
DiscoverLife name indexing started...
[50590] DiscoverLife names were indexed in 26s at 1945names/s
    Acamptopoeum argentinum SAME_AS https://www.discoverlife.org/mp/20q?search=Acamptopoeum+argentinum  Acamptopoeum argentinum species     Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Acamptopoeum argentinum    https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Andrenidae | https://www.discoverlife.org/mp/20q?search=Acamptopoeum+argentinum  kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Acamptopoeum+argentinum  
    Acamptopoeum argentinum SYNONYM_OF  https://www.discoverlife.org/mp/20q?search=Perdita+argentina    Perdita argentina               https://www.discoverlife.org/mp/20q?search=Perdita+argentina    

After that I ran all GloBI taxon names against the matcher and found that 1.26M names were matched in 33 seconds, at about 38k names/second.
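
The throughput figures quoted in this thread can be cross-checked with simple division. The counts and durations below are copied from the indexing log line and from the 1.26M names in 33 seconds mentioned above; `rate` is just an illustrative helper. Because those inputs are rounded, the results agree only approximately with the reported rates.

```python
def rate(count, seconds):
    """Names handled per second."""
    return count / seconds


indexing = rate(50_590, 26)     # names indexed, seconds taken
matching = rate(1_260_000, 33)  # names piped through the matcher, seconds taken

print(f"indexing: {indexing:.0f} names/s")
print(f"matching: {matching / 1000:.1f}k names/s")
```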

$ zcat ~/tmp/names.tsv.gz | nomer append discoverlife-taxon | pv -l > /dev/null
using matcher [discoverlife-taxon]
1.26M 0:00:33 [38.0k/s] [                             

jhpoelen commented 2 years ago

I've made some changes to the discoverlife support in https://github.com/globalbioticinteractions/nomer/releases/tag/0.2.4 .

Please review the functionality by playing around with Nomer v0.2.4 with commands like:

$ echo -e "\tApis mellifera" | nomer append discoverlife 
...
    Apis mellifera  HAS_ACCEPTED_NAME   https://www.discoverlife.org/mp/20q?search=Apis+mellifera   Apis mellifera  species     Animalia | Arthropoda | Insecta | Hymenoptera | Apidae | Apis mellifera https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Apidae | https://www.discoverlife.org/mp/20q?search=Apis+mellifera   kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Apis+mellifera   

jhpoelen commented 2 years ago

@seltmann Closing the main issue of adding discoverlife support. Please report specific issues/improvement suggestions in separate issues.