Closed seltmann closed 2 years ago
@seltmann I've implemented a first pass at DiscoverLife taxon matching in nomer:
$ echo -e "\tApis mellifera" | nomer append discoverlife-taxon
using matcher [discoverlife-taxon]
lazy init of taxonomy index [DiscoverLifeService] started...
index directory at [/tmp/taxon1632771108502] created.
lazy init of taxonomy index [DiscoverLifeService] done.
Apis mellifera SAME_AS https://www.discoverlife.org/mp/20q?search=Apis+mellifera Apis mellifera species Animalia | Arthropoda | Insecta | Hymenoptera | Apidae | Apis mellifera https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Apidae | https://www.discoverlife.org/mp/20q?search=Apis+mellifera kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Apis+mellifera
Note that name variations are not yet included.
Can you help by showing some examples of synonyms etc. in the Discover Life notation? I am probably not as familiar with their notation as you are.
BTW - at first glance, the Discover Life list contains about 20k names. Does that sound about right?
A quick performance test shows that the (un-optimized) Discover Life taxon matcher can match about 3k names/s or 3 names / ms.
@jhpoelen 20K names is right. That is about the number of worldwide bee species.
Acamptopoeum vagans (Cockerell, 1926) -- Camptopoeum (Acamptopoeum) vagans Cockerell, 1926
Current valid name - Acamptopoeum vagans (Cockerell, 1926)
Scientific Name - Acamptopoeum vagans
Authorship - (Cockerell, 1926)
Other name - Camptopoeum (Acamptopoeum) vagans Cockerell, 1926
Other names are not defined by Discover Life except on a few occasions, but most are prior names for the current valid taxon.
Andrena accepta Viereck, 1916 -- Andrena pulchella_homonym Robertson, 1891; Pterandrena pulchella (Robertson, 1891); Andrena accepta Viereck, 1916, replacement name
Current valid name - Andrena accepta Viereck, 1916
Scientific Name - Andrena accepta
Authorship - Viereck, 1916
Other names - Andrena pulchella Robertson, 1891; Pterandrena pulchella (Robertson, 1891); Andrena accepta Viereck, 1916
Andrena pulchella_homonym Robertson, 1891 - indicates that Andrena pulchella is a homonym
Andrena accepta Viereck, 1916 - not entirely sure what "replacement name" indicates, but it is some kind of synonym
@seltmann thanks for providing your manually parsed examples.
Here's the markup associated with the examples:
<td>
<i><a href="/mp/20q?search=Acamptopoeum+vagans" target="_self">Acamptopoeum vagans</a></i>
<font size="-1" face="sans-serif">(Cockerell, 1926)</font>
--
<i>Camptopoeum (Acamptopoeum) vagans </i>Cockerell, 1926
</td>
and
<td>
<i><a href="/mp/20q?search=Andrena+accepta" target="_self">Andrena accepta</a></i>
<font size="-1" face="sans-serif">Viereck, 1916</font>
--
<i>Andrena pulchella_homonym </i>Robertson, 1891;
<i>Pterandrena pulchella </i>(Robertson, 1891);
<i>Andrena accepta </i>Viereck, 1916, replacement name
</td>
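To make the mapping from markup to names concrete, here is a minimal Python sketch of how one such table cell could be split into an accepted name and its other names. It is not Nomer's actual implementation; the class and function names (`CellParser`, `parse_cell`) are hypothetical, and it assumes only what the two cells above show: the first italic run is the accepted name, the `<font>` run is its authorship, and every later italic run is another name followed by its authorship text up to the next `;`.

```python
from html.parser import HTMLParser


class CellParser(HTMLParser):
    """Collect text runs from one <td>, tagged by the enclosing <i> or <font>."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open <i>/<font> tags
        self.parts = []  # (kind, text) pairs, kind is "i", "font" or "text"

    def handle_starttag(self, tag, attrs):
        if tag in ("i", "font"):
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text:
            self.parts.append((self.stack[-1] if self.stack else "text", text))


def parse_cell(html):
    """Return (accepted, others) for a single checklist cell (assumed layout)."""
    parser = CellParser()
    parser.feed(html)
    parser.close()
    accepted = {"name": None, "authorship": None}
    others = []
    for kind, text in parser.parts:
        if kind == "i" and accepted["name"] is None:
            accepted["name"] = text
        elif kind == "i":
            others.append({"name": text, "authorship": ""})
        elif kind == "font":
            accepted["authorship"] = text
        else:
            text = text.rstrip(";").strip()
            # Plain text after an italic name is that name's authorship;
            # the bare "--" separator carries no information.
            if text and text != "--" and others:
                others[-1]["authorship"] = text
    return accepted, others


CELL = """<td>
<i><a href="/mp/20q?search=Andrena+accepta" target="_self">Andrena accepta</a></i>
<font size="-1" face="sans-serif">Viereck, 1916</font>
--
<i>Andrena pulchella_homonym </i>Robertson, 1891;
<i>Pterandrena pulchella </i>(Robertson, 1891);
<i>Andrena accepta </i>Viereck, 1916, replacement name
</td>"""

accepted, others = parse_cell(CELL)
```

Note that trailing remarks such as "replacement name" stay attached to the authorship text in this sketch; separating them would need an extra rule.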
Example with subgenus
Andrena anisochlora Cockerell, 1936 -- Andrena (Micrandrena) dinognatha Timberlake, 1938
Current valid name - Andrena anisochlora Cockerell, 1936
Scientific Name - Andrena anisochlora
Authorship - Cockerell, 1936
Other names - Andrena (Micrandrena) dinognatha Timberlake, 1938
Where (Micrandrena) is a subgenus, which is always in parentheses after the genus.
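The plain-text notation in these examples can be summarised as a small parser. The Python sketch below is only an illustration under the assumptions visible in this thread: ` -- ` separates the current valid name from the other names, `;` separates other names, a subgenus appears in parentheses directly after the genus, `_homonym` is a suffix on the epithet, and authorship ends in a four-digit year optionally followed by a remark such as "replacement name". The function names are hypothetical, and names that do not fit this shape (for example trinomials) are not handled.

```python
import re

# Genus, optional "(Subgenus)", epithet with optional "_homonym", then the rest.
NAME = re.compile(
    r"^(?P<genus>[A-Z][a-z-]+)"
    r"(?:\s+\((?P<subgenus>[A-Z][a-z-]+)\))?"
    r"\s+(?P<epithet>[a-z-]+)(?P<homonym>_homonym)?"
    r"\s*(?P<rest>.*)$"
)
# Authorship up to the first four-digit year, then an optional trailing remark.
AUTHOR = re.compile(r"^(?P<authorship>\(?.+?,\s*\d{4}\)?)(?:,\s*(?P<note>.*))?$")


def parse_name(text):
    """Split one name string into its parts (assumed notation, see lead-in)."""
    m = NAME.match(text.strip())
    if m is None:
        raise ValueError(f"unrecognised name: {text!r}")
    a = AUTHOR.match(m["rest"])
    return {
        "name": f'{m["genus"]} {m["epithet"]}',
        "subgenus": m["subgenus"],
        "homonym": m["homonym"] is not None,
        "authorship": a["authorship"] if a else m["rest"],
        "note": a["note"] if a else None,
    }


def parse_entry(line):
    """Parse 'accepted -- other; other; ...' into accepted and other names."""
    accepted, _, others = line.partition(" -- ")
    return {
        "accepted": parse_name(accepted),
        "others": [parse_name(o) for o in others.split(";") if o.strip()],
    }


entry = parse_entry(
    "Andrena anisochlora Cockerell, 1936 -- "
    "Andrena (Micrandrena) dinognatha Timberlake, 1938"
)
```

For this entry, `entry["others"][0]["subgenus"]` is `"Micrandrena"` and `entry["others"][0]["name"]` is `"Andrena dinognatha"`, which is the form that would likely be used for matching.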
@seltmann Just added synonym support and multi-match support to Nomer's discoverlife taxon matcher:
$ echo -e "\tAcamptopoeum argentinum" | nomer append discoverlife-taxon
using matcher [discoverlife-taxon]
Acamptopoeum argentinum SAME_AS https://www.discoverlife.org/mp/20q?search=Acamptopoeum+argentinum Acamptopoeum argentinum species Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Acamptopoeum argentinum https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Andrenidae | https://www.discoverlife.org/mp/20q?search=Acamptopoeum+argentinum kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Acamptopoeum+argentinum
Acamptopoeum argentinum SYNONYM_OF https://www.discoverlife.org/mp/20q?search=Perdita+argentina Perdita argentina https://www.discoverlife.org/mp/20q?search=Perdita+argentina
Notice the synonyms.
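Since one provided name can now yield several rows, downstream scripts may want to group them. The Python sketch below shows one way to do that. It assumes the output is tab-separated with the first five columns being provided id, provided name, relation, resolved id and resolved name, as the examples above suggest; that column layout is an assumption here, not a documented contract, and `group_matches` is a hypothetical helper.

```python
from collections import defaultdict


def group_matches(lines):
    """Group output rows by provided name as (relation, resolved id, resolved name)."""
    matches = defaultdict(list)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # skip anything that is not a result row, e.g. log lines
        _provided_id, provided_name, relation, resolved_id, resolved_name = cols[:5]
        matches[provided_name].append((relation, resolved_id, resolved_name))
    return dict(matches)


# Rows abbreviated from the example above; tabs are assumed as separators.
rows = [
    "\tAcamptopoeum argentinum\tSAME_AS\t"
    "https://www.discoverlife.org/mp/20q?search=Acamptopoeum+argentinum\t"
    "Acamptopoeum argentinum\tspecies",
    "\tAcamptopoeum argentinum\tSYNONYM_OF\t"
    "https://www.discoverlife.org/mp/20q?search=Perdita+argentina\t"
    "Perdita argentina",
]
grouped = group_matches(rows)
```

Here `grouped["Acamptopoeum argentinum"]` holds two tuples, one for `SAME_AS` and one for `SYNONYM_OF`.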
Also note that indexing all ~50k names takes about 30s.
$ echo -e "\tAcamptopoeum argentinum" | nomer append discoverlife-taxon
using matcher [discoverlife-taxon]
DiscoverLife name indexing started...
[50590] DiscoverLife names were indexed in 26s at 1945names/s
Acamptopoeum argentinum SAME_AS https://www.discoverlife.org/mp/20q?search=Acamptopoeum+argentinum Acamptopoeum argentinum species Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Acamptopoeum argentinum https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Andrenidae | https://www.discoverlife.org/mp/20q?search=Acamptopoeum+argentinum kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Acamptopoeum+argentinum
Acamptopoeum argentinum SYNONYM_OF https://www.discoverlife.org/mp/20q?search=Perdita+argentina Perdita argentina https://www.discoverlife.org/mp/20q?search=Perdita+argentina
After that I ran all GloBI taxon names against the matcher and found that 1.26M names were matched in 33 seconds, at about 38k names/second.
$ zcat ~/tmp/names.tsv.gz | nomer append discoverlife-taxon | pv -l > /dev/null
using matcher [discoverlife-taxon]
1.26M 0:00:33 [38.0k/s] [
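As a quick sanity check on the reported rates, the figures from the two transcripts can be recomputed. This is only arithmetic on the numbers quoted above (50590 names in 26s for indexing, 1.26M names in 33s for matching); `names_per_second` is a hypothetical helper.

```python
def names_per_second(names, seconds):
    """Average throughput over the whole run."""
    return names / seconds


indexing_rate = names_per_second(50590, 26)    # indexing run
matching_rate = names_per_second(1.26e6, 33)   # matching run

print(int(indexing_rate))                # 1945
print(round(matching_rate / 1000, 1))    # 38.2
```

Both values are in line with the logged 1945names/s and the roughly 38k names/s shown by pv.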
I've made some changes to the discoverlife support in https://github.com/globalbioticinteractions/nomer/releases/tag/0.2.4 .
Please review the functionality by playing around with Nomer v0.2.4 with commands like:
$ echo -e "\tApis mellifera" | nomer append discoverlife
...
Apis mellifera HAS_ACCEPTED_NAME https://www.discoverlife.org/mp/20q?search=Apis+mellifera Apis mellifera species Animalia | Arthropoda | Insecta | Hymenoptera | Apidae | Apis mellifera https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Apidae | https://www.discoverlife.org/mp/20q?search=Apis+mellifera kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Apis+mellifera
@seltmann closing the main issue of adding discoverlife support. Please report specific issues/improvement suggestions in separate issues.
@jhpoelen Can we add the Discover Life bee checklist to Nomer? https://www.discoverlife.org/mp/20q?act=x_checklist&guide=Apoidea_species&flags=HAS:
Ascher, J. S. and J. Pickering. 2020. Discover Life bee species guide and world checklist (Hymenoptera: Apoidea: Anthophila). http://www.discoverlife.org/mp/20q?guide=Apoidea_species