globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0

Create a way for nomer to export an entire taxonomy (e.g., ITIS, DiscoverLife) #43

Closed seltmann closed 3 years ago

seltmann commented 3 years ago

As a function of nomer, create a DwC-A export from user-defined name lists. For example, create a checklist of bees that includes ITIS and DiscoverLife bee names, including synonyms.

jhpoelen commented 3 years ago

@seltmann sounds good!

Can you include an example of an existing checklist dwc-a that you'd like to use as a guiding example?

jhpoelen commented 3 years ago

@seltmann provided ITIS DwC-A as an example:

https://www.gbif.org/dataset/9ca92552-f23a-41a8-a140-01abaa31c931

Note, however, that https://itis.gov/downloads/index.html does not offer a DwC-A bulk download.

Also, I noticed that GBIF's ITIS page points to https://hosted-datasets.gbif.org/datasets/itis.zip as its source (see https://www.gbif.org/dataset/9ca92552-f23a-41a8-a140-01abaa31c931#description).

jhpoelen commented 3 years ago

See https://itis.gov/dwca_format.html for ITIS DwC-A usage information.

jhpoelen commented 3 years ago

With recent changes, I was able to produce the following output:

$ nomer dump discoverlife-taxon | head 
using matcher [discoverlife-taxon]
https://www.discoverlife.org/mp/20q?search=Acamptopoeum+argentinum  Acamptopoeum argentinum HAS_ACCEPTED_NAME   https://www.discoverlife.org/mp/20q?search=Acamptopoeum+argentinum  Acamptopoeum argentinum species     Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Acamptopoeum argentinum    https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Andrenidae | https://www.discoverlife.org/mp/20q?search=Acamptopoeum+argentinum  kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Acamptopoeum+argentinum  
https://www.discoverlife.org/mp/20q?search=Acamptopoeum+calchaqui   Acamptopoeum calchaqui  HAS_ACCEPTED_NAME   https://www.discoverlife.org/mp/20q?search=Acamptopoeum+calchaqui   Acamptopoeum calchaqui  species     Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Acamptopoeum calchaqui https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Andrenidae | https://www.discoverlife.org/mp/20q?search=Acamptopoeum+calchaqui   kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Acamptopoeum+calchaqui   
https://www.discoverlife.org/mp/20q?search=Acamptopoeum+colombiense Acamptopoeum colombiense    HAS_ACCEPTED_NAME   https://www.discoverlife.org/mp/20q?search=Acamptopoeum+colombiense Acamptopoeum colombiense    species     Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Acamptopoeum colombiense    https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Andrenidae | https://www.discoverlife.org/mp/20q?search=Acamptopoeum+colombiense kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Acamptopoeum+colombiense 
https://www.discoverlife.org/mp/20q?search=Acamptopoeum+colombiensis_sic    Acamptopoeum colombiensis_sic   SYNONYM_OF  https://www.discoverlife.org/mp/20q?search=Acamptopoeum+colombiense Acamptopoeum colombiense    species     Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Acamptopoeum colombiense    https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Andrenidae | https://www.discoverlife.org/mp/20q?search=Acamptopoeum+colombiense kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Acamptopoeum+colombiense 
https://www.discoverlife.org/mp/20q?search=Acamptopoeum+fernandezi  Acamptopoeum fernandezi HAS_ACCEPTED_NAME   https://www.discoverlife.org/mp/20q?search=Acamptopoeum+fernandezi  Acamptopoeum fernandezi species     Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Acamptopoeum fernandezi    https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Andrenidae | https://www.discoverlife.org/mp/20q?search=Acamptopoeum+fernandezi  kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Acamptopoeum+fernandezi  
https://www.discoverlife.org/mp/20q?search=Acamptopoeum+inauratum   Acamptopoeum inauratum  HAS_ACCEPTED_NAME   https://www.discoverlife.org/mp/20q?search=Acamptopoeum+inauratum   Acamptopoeum inauratum  species     Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Acamptopoeum inauratum https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Andrenidae | https://www.discoverlife.org/mp/20q?search=Acamptopoeum+inauratum   kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Acamptopoeum+inauratum   
https://www.discoverlife.org/mp/20q?search=Acamptopoeum+melanogaster    Acamptopoeum melanogaster   HAS_ACCEPTED_NAME   https://www.discoverlife.org/mp/20q?search=Acamptopoeum+melanogaster    Acamptopoeum melanogaster   species     Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Acamptopoeum melanogaster   https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Andrenidae | https://www.discoverlife.org/mp/20q?search=Acamptopoeum+melanogaster    kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Acamptopoeum+melanogaster    
https://www.discoverlife.org/mp/20q?search=Acamptopoeum+nigritarse  Acamptopoeum nigritarse HAS_ACCEPTED_NAME   https://www.discoverlife.org/mp/20q?search=Acamptopoeum+nigritarse  Acamptopoeum nigritarse species     Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Acamptopoeum nigritarse    https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Andrenidae | https://www.discoverlife.org/mp/20q?search=Acamptopoeum+nigritarse  kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Acamptopoeum+nigritarse  
https://www.discoverlife.org/mp/20q?search=Acamptopoeum+prinii  Acamptopoeum prinii HAS_ACCEPTED_NAME   https://www.discoverlife.org/mp/20q?search=Acamptopoeum+prinii  Acamptopoeum prinii species     Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Acamptopoeum prinii    https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Andrenidae | https://www.discoverlife.org/mp/20q?search=Acamptopoeum+prinii  kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Acamptopoeum+prinii  
https://www.discoverlife.org/mp/20q?search=Acamptopoeum+submetallicum   Acamptopoeum submetallicum  HAS_ACCEPTED_NAME   https://www.discoverlife.org/mp/20q?search=Acamptopoeum+submetallicum   Acamptopoeum submetallicum  species     Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Acamptopoeum submetallicum  https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Andrenidae | https://www.discoverlife.org/mp/20q?search=Acamptopoeum+submetallicum   kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Acamptopoeum+submetallicum   
jhpoelen commented 3 years ago

Also, with recent changes, the ITIS offline matcher supports dump:

$ nomer dump itis-taxon-id | grep Apidae | wc -l
using matcher [itis-taxon-id]
ITIS taxonomy already indexed at [xxxx/nomer/itis/itis], no need to import.
6645

That is 6645 names related to Apidae, which break down as follows:

$ nomer dump itis-taxon-id | grep Apidae | grep -o -P "(SYNONYM_OF|HAS_ACCEPTED_NAME)" | sort | uniq -c 
using matcher [itis-taxon-id]
ITIS taxonomy already indexed at [/media/jorrit/data/nomer/itis/itis], no need to import.
   6105 HAS_ACCEPTED_NAME
    540 SYNONYM_OF

6105 accepted names and 540 synonyms.
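
The `grep -o -P` pipeline above counts the relation strings wherever they occur in a line. A column-based variant is sketched below, assuming the relation type is the third tab-separated column, as the sample rows in this thread suggest. `count_relations` and `sample_rows` are hypothetical helper names, and the sample rows are made up for illustration; they are not ITIS data.

```shell
# Count relation types among the rows of a nomer-style TSV (read from stdin)
# that mention a given taxon name anywhere in the line.
# Assumption: column 3 holds the relation type (e.g. SYNONYM_OF).
count_relations() {
  awk -F'\t' -v taxon="$1" '
    index($0, taxon) > 0 { counts[$3]++ }
    END { for (r in counts) printf "%s\t%d\n", r, counts[r] }
  ' | sort
}

# Made-up rows standing in for `nomer dump itis-taxon-id` output.
sample_rows() {
  printf 'X:1\tApis\tHAS_ACCEPTED_NAME\tX:1\tApis\tgenus\t\tAnimalia | Apidae | Apis\n'
  printf 'X:2\tApis old\tSYNONYM_OF\tX:1\tApis\tgenus\t\tAnimalia | Apidae | Apis\n'
  printf 'X:3\tBombus\tHAS_ACCEPTED_NAME\tX:3\tBombus\tgenus\t\tAnimalia | Apidae | Bombus\n'
  printf 'X:4\tAndrena\tHAS_ACCEPTED_NAME\tX:4\tAndrena\tgenus\t\tAnimalia | Andrenidae | Andrena\n'
}

# Prints one line per relation type: the relation, a tab, and its count.
sample_rows | count_relations Apidae
```

With a real dump, the call would be `nomer dump itis-taxon-id | count_relations Apidae`.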

jhpoelen commented 3 years ago

Note that this is based on:

Integrated Taxonomic Information System. (2020). Repackaged Full ITIS Data Set (MS SQL Server) (itisMS.043020) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3833105

which is based on an ITIS data export provided in 2020. Updates can be made by adjusting the ITIS nomer properties manually and/or updating the defaults.

$ nomer properties | grep itis
nomer.itis.synonym_links=gz:https://zenodo.org/record/3833105/files/synonym_links.gz!/synonym_links
nomer.itis.taxon_unit_types=gz:https://zenodo.org/record/3833105/files/taxon_unit_types.gz!/taxon_unit_types
nomer.itis.taxonomic_units=gz:https://zenodo.org/record/3833105/files/taxonomic_units.gz!/taxonomic_units
jhpoelen commented 3 years ago

In total for ITIS:

$ nomer dump itis-taxon-id | grep -o -P "(SYNONYM_OF|HAS_ACCEPTED_NAME)" | sort | uniq -c 
using matcher [itis-taxon-id]
ITIS taxonomy already indexed at [xxxx/data/nomer/itis/itis], no need to import.
 600434 HAS_ACCEPTED_NAME
 234551 SYNONYM_OF

~600k accepted names and ~234k synonyms.

jhpoelen commented 3 years ago

Performance is currently at:

$ nomer dump itis-taxon-id | pv -l > /dev/null
using matcher [itis-taxon-id]
ITIS taxonomy already indexed at [/xxxx/nomer/itis/itis], no need to import.
 834k 0:00:32 [25.6k/s]

This exports ~834k names in about 30 seconds.

jhpoelen commented 3 years ago

Implemented in https://github.com/globalbioticinteractions/nomer/releases/tag/0.2.4 .

@seltmann if you get the chance, please reproduce (note that it takes about 30s or more):

$ nomer dump itis > itis.tsv 
...
$ cat itis.tsv | head -n2 
ITIS:50 Bacteria    HAS_ACCEPTED_NAME   ITIS:50 Bacteria    kingdom     Bacteria ITIS:50 kingdom http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=50    
ITIS:51 Schizomycetes   SYNONYM_OF  ITIS:50 Bacteria    kingdom     Bacteria    ITIS:50 kingdom http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=50    
$ cat itis.tsv | sha256sum 
d9ea9fd1d44aeedc86643527b51055b2ae220674aa27127b7fbe2f7d07442332  -
$ ls -lha itis.tsv
xxxx 579M xxx itis.tsv
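
To pull synonym/accepted-name pairs out of such a dump, a minimal sketch follows. It assumes the column layout inferred from the rows above (2 = provided name, 3 = relation, 5 = resolved name). `list_synonyms` and `itis_head_rows` are hypothetical helper names, and the sample repeats the two rows above with only their first six columns.

```shell
# Print "synonym<TAB>accepted name" for every synonym row read from stdin.
# Assumed columns: 2 = provided name, 3 = relation, 5 = resolved name.
list_synonyms() {
  awk -F'\t' '$3 == "SYNONYM_OF" { print $2 "\t" $5 }'
}

# The two ITIS rows shown above, tab-separated, trailing columns omitted.
itis_head_rows() {
  printf 'ITIS:50\tBacteria\tHAS_ACCEPTED_NAME\tITIS:50\tBacteria\tkingdom\n'
  printf 'ITIS:51\tSchizomycetes\tSYNONYM_OF\tITIS:50\tBacteria\tkingdom\n'
}

itis_head_rows | list_synonyms
```

Against the full file, this would be `list_synonyms < itis.tsv`.
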
jhpoelen commented 3 years ago

Similarly, please reproduce:

$ nomer dump discoverlife > discoverlife.tsv
...
$ cat discoverlife.tsv | head -n2 
https://www.discoverlife.org/mp/20q?search=Acamptopoeum+argentinum  Acamptopoeum argentinum HAS_ACCEPTED_NAME   https://www.discoverlife.org/mp/20q?search=Acamptopoeum+argentinum  Acamptopoeum argentinum species     Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Acamptopoeum argentinum    https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Andrenidae | https://www.discoverlife.org/mp/20q?search=Acamptopoeum+argentinum  kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Acamptopoeum+argentinum  
https://www.discoverlife.org/mp/20q?search=Acamptopoeum+calchaqui   Acamptopoeum calchaqui  HAS_ACCEPTED_NAME   https://www.discoverlife.org/mp/20q?search=Acamptopoeum+calchaqui   Acamptopoeum calchaqui  species     Animalia | Arthropoda | Insecta | Hymenoptera | Andrenidae | Acamptopoeum calchaqui https://www.discoverlife.org/mp/20q?search=Animalia | https://www.discoverlife.org/mp/20q?search=Arthropoda | https://www.discoverlife.org/mp/20q?search=Insecta | https://www.discoverlife.org/mp/20q?search=Hymenoptera | https://www.discoverlife.org/mp/20q?search=Andrenidae | https://www.discoverlife.org/mp/20q?search=Acamptopoeum+calchaqui   kingdom | phylum | class | order | family | species https://www.discoverlife.org/mp/20q?search=Acamptopoeum+calchaqui
$ cat discoverlife.tsv | sha256sum 
c415109c04449a36ff398602b7afa623540dab8b1a6c628e020907386463b900  -
$ ls -lha discoverlife.tsv 
xxxx 36M xxxx discoverlife.tsv

See attached itis.tsv and discoverlife.tsv: discoverlife.tsv.gz, itis.tsv.gz
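
To check a locally produced dump against the digests posted above, a small helper can be sketched as follows. `verify_sha256` is a hypothetical name, and the sketch assumes GNU `sha256sum` is available. A mismatch may simply reflect a different nomer version or different source data rather than an error.

```shell
# Succeeds when the SHA-256 digest of file $2 equals the expected hex digest $1.
verify_sha256() {
  actual="$(sha256sum "$2" | cut -d' ' -f1)"
  [ "$actual" = "$1" ]
}

# Example with the itis.tsv digest posted above; skipped if the file is absent.
if [ -f itis.tsv ]; then
  if verify_sha256 d9ea9fd1d44aeedc86643527b51055b2ae220674aa27127b7fbe2f7d07442332 itis.tsv; then
    echo "itis.tsv matches the posted digest"
  else
    echo "itis.tsv differs from the posted digest"
  fi
fi
```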

seltmann commented 3 years ago

@jhpoelen today in the TDWG taxonomic backbone discussion I asked the question:

"I have my own taxon names list, how best (easiest) for me to include my name list into globalnames services?"

Dima answered: "send me an email and I will include in globalnames"

I think this is a solution for including our bee names into globalnames.

seltmann commented 3 years ago

In the same meeting, Joe Miller suggested: "please submit it as a checklist to GBIF, it will get a DOI and you can compare it to COL and GBIF"

jhpoelen commented 3 years ago

Leaving out the DOI (that is a whole separate discussion), having a data-driven approach can be very neat, no matter where the datasets are stored (GBIF, @dimus's own internal globalnames infrastructure, Zenodo, Internet Archive).

This comes back to the separation of datasets (versions), and tracking their use.

jhpoelen commented 3 years ago

fyi @mjy

jhpoelen commented 3 years ago

@seltmann we've been using the "dump" or "list" features for a little while now.

Closing this issue. Please report bugs in this functionality in specific, newly created issues.