globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0

add support for offline matching of https://www.catalogueoflife.org/data/download #47

Closed jhpoelen closed 2 years ago

jhpoelen commented 2 years ago

Here's a Darwin Core Archive (DwC-A):

https://download.catalogueoflife.org/col/annual/2021_dwca.zip

jhpoelen commented 2 years ago

An initial implementation of the Catalogue of Life matcher produces:

$ echo -e "\tArius felis" | nomer append -p my.properties col
[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [col]
[main] INFO org.globalbioticinteractions.nomer.match.CatalogueOfLifeTaxonService - [Catalogue of Life] taxonomy already indexed at [/media/jorrit/branta/nomer/catalogue_of_life/catalogue_of_life], no need to import.
    Arius felis SYNONYM_OF  COL:GMX9    Ariopsis felis  species     Biota | Animalia | Chordata | Actinopterygii | Siluriformes | Ariidae | Ariopsis | Ariopsis felis   COL:5T6MX | COL:N | COL:CH2 | COL:CT | COL:6236K | COL:6Q6 | COL:8RZ3D | COL:GMX9   unranked | kingdom | phylum | class | order | family | genus | species      
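
For anyone post-processing these rows, here is a minimal Python sketch that splits one result row into named fields and checks whether the provided name was resolved as a synonym. It assumes the output is tab-separated and that the columns appear in the order seen in the row above (provided id, provided name, match type, resolved id, resolved name, rank, common names, path, path ids, path ranks). The `Match` type and `parse_row` helper are hypothetical names made up for this illustration and are not part of nomer.

```python
from typing import NamedTuple


class Match(NamedTuple):
    # Field order is an assumption inferred from the example row above.
    provided_id: str
    provided_name: str
    match_type: str
    resolved_id: str
    resolved_name: str
    rank: str
    common_names: str
    path: str
    path_ids: str
    path_ranks: str


def parse_row(line: str) -> Match:
    """Split one tab-separated result row into named fields."""
    width = len(Match._fields)
    # Strip only the line ending: a leading tab means "no provided id".
    cells = line.rstrip("\r\n").split("\t")
    cells += [""] * (width - len(cells))  # pad rows that are missing trailing columns
    return Match(*(cell.strip() for cell in cells[:width]))


# The "Arius felis" row from above, rebuilt with explicit tab separators.
row = "\t".join([
    "",
    "Arius felis",
    "SYNONYM_OF",
    "COL:GMX9",
    "Ariopsis felis",
    "species",
    "",
    "Biota | Animalia | Chordata | Actinopterygii | Siluriformes | Ariidae | Ariopsis | Ariopsis felis",
    "COL:5T6MX | COL:N | COL:CH2 | COL:CT | COL:6236K | COL:6Q6 | COL:8RZ3D | COL:GMX9",
    "unranked | kingdom | phylum | class | order | family | genus | species",
])

match = parse_row(row)
if match.match_type == "SYNONYM_OF":
    print(f"{match.provided_name} is a synonym of {match.resolved_name} ({match.resolved_id})")
```

Log lines such as the `[main] INFO ...` messages would need to be filtered out before passing rows to a parser like this one.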

and

$ echo -e "\tAriopsis felis" | nomer append -p my.properties col
[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [col]
[main] INFO org.globalbioticinteractions.nomer.match.CatalogueOfLifeTaxonService - [Catalogue of Life] taxonomy already indexed at [/media/jorrit/branta/nomer/catalogue_of_life/catalogue_of_life], no need to import.
    Ariopsis felis  HAS_ACCEPTED_NAME   COL:GMX9    Ariopsis felis  species     Biota | Animalia | Chordata | Actinopterygii | Siluriformes | Ariidae | Ariopsis | Ariopsis felis   COL:5T6MX | COL:N | COL:CH2 | COL:CT | COL:6236K | COL:6Q6 | COL:8RZ3D | COL:GMX9   unranked | kingdom | phylum | class | order | family | genus | species      
$ echo -e "COL:6MB3T\t" | nomer append col
[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [col]
[main] INFO org.globalbioticinteractions.nomer.match.CatalogueOfLifeTaxonService - [Catalogue of Life] taxonomy already indexed at [/media/jorrit/branta/nomer/catalogue_of_life/catalogue_of_life], no need to import.
COL:6MB3T       HAS_ACCEPTED_NAME   COL:6MB3T   Homo sapiens    species     Biota | Animalia | Chordata | Mammalia | Theria | Eutheria | Primates | Haplorrhini | Simiiformes | Hominoidea | Hominidae | Homininae | Homo | Homo sapiens    COL:5T6MX | COL:N | COL:CH2 | COL:6224G | COL:6226C | COL:LG | COL:3W7 | COL:4DT | COL:4PM | COL:58L | COL:6256T | COL:JPH | COL:636X2 | COL:6MB3T  unranked | kingdom | phylum | class | subclass | infraclass | order | suborder | infraorder | superfamily | family | subfamily | genus | species        
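
The last three columns of each row (path names, path ids, path ranks) appear to be parallel, pipe-separated lists. As a sketch under that assumption, the following Python snippet zips them into one lineage and looks up the family of the Homo sapiens row above. `Taxon`, `split_pipes`, and `zip_lineage` are hypothetical helper names, not part of nomer.

```python
from typing import List, NamedTuple


class Taxon(NamedTuple):
    name: str
    taxon_id: str
    rank: str


def split_pipes(cell: str) -> List[str]:
    """Split a ' | '-separated cell into trimmed parts."""
    return [part.strip() for part in cell.split("|")]


def zip_lineage(path: str, path_ids: str, path_ranks: str) -> List[Taxon]:
    """Combine the three parallel path columns into one list of taxa."""
    names, ids, ranks = split_pipes(path), split_pipes(path_ids), split_pipes(path_ranks)
    if not (len(names) == len(ids) == len(ranks)):
        raise ValueError("path columns have different lengths")
    return [Taxon(n, i, r) for n, i, r in zip(names, ids, ranks)]


# Path columns copied from the Homo sapiens (COL:6MB3T) row above.
lineage = zip_lineage(
    "Biota | Animalia | Chordata | Mammalia | Theria | Eutheria | Primates | Haplorrhini"
    " | Simiiformes | Hominoidea | Hominidae | Homininae | Homo | Homo sapiens",
    "COL:5T6MX | COL:N | COL:CH2 | COL:6224G | COL:6226C | COL:LG | COL:3W7 | COL:4DT"
    " | COL:4PM | COL:58L | COL:6256T | COL:JPH | COL:636X2 | COL:6MB3T",
    "unranked | kingdom | phylum | class | subclass | infraclass | order | suborder"
    " | infraorder | superfamily | family | subfamily | genus | species",
)

# Pick the first taxon ranked as a family, if any.
family = next((taxon for taxon in lineage if taxon.rank == "family"), None)
print(family)
```

Raising on mismatched lengths is a deliberate choice here: silently truncating with `zip` would pair names with the wrong ids if a column were incomplete.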

and

$ nomer ls col | head
[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [col]
[main] INFO org.globalbioticinteractions.nomer.match.CatalogueOfLifeTaxonService - [Catalogue of Life] taxonomy already indexed at [/media/jorrit/branta/nomer/catalogue_of_life/catalogue_of_life], no need to import.
COL:32  Caldiserica HAS_ACCEPTED_NAME   COL:32  Caldiserica phylum  Biota | Bacteria | Negibacteria | Caldiserica   COL:5T6MX | COL:B | COL:622BB | COL:32  unranked | kingdom | subkingdom | phylum        
COL:322 Craciformes HAS_ACCEPTED_NAME   COL:322 Craciformes order   Biota | Animalia | Chordata | Aves | Galliformes | Craciformes  COL:5T6MX | COL:N | COL:CH2 | COL:V2 | COL:38Z | COL:322    unranked | kingdom | phylum | class | order | order     
COL:32223   Cryptoripersia hypolithus   HAS_ACCEPTED_NAME   COL:32223   Cryptoripersia hypolithus   species     Biota | Animalia | Arthropoda | Insecta | Hemiptera | Coccoidea | Pseudococcidae | Cryptoripersia | Cryptoripersia trichura | Cryptoripersia hypolithus COL:5T6MX | COL:N | COL:RT | COL:H6 | COL:HP | COL:4YN | COL:F8D | COL:3WSG | COL:3222J | COL:32223 unranked | kingdom | phylum | class | order | superfamily | family | genus | species | species      
COL:32224   Cryptoripersia kingii   HAS_ACCEPTED_NAME   COL:32224   Cryptoripersia kingii   species     Biota | Animalia | Arthropoda | Insecta | Hemiptera | Coccoidea | Pseudococcidae | Cryptoripersia | Cryptoripersia kingii   COL:5T6MX | COL:N | COL:RT | COL:H6 | COL:HP | COL:4YN | COL:F8D | COL:3WSG | COL:32224 unranked | kingdom | phylum | class | order | superfamily | family | genus | species        
COL:32225   Cryptoripersia kingii   HAS_ACCEPTED_NAME   COL:32225   Cryptoripersia kingii   species     Biota | Animalia | Arthropoda | Insecta | Hemiptera | Coccoidea | Pseudococcidae | Cryptoripersia | Cryptoripersia kingii | Cryptoripersia kingii   COL:5T6MX | COL:N | COL:RT | COL:H6 | COL:HP | COL:4YN | COL:F8D | COL:3WSG | COL:32224 | COL:32225 unranked | kingdom | phylum | class | order | superfamily | family | genus | species | species      
COL:32226   Cryptoripersia leucocystis  HAS_ACCEPTED_NAME   COL:32226   Cryptoripersia leucocystis  species     Biota | Animalia | Arthropoda | Insecta | Hemiptera | Coccoidea | Pseudococcidae | Cryptoripersia | Cryptoripersia leucocystis  COL:5T6MX | COL:N | COL:RT | COL:H6 | COL:HP | COL:4YN | COL:F8D | COL:3WSG | COL:32226 unranked | kingdom | phylum | class | order | superfamily | family | genus | species        
COL:32227   Cryptoripersia lii  HAS_ACCEPTED_NAME   COL:32227   Cryptoripersia lii  species     Biota | Animalia | Arthropoda | Insecta | Hemiptera | Coccoidea | Pseudococcidae | Cryptoripersia | Cryptoripersia lii | Cryptoripersia lii COL:5T6MX | COL:N | COL:RT | COL:H6 | COL:HP | COL:4YN | COL:F8D | COL:3WSG | COL:32228 | COL:32227 unranked | kingdom | phylum | class | order | superfamily | family | genus | species | species      
COL:32228   Cryptoripersia lii  HAS_ACCEPTED_NAME   COL:32228   Cryptoripersia lii  species     Biota | Animalia | Arthropoda | Insecta | Hemiptera | Coccoidea | Pseudococcidae | Cryptoripersia | Cryptoripersia lii  COL:5T6MX | COL:N | COL:RT | COL:H6 | COL:HP | COL:4YN | COL:F8D | COL:3WSG | COL:32228 unranked | kingdom | phylum | class | order | superfamily | family | genus | species        
COL:32229   Cryptoripersia loweri   HAS_ACCEPTED_NAME   COL:32229   Cryptoripersia loweri   species     Biota | Animalia | Arthropoda | Insecta | Hemiptera | Coccoidea | Pseudococcidae | Cryptoripersia | Cryptoripersia loweri   COL:5T6MX | COL:N | COL:RT | COL:H6 | COL:HP | COL:4YN | COL:F8D | COL:3WSG | COL:32229 unranked | kingdom | phylum | class | order | superfamily | family | genus | species        
COL:3222B   Cryptoripersia myrmecophylla    HAS_ACCEPTED_NAME   COL:3222B   Cryptoripersia myrmecophylla    species     Biota | Animalia | Arthropoda | Insecta | Hemiptera | Coccoidea | Pseudococcidae | Cryptoripersia | Cryptoripersia myrmecophila | Cryptoripersia myrmecophylla  COL:5T6MX | COL:N | COL:RT | COL:H6 | COL:HP | COL:4YN | COL:F8D | COL:3WSG | COL:3222D | COL:3222B unranked | kingdom | phylum | class | order | superfamily | family | genus | species | species      
jhpoelen commented 2 years ago

Note that it takes a while (30-60 min) to initially index the Catalogue of Life for offline processing.

jhpoelen commented 2 years ago

GloBI successfully used the offline-enabled Catalogue of Life (COL) taxon resolver to add COL links to taxa. Closing issue.

@mdoering @seltmann please let me know if you have any comments, questions, or suggestions about this new way to quickly and reliably match thousands of names without relying on an internet connection.