globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0

Add matcher for Index Fungorum #49

Closed nleguillarme closed 3 years ago

nleguillarme commented 3 years ago

Hi, it would be nice to add a matcher for the Index Fungorum taxonomy: http://www.indexfungorum.org/names/names.asp

jhpoelen commented 3 years ago

@nleguillarme great idea! Would you happen to know how to access all the data associated with Index Fungorum?

Please note that Global Names already has some support for resolving Index Fungorum names, but unfortunately this is not yet available for offline processing (@dimus).

nleguillarme commented 3 years ago

Hi @jhpoelen, Index Fungorum seems to have an API for resolving both names and ids: http://www.indexfungorum.org/ixfwebservice/fungus.asmx

dimus commented 3 years ago

@nleguillarme @jhpoelen I write to Paul Kirk periodically, and he sends me somewhat idiosyncratic data that I convert and import into https://verifier.globalnames.org. The data can be found in the data dump http://opendata.globalnames.org/dumps/gnames-2020-11-28.tar.gz

I am going to update his data in the next month or two.

dimus commented 3 years ago

The location of the most recent data received from Paul Kirk can always be found in this file: https://github.com/GlobalNamesArchitecture/dwca_hunter/blob/master/lib/dwca_hunter/resources/index-fungorum.rb

Currently the data is at https://uofi.box.com/shared/static/54l3b7h4q4pwqq4fgqvx42h3d328fl1c.csv

jhpoelen commented 3 years ago

@nleguillarme @dimus thanks for the info and ideas.

I made a first pass at offline-enabled Index Fungorum support and got results like:

$ echo "IF:177054" | nomer append indexfungorum
using matcher [indexfungorum]
[INDEX_FUNGORUM] taxonomy importing...
caching [https://uofi.box.com/shared/static/54l3b7h4q4pwqq4fgqvx42h3d328fl1c.csv] at [/media/jorrit/branta/nomer/1967320cc97d53ea9343a0611907accbb27344f4f4975d050d7aa7ea4486b80e.gz]...
Cookie rejected: "$Version=0; box_visitor_id=61805741abb017.31353894; $Path=/; $Domain=.box.com". Domain attribute ".box.com" violates RFC 2109: host minus domain may not contain any dots
Cookie rejected: "$Version=0; site_preference=desktop; $Path=/; $Domain=.box.com". Domain attribute ".box.com" violates RFC 2109: host minus domain may not contain any dots
Cookie rejected: "$Version=0; b=e8aa55fa47b0dde6f0f9bc54c0af9fb97375616c53c8a614f57524e29798f890; $Path=/; $Domain=.public.boxcloud.com". Illegal domain attribute ".public.boxcloud.com". Domain of origin: "public.boxcloud.com"
caching [https://uofi.box.com/shared/static/54l3b7h4q4pwqq4fgqvx42h3d328fl1c.csv] at [/media/jorrit/branta/nomer/1967320cc97d53ea9343a0611907accbb27344f4f4975d050d7aa7ea4486b80e.gz] done.
using cached [https://uofi.box.com/shared/static/54l3b7h4q4pwqq4fgqvx42h3d328fl1c.csv] at [/media/jorrit/branta/nomer/1967320cc97d53ea9343a0611907accbb27344f4f4975d050d7aa7ea4486b80e.gz]
cache with [547875] items built in [423.3] s or [1294.3] items/s.
[INDEX_FUNGORUM] taxonomy imported.
IF:177054   SYNONYM_OF  IF:808518   Leucocybe candicans         Fungi | Basidiomycota | Agaricomycotina | Agaricomycetes | Agaricomycetidae | Agaricales | Incertae sedis       kingdom | phylum | subphylum | class | subclass | order | family    http://www.indexfungorum.org/names/NamesRecord.asp?RecordID=808518  
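
The last columns of the row above hold a `|`-separated lineage and a matching `|`-separated list of rank names. As a minimal sketch (not part of Nomer; the helper name `lineage_by_rank` is hypothetical), the two strings can be paired up in Python so that a single rank can be looked up by name:

```python
def lineage_by_rank(path: str, path_ranks: str, sep: str = "|") -> dict[str, str]:
    """Pair each rank name with the taxon name at the same position.

    Assumes both strings use the same separator and have the same
    number of entries, as in the example row from the thread.
    """
    names = [part.strip() for part in path.split(sep)]
    ranks = [part.strip() for part in path_ranks.split(sep)]
    if len(names) != len(ranks):
        raise ValueError(f"{len(names)} names but {len(ranks)} ranks")
    return dict(zip(ranks, names))


# Values copied from the output row for IF:177054 above.
path = ("Fungi | Basidiomycota | Agaricomycotina | Agaricomycetes | "
        "Agaricomycetidae | Agaricales | Incertae sedis")
path_ranks = "kingdom | phylum | subphylum | class | subclass | order | family"

lineage = lineage_by_rank(path, path_ranks)
print(lineage["order"])   # Agaricales
print(lineage["family"])  # Incertae sedis
```

The length check is a deliberate choice: a lineage whose names and ranks are out of step would otherwise be silently mislabeled by `zip`.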

Note that you can also list all the Index Fungorum names using:

$ nomer dump indexfungorum 
using matcher [indexfungorum]
[INDEX_FUNGORUM] taxonomy already indexed at [/media/jorrit/branta/nomer/index_fungorum/index_fungorum], no need to import.
IF:1    Michenera   SYNONYM_OF  IF:17976    Licrostroma         Fungi | Basidiomycota | Agaricomycotina | Agaricomycetes | Incertae sedis | Russulales | Peniophoraceae     kingdom | phylum | subphylum | class | subclass | order | family    http://www.indexfungorum.org/names/NamesRecord.asp?RecordID=17976   
IF:2    Abaphospora SYNONYM_OF  IF:3016 Massarina           Fungi | Ascomycota | Pezizomycotina | Dothideomycetes | Pleosporomycetidae | Pleosporales | Massarinaceae       kingdom | phylum | subphylum | class | subclass | order | family    http://www.indexfungorum.org/names/NamesRecord.asp?RecordID=3016    
IF:3    Abrothallomyces SYNONYM_OF  IF:4    Abrothallus         Fungi | Ascomycota | Pezizomycotina | Dothideomycetes | Incertae sedis | Abrothallales | Abrothallaceae     kingdom | phylum | subphylum | class | subclass | order | family    http://www.indexfungorum.org/names/NamesRecord.asp?RecordID=4   
IF:4    Abrothallus SYNONYM_OF  IF:4    Abrothallus         Fungi | Ascomycota | Pezizomycotina | Dothideomycetes | Incertae sedis | Abrothallales | Abrothallaceae     kingdom | phylum | subphylum | class | subclass | order | family    http://www.indexfungorum.org/names/NamesRecord.asp?RecordID=4   
IF:5    Absconditella   SYNONYM_OF  IF:5    Absconditella           Fungi | Ascomycota | Pezizomycotina | Lecanoromycetes | Ostropomycetidae | Ostropales | Stictidaceae        kingdom | phylum | subphylum | class | subclass | order | family    http://www.indexfungorum.org/names/NamesRecord.asp?RecordID=5   
IF:6    Abyssomyces SYNONYM_OF  IF:6    Abyssomyces         Fungi | Ascomycota | Pezizomycotina | Sordariomycetes | Incertae sedis | Incertae sedis | Incertae sedis        kingdom | phylum | subphylum | class | subclass | order | family    http://www.indexfungorum.org/names/NamesRecord.asp?RecordID=6   
IF:7    Acallomyces SYNONYM_OF  IF:7    Acallomyces         Fungi | Ascomycota | Pezizomycotina | Laboulbeniomycetes | Laboulbeniomycetidae | Laboulbeniales | Laboulbeniaceae      kingdom | phylum | subphylum | class | subclass | order | family    http://www.indexfungorum.org/names/NamesRecord.asp?RecordID=7   
IF:8    Acantharia  SYNONYM_OF  IF:8    Acantharia          Fungi | Ascomycota | Pezizomycotina | Dothideomycetes | Pleosporomycetidae | Venturiales | Venturiaceae     kingdom | phylum | subphylum | class | subclass | order | family    http://www.indexfungorum.org/names/NamesRecord.asp?RecordID=8   
IF:9    Acanthographina SYNONYM_OF  IF:24   Acanthothecis           Fungi | Ascomycota | Pezizomycotina | Lecanoromycetes | Ostropomycetidae | Ostropales | Graphidaceae        kingdom | phylum | subphylum | class | subclass | order | family    http://www.indexfungorum.org/names/NamesRecord.asp?RecordID=24  
IF:10   Acanthographis  SYNONYM_OF  IF:24   Acanthothecis           Fungi | Ascomycota | Pezizomycotina | Lecanoromycetes | Ostropomycetidae | Ostropales | Graphidaceae        kingdom | phylum | subphylum | class | subclass | order | family    http://www.indexfungorum.org/names/NamesRecord.asp?RecordID=24  
...
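
In the listing above, some rows map an id to a different id (IF:1 to IF:17976) while others map an id to itself (IF:4 to IF:4). A rough way to tally the two cases is sketched below. It assumes the dump is tab-separated and that the first five fields are provided id, provided name, relation, resolved id, and resolved name, which is how the rows appear here but is not confirmed by the thread. The function name `tally_mappings` is made up for illustration.

```python
from collections import Counter


def tally_mappings(lines):
    """Count rows whose resolved id equals the provided id versus rows
    that point to a different id.

    Assumes tab-separated rows with the provided id in field 0 and the
    resolved id in field 3; shorter rows are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue
        provided_id, resolved_id = fields[0], fields[3]
        key = "maps_to_itself" if provided_id == resolved_id else "maps_to_other"
        counts[key] += 1
    return counts


# Three rows abbreviated from the dump output above (first five fields only).
sample = [
    "IF:1\tMichenera\tSYNONYM_OF\tIF:17976\tLicrostroma",
    "IF:4\tAbrothallus\tSYNONYM_OF\tIF:4\tAbrothallus",
    "IF:9\tAcanthographina\tSYNONYM_OF\tIF:24\tAcanthothecis",
]

counts = tally_mappings(sample)
print(counts["maps_to_itself"], counts["maps_to_other"])  # 1 2
```

To run this over the full dump, the same function could be fed `sys.stdin` with the output of `nomer dump indexfungorum` piped in, provided the column assumption holds.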

Feeding the ~0.5M Index Fungorum names back into the matcher itself

$ nomer dump indexfungorum | nomer append indexfungorum | pv -l > /dev/null
using matcher [indexfungorum]
using matcher [indexfungorum]
[INDEX_FUNGORUM] taxonomy already indexed at [/media/jorrit/branta/nomer/index_fungorum/index_fungorum], no need to import.
[INDEX_FUNGORUM] taxonomy already indexed at [/media/jorrit/branta/nomer/index_fungorum/index_fungorum], no need to import.
 547k 0:01:16 [7.13k/s]

This took a little over 1 minute, with no need for internet connectivity.

Timing the two steps separately, first the dump:

$ time nomer dump indexfungorum | cut -f1,2 | pv -l | gzip > names.tsv.gz
using matcher [indexfungorum]
[INDEX_FUNGORUM] taxonomy already indexed at [/media/jorrit/branta/nomer/index_fungorum/index_fungorum], no need to import.
 547k 0:00:24 [22.3k/s] [                                           <=>                                                        ]

real    0m24.579s
user    0m58.959s
sys     0m4.415s

and then the match:

$ time zcat names.tsv.gz | nomer append indexfungorum | pv -l > /dev/null
using matcher [indexfungorum]
[INDEX_FUNGORUM] taxonomy already indexed at [/media/jorrit/branta/nomer/index_fungorum/index_fungorum], no need to import.
 547k 0:00:41 [13.1k/s] [                                                                               <=>                    ]

real    0m41.914s
user    1m19.796s
sys     0m7.136s

Note that this is all single-threaded and without any kind of optimization.
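
The throughput figures in the logs can be cross-checked with a little arithmetic. The sketch below assumes the dump emits one line per cached item, so the 547875 items reported during the import are used as the line count (`pv` only shows the rounded `547k`). The helper `parse_real` is a hypothetical name for parsing the `real` line printed by the shell's `time`.

```python
import re


def parse_real(value: str) -> float:
    """Convert a `time`-style duration such as '0m41.914s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def items_per_second(items: int, seconds: float) -> float:
    return items / seconds


ITEMS = 547_875  # from "cache with [547875] items built"; assumed equal to dumped lines

import_rate = items_per_second(ITEMS, 423.3)                    # initial import
dump_rate = items_per_second(ITEMS, parse_real("0m24.579s"))    # nomer dump
match_rate = items_per_second(ITEMS, parse_real("0m41.914s"))   # nomer append

print(round(import_rate, 1))  # 1294.3, as reported in the import log
print(f"dump:  {dump_rate / 1000:.1f}k lines/s")
print(f"match: {match_rate / 1000:.1f}k lines/s")
```

Under that assumption the computed dump and match rates agree with the `22.3k/s` and `13.1k/s` that `pv` reported.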

jhpoelen commented 3 years ago

The Index Fungorum matcher is now available in Nomer v0.2.5: https://github.com/globalbioticinteractions/nomer/releases/tag/0.2.5

@nleguillarme please review and confirm desired functionality by closing this issue.

jhpoelen commented 3 years ago

@nleguillarme closing the issue; please re-open it if you find any problems.

nleguillarme commented 3 years ago

Thank you @jhpoelen, it works like a charm.

jhpoelen commented 3 years ago

@nleguillarme thanks for trying out the Index Fungorum matcher. Happy to hear any suggestions for improvements, or about funny things that come up as you use the newly added taxonomic scheme.

Thanks again to @dimus for helping to find access to an easy-to-use version of Index Fungorum.