globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0

create an edited corpus of all known taxonomies in a single citable digital data volume: e.g., All Names of the World Vol. 2 #48

Closed jhpoelen closed 2 years ago

jhpoelen commented 2 years ago

Compiling the available taxonomic resources takes time:

GBIF backbone is at . . . DiscoverLife is at . . . ITIS is at . . .

Would it be an idea to create a corpus of taxonomic datasets of known provenance?

I am sure that we've all built similar things independently. . . I'd be curious how Global Names and Catalogue of Life manage their resources.

@seltmann @dimus @mjy

dimus commented 2 years ago

@jhpoelen, yes, it would be very nice to have a place that holds all the resources we know about, and where people can also submit their own resources. Such a system would probably have a technical as well as a human-relations component. Of course, the best would be not just a metadata store but also a platform where a resource can be harvested, converted into an internal data structure that is the same for all resources, and exported in a range of formats, such as DwCA, TCS, and even some custom JSON and CSV.
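
A minimal sketch of the "one internal structure, many output formats" idea described above. The `NameRecord` fields, the function names, and the sample values are all hypothetical, chosen only for illustration; they are not taken from Global Names or Nomer.

```python
import csv
import io
import json
from dataclasses import dataclass, asdict, fields


@dataclass
class NameRecord:
    """Hypothetical common structure shared by all harvested resources."""
    source: str        # resource the record was harvested from
    external_id: str   # identifier used by that resource
    name: str          # taxonomic name as given by the resource
    rank: str = ""     # optional rank


def to_csv(records):
    """Serialize records as CSV text with a header row."""
    out = io.StringIO()
    columns = [f.name for f in fields(NameRecord)]
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return out.getvalue()


def to_json(records):
    """Serialize the same records as a JSON array."""
    return json.dumps([asdict(record) for record in records], indent=2)


# Placeholder data, not a real taxon record.
records = [NameRecord("example-resource", "EX:1", "Aus bus", "species")]
print(to_csv(records))
print(to_json(records))
```

Because every exporter reads the same record type, adding another output format (DwCA or TCS, say) would mean adding one more function rather than one converter per resource.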

dimus commented 2 years ago

Global Names still uses a Ruby script to normalize resources: https://github.com/GlobalNamesArchitecture/dwca_hunter

One more thought. I think the most amazing user interface in the world is an electric switch in my room. I need light, I flip the switch, I have light. What is happening behind the scenes is absolutely mind-boggling. It would be nice for the interface of an 'all biodata of the world' project to get as close as we can to the 'electric switch' interface.

mjy commented 2 years ago

I think this is an interesting thought experiment. It feels very fractal.

It's fairly straightforward to explore the limitations of this thought experiment:

Perhaps what is really being asked is: how do we strengthen a practical approximation of this question, e.g. @dimus's approach in dwca_hunter? [EDIT] - I see from the first comment that this, rather than the issue title, is the actual goal.

dimus commented 2 years ago

Different cases of datasets that I encounter:

  1. DwCA files at a stable URL
  2. Files in various formats at a stable URL
  3. Files in various formats at a stable URL behind some kind of 'wall' that requires human effort
  4. Data that require manual effort (usually by a particular person) to get, in a stable format of some sort
  5. Data that require manual effort to get, in an unstable format
  6. "Dead" data that now exist only in a database
  7. Data obtained by walking an API
  8. Data obtained by scraping a web page

I imagine that the scope would involve 1 and 2, with all other formats 'normalized' to 2

jhpoelen commented 2 years ago

Perhaps we can start with a shopping list of filesets with minimal descriptors.

E.g., see #49

name: Index Fungorum
files:
 -  https://uofi.box.com/shared/static/54l3b7h4q4pwqq4fgqvx42h3d328fl1c.csv
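
A rough sketch of how a tool might read a descriptor of the shape shown above. `parse_descriptor` is a hypothetical helper, not part of Nomer; it is not a YAML parser and only understands the `name` plus `files` shape used in this example.

```python
def parse_descriptor(text):
    """Parse the minimal 'name' + 'files' descriptor shape (assumed format)."""
    descriptor = {"name": None, "files": []}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("name:"):
            # Everything after the key is the human-readable resource name.
            descriptor["name"] = line[len("name:"):].strip()
        elif line.startswith("-"):
            # Each list item is one file location.
            descriptor["files"].append(line[1:].strip())
    return descriptor


example = """\
name: Index Fungorum
files:
 -  https://uofi.box.com/shared/static/54l3b7h4q4pwqq4fgqvx42h3d328fl1c.csv
"""

descriptor = parse_descriptor(example)
print(descriptor["name"], len(descriptor["files"]))
```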

jhpoelen commented 2 years ago

> I imagine that the scope would involve 1 and 2, with all other formats 'normalized' to 2

@dimus yes!

jhpoelen commented 2 years ago

Here's a list I keep for nomer:

$ nomer properties
nomer.append.schema.output.example.taxon.rank.order=[{"column":0,"type":"path.order.id"},{"column": 1,"type":"path.order.name"},{"column": 2,"type":"path.order"}]
nomer.append.schema.output=
nomer.cache.dir=./.nomer
nomer.doi.cache.url=
nomer.doi.min.match.score=100
nomer.eol.taxon=gz:https://zenodo.org/record/3834881/files/taxon.tab.gz!/taxon.tab
nomer.gbif.ids=gz:https://zenodo.org/record/5222044/files/gbif-backbone-by-id.tsv.gz!/gbif-backbone-by-id.tsv
nomer.gbif.names=gz:https://zenodo.org/record/5222044/files/gbif-backbone-by-name.tsv.gz!/gbif-backbone-by-name.tsv
nomer.indexfungorum.export=https://uofi.box.com/shared/static/54l3b7h4q4pwqq4fgqvx42h3d328fl1c.csv
nomer.itis.synonym_links=gz:https://zenodo.org/record/3833105/files/synonym_links.gz!/synonym_links
nomer.itis.taxon_unit_types=gz:https://zenodo.org/record/3833105/files/taxon_unit_types.gz!/taxon_unit_types
nomer.itis.taxonomic_units=gz:https://zenodo.org/record/3833105/files/taxonomic_units.gz!/taxonomic_units
nomer.ncbi.merged=tar:gz:https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz!/taxdump.tar!/merged.dmp
nomer.ncbi.names=tar:gz:https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz!/taxdump.tar!/names.dmp
nomer.ncbi.nodes=tar:gz:https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz!/taxdump.tar!/nodes.dmp
nomer.nodc.url=tar:gz:https://www.nodc.noaa.gov/cgi-bin/OAS/prd/download/50418.1.1.tar.gz!/50418.1.1.tar!/0050418/1.1/data/0-data/NODC_TaxonomicCode_V8_CD-ROM/TAXBRIEF.DAT
nomer.plazi.treatments.archive=https://github.com/plazi/treatments-rdf/archive/master.zip
nomer.pmid2doi.cache.url=ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz
nomer.schema.input=[{"column":0,"type":"externalId"},{"column": 1,"type":"name"}]
nomer.schema.output=[{"column":0,"type":"externalId"},{"column": 1,"type":"name"}]
nomer.taxon.name.correction.url=https://github.com/globalbioticinteractions/globi-taxon-names/raw/main/taxon-name-mapping.csv
nomer.taxon.name.stopword.url=https://github.com/globalbioticinteractions/globi-taxon-names/raw/main/non-taxon-words.txt
nomer.taxon.rank.cache.url=
nomer.taxon.rank.map.url=
nomer.term.cache.url=https://zenodo.org/record/5526782/files/taxonCache.tsv.gz
nomer.term.map.maxLinksPerTerm=125
nomer.term.map.url=https://zenodo.org/record/5526782/files/taxonMap.tsv.gz
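
Several of the property values above appear to layer wrapper prefixes (`tar:`, `gz:`) in front of a URL and use `!/` to point at a file inside an archive. The sketch below splits such a value into its parts. This reading of the notation is an assumption based only on the values listed here, and `split_locator` is a hypothetical helper, not a Nomer function.

```python
# Wrapper prefixes seen in the listing above (assumed to be the full set).
WRAPPERS = ("tar", "gz")


def split_locator(value):
    """Split a layered locator into (wrappers, base_url, inner_paths)."""
    wrappers = []
    rest = value
    while True:
        prefix, sep, remainder = rest.partition(":")
        if sep and prefix in WRAPPERS:
            wrappers.append(prefix)
            rest = remainder
        else:
            break
    # '!/' separates the outer URL from paths inside nested archives.
    base_url, *inner_paths = rest.split("!/")
    return wrappers, base_url, inner_paths


wrappers, base_url, inner_paths = split_locator(
    "tar:gz:https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz"
    "!/taxdump.tar!/names.dmp"
)
print(wrappers, base_url, inner_paths)
```

A plain URL with no wrappers comes back with empty `wrappers` and `inner_paths`, so the same function can be applied to every value in the list.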

jhpoelen commented 2 years ago

@dimus @mjy @seltmann @mielliot I am happy to announce:

Poelen, Jorrit H. (2021). Nomer Corpus of Taxonomic Resources hash://sha256/bb6dac6461b66212c5b1826447d7765529ff5cbadeac1915f7c3be9748eda991 (0.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5639794

I am working to integrate this with Nomer, so that even online resources can be accessed in a location-independent manner.
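
One way to read "location-independent" is content addressing: a resource is identified by the hash of its bytes, as in the `hash://sha256/...` identifier in the citation above, so any copy can be checked no matter where it was fetched from. A minimal sketch of that idea follows; `hash_uri` and `matches` are hypothetical helpers, and how Nomer actually resolves such identifiers is not shown in this thread.

```python
import hashlib


def hash_uri(content: bytes) -> str:
    """Build a hash://sha256/ identifier from the content's SHA-256 digest."""
    return "hash://sha256/" + hashlib.sha256(content).hexdigest()


def matches(content: bytes, expected_uri: str) -> bool:
    """Check that a retrieved copy is the content the identifier names."""
    return hash_uri(content) == expected_uri


copy_from_anywhere = b"example bytes"  # stand-in for a downloaded file
identifier = hash_uri(copy_from_anywhere)
print(identifier, matches(copy_from_anywhere, identifier))
```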

There's a bunch of taxonomies missing still but hey . . . you gotta start somewhere.

Please see https://github.com/globalbioticinteractions/nomer-corpus-builder for the Makefile that helped create this publication.

jhpoelen commented 2 years ago

Corpus now in active use by Nomer v0.2.8+