@jhpoelen, yes, it would be very nice to have a place that holds all the resources we know about, and where people can also submit their own resources. Such a system would probably have a technical as well as a human-relations component. Of course, the best would be not just a metadata store but also a platform where a resource can be harvested, converted into an internal data structure that is the same for all resources, and then spewed out in a bunch of different formats, like DwCA, TCS, and even some custom JSON and CSV.
Global Names still uses a Ruby script to normalize resources: https://github.com/GlobalNamesArchitecture/dwca_hunter
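As a rough illustration of the "one internal structure, many output formats" idea, here is a minimal Python sketch. The `NameRecord` fields and the `to_csv` / `to_json` helpers are hypothetical and chosen only for illustration; they are not the data model of Global Names or of any existing tool, and real DwCA or TCS output would need considerably more than this.

```python
import csv
import io
import json
from dataclasses import dataclass, asdict


@dataclass
class NameRecord:
    """Hypothetical internal record shared by all harvested resources."""
    external_id: str
    name: str
    rank: str = ""


def to_csv(records):
    """Serialize records to CSV text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["external_id", "name", "rank"], lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return out.getvalue()


def to_json(records):
    """Serialize the same records to a JSON array of objects."""
    return json.dumps([asdict(record) for record in records])


# Made-up example record, not taken from any real resource.
records = [NameRecord("EX:1", "Homo sapiens", "species")]
print(to_csv(records))
print(to_json(records))
```

Each additional output format would then be one more function over the same list of records, which is the property that makes a shared internal structure attractive.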
One more thought. I think the most amazing user interface in the world is the electric switch in my room. I need light, I flip the switch, I have light. What is happening behind the scenes is absolutely mind-boggling. It would be nice for the interface of an 'all biodata of the world' project to get as close as we can to the 'electric switch' interface.
I think this is an interesting thought experiment. It feels very fractal.
It's fairly straightforward to explore the limitations of this thought experiment:
Perhaps what is really being asked is how we can strengthen a practical approximation of this question, e.g., @dimus's approach in dwca_hunter? [EDIT] - I see from the first comment that this is the actual goal, more so than the issue title.
Different cases of datasets that I encounter
I imagine that the scope would involve 1 and 2, with all other formats 'normalized' to 2
Perhaps we can start with a shopping list of filesets with minimal descriptors.
E.g., see #49
name: Index Fungorum
files:
- https://uofi.box.com/shared/static/54l3b7h4q4pwqq4fgqvx42h3d328fl1c.csv
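A minimal sketch of how such a descriptor could be read, assuming the format stays as small as the example above (a `name:` line plus a `files:` list of URLs). The `parse_fileset` helper is hypothetical; a real implementation would more likely use a proper YAML parser.

```python
def parse_fileset(text):
    """Parse a minimal 'name:' / 'files:' descriptor into a dict."""
    fileset = {"name": None, "files": []}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("name:"):
            fileset["name"] = line[len("name:"):].strip()
        elif line.startswith("- "):
            # Every list item is assumed to belong to 'files:'.
            fileset["files"].append(line[2:].strip())
        # The bare 'files:' header carries no value and is skipped.
    return fileset


descriptor = """\
name: Index Fungorum
files:
- https://uofi.box.com/shared/static/54l3b7h4q4pwqq4fgqvx42h3d328fl1c.csv
"""

fileset = parse_fileset(descriptor)
print(fileset["name"], len(fileset["files"]))
```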
> I imagine that the scope would involve 1 and 2, with all other formats 'normalized' to 2
@dimus yes!
Here's a list I keep for nomer:
$ nomer properties
nomer.append.schema.output.example.taxon.rank.order=[{"column":0,"type":"path.order.id"},{"column": 1,"type":"path.order.name"},{"column": 2,"type":"path.order"}]
nomer.append.schema.output=
nomer.cache.dir=./.nomer
nomer.doi.cache.url=
nomer.doi.min.match.score=100
nomer.eol.taxon=gz:https://zenodo.org/record/3834881/files/taxon.tab.gz!/taxon.tab
nomer.gbif.ids=gz:https://zenodo.org/record/5222044/files/gbif-backbone-by-id.tsv.gz!/gbif-backbone-by-id.tsv
nomer.gbif.names=gz:https://zenodo.org/record/5222044/files/gbif-backbone-by-name.tsv.gz!/gbif-backbone-by-name.tsv
nomer.indexfungorum.export=https://uofi.box.com/shared/static/54l3b7h4q4pwqq4fgqvx42h3d328fl1c.csv
nomer.itis.synonym_links=gz:https://zenodo.org/record/3833105/files/synonym_links.gz!/synonym_links
nomer.itis.taxon_unit_types=gz:https://zenodo.org/record/3833105/files/taxon_unit_types.gz!/taxon_unit_types
nomer.itis.taxonomic_units=gz:https://zenodo.org/record/3833105/files/taxonomic_units.gz!/taxonomic_units
nomer.ncbi.merged=tar:gz:https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz!/taxdump.tar!/merged.dmp
nomer.ncbi.names=tar:gz:https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz!/taxdump.tar!/names.dmp
nomer.ncbi.nodes=tar:gz:https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz!/taxdump.tar!/nodes.dmp
nomer.nodc.url=tar:gz:https://www.nodc.noaa.gov/cgi-bin/OAS/prd/download/50418.1.1.tar.gz!/50418.1.1.tar!/0050418/1.1/data/0-data/NODC_TaxonomicCode_V8_CD-ROM/TAXBRIEF.DAT
nomer.plazi.treatments.archive=https://github.com/plazi/treatments-rdf/archive/master.zip
nomer.pmid2doi.cache.url=ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz
nomer.schema.input=[{"column":0,"type":"externalId"},{"column": 1,"type":"name"}]
nomer.schema.output=[{"column":0,"type":"externalId"},{"column": 1,"type":"name"}]
nomer.taxon.name.correction.url=https://github.com/globalbioticinteractions/globi-taxon-names/raw/main/taxon-name-mapping.csv
nomer.taxon.name.stopword.url=https://github.com/globalbioticinteractions/globi-taxon-names/raw/main/non-taxon-words.txt
nomer.taxon.rank.cache.url=
nomer.taxon.rank.map.url=
nomer.term.cache.url=https://zenodo.org/record/5526782/files/taxonCache.tsv.gz
nomer.term.map.maxLinksPerTerm=125
nomer.term.map.url=https://zenodo.org/record/5526782/files/taxonMap.tsv.gz
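Many of the values above wrap a plain URL in prefixes such as `tar:gz:` and append `!/`-separated paths that point inside the archive. The sketch below shows one way to take such a value apart. It is based only on my reading of the values listed here, not on Nomer's own parser; the `WRAPPERS` set and both helper functions are assumptions made for illustration.

```python
WRAPPERS = ("tar", "gz", "zip")  # assumed set of archive/compression prefixes


def parse_property(line):
    """Split a 'key=value' line on the first '='."""
    key, _, value = line.partition("=")
    return key, value


def parse_resource_url(value):
    """Split e.g. 'tar:gz:https://host/a.tar.gz!/a.tar!/f' into its parts."""
    outer, *inner_paths = value.split("!/")
    wrappers = []
    while True:
        scheme, sep, rest = outer.partition(":")
        if sep and scheme in WRAPPERS:
            wrappers.append(scheme)
            outer = rest
        else:
            break
    return {"wrappers": wrappers, "url": outer, "inner_paths": inner_paths}


line = (
    "nomer.ncbi.names=tar:gz:https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/"
    "taxdump.tar.gz!/taxdump.tar!/names.dmp"
)
key, value = parse_property(line)
parts = parse_resource_url(value)
print(key, parts["wrappers"], parts["url"], parts["inner_paths"])
```

A shared resource catalog could store the plain URL and the inner paths separately, leaving each tool free to decide how to unpack them.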
@dimus @mjy @seltmann @mielliot I am happy to announce:
Poelen, Jorrit H. (2021). Nomer Corpus of Taxonomic Resources hash://sha256/bb6dac6461b66212c5b1826447d7765529ff5cbadeac1915f7c3be9748eda991 (0.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5639794
I am working to integrate this with Nomer, so that even online resources can be accessed in a location-independent manner.
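To illustrate what location independence buys: an identifier of the form `hash://sha256/...`, as in the citation above, can be recomputed from the bytes alone, so a copy fetched from any mirror can be checked against it. This is a minimal sketch of that idea; `content_id` and `verify` are hypothetical helper names and do not show how Nomer itself resolves such identifiers.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Derive a hash://sha256/ identifier from the content itself."""
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected_id: str) -> bool:
    """Check that bytes retrieved from any location match the identifier."""
    return content_id(data) == expected_id


# Made-up stand-in for the bytes of a real taxonomic resource.
original = b"id\tname\n1\tHomo sapiens\n"
identifier = content_id(original)

print(verify(original, identifier))         # same bytes, any location
print(verify(original + b"x", identifier))  # altered copy
```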
There's a bunch of taxonomies missing still but hey . . . you gotta start somewhere.
Please see https://github.com/globalbioticinteractions/nomer-corpus-builder for the Makefile that helped create this publication.
Corpus now in active use by Nomer v0.2.8+
Compiling the available taxonomic resources takes time:
GBIF backbone is at . . . DiscoverLife is at . . . ITIS is at . . .
Would it be an idea to create a corpus of known taxonomic datasets of known provenance?
I am sure that we've all built similar things independently. . . I'd be curious how Global Names and Catalogue of Life manage their resources.
@seltmann @dimus @mjy