create an edited corpus of all known taxonomies in a single citable digital data volume: e.g., All Names of the World Vol. 2

jhpoelen commented 2 years ago

Compiling the available taxonomic resources takes time:

GBIF backbone is at . . . DiscoverLife is at . . . ITIS is at . . .

Would it be an idea to create a corpus of known taxonomic datasets of known provenance?

I am sure that we've all built similar things independently. . . I'd be curious how Global Names and Catalogue of Life manage their resources.

@seltmann @dimus @mjy

dimus commented 2 years ago

@jhpoelen, yes, it would be very nice to have a place that holds all the resources we know about, and where people can also submit their own resources. Such a system would probably have a technical as well as a human-relations component. Ideally it would be not just a metadata store but also a platform where a resource can be harvested, converted into an internal data structure that is the same for all resources, and exported in a bunch of different formats, like DwCA, TCS, and even some custom JSON and CSV.
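
The harvest, convert, export idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `NameRecord` fields, the function names, and the sample row are hypothetical and are not taken from Global Names or any existing tool. Only CSV and JSON output are shown; DwCA and TCS would need their own exporters.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass
from typing import Iterable, List


@dataclass
class NameRecord:
    """Hypothetical internal record shared by all harvested resources."""
    source: str        # label of the resource the record came from
    external_id: str   # identifier used by that resource
    name: str          # scientific name string
    rank: str = ""     # taxonomic rank, empty if unknown


def from_rows(source: str, rows: Iterable[dict]) -> List[NameRecord]:
    """Convert resource-specific rows (assumed keys: id, name, rank)."""
    return [
        NameRecord(source, str(row["id"]), row["name"], row.get("rank", ""))
        for row in rows
    ]


def to_csv(records: Iterable[NameRecord]) -> str:
    """Serialize records as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["source", "external_id", "name", "rank"],
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


def to_json(records: Iterable[NameRecord]) -> str:
    """Serialize records as a JSON array of objects."""
    return json.dumps([asdict(record) for record in records], sort_keys=True)


# Made-up sample input, only to show the flow from rows to two formats.
records = from_rows("example", [{"id": 1, "name": "Homo sapiens", "rank": "species"}])
print(to_csv(records))
print(to_json(records))
```

Because every exporter reads the same record type, adding a new output format would not require touching the harvesting code.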

dimus commented 2 years ago

Global Names still uses a Ruby script to normalize resources: https://github.com/GlobalNamesArchitecture/dwca_hunter

One more thought. I think the most amazing user interface in the world is an electric switch in my room. I need light, I flip the switch, I have light. What is happening behind the scenes is absolutely mind-boggling. It would be nice for the interface of an 'all biodata of the world' project to get as close as we can to the 'electric switch' interface.

mjy commented 2 years ago

I think this is an interesting thought experiment. It feels very fractal.

It's fairly straightforward to explore the limitations of this thought experiment:

How will you define what's in the data-set? For example, by the time you start crawling, there will already be new datasets.
How will you identify candidates for inclusion in the data-set that are/not humanly chosen?
Do previous versions of the index of everything count as a data-set to be included in the next data-set? Why/not?
What happens when someone, just for LOLs, generates 400GB of random names that take just enough examination to recognize as a joke? Are you grabbing that too?
How will you deal with vastly different metadata for your sources?
What protocols will you use to request data for ingestion? IPFS? IRC? HTTPS? File::IO? BitTorrent? Floppy disk controllers? How will you decide to/not use protocol X?

Perhaps what is really being asked is how we strengthen a practical approximation of this idea, e.g. @dimus's approach in dwca_hunter? [EDIT] - I see from the first comment that this is more the goal than the issue title suggests.

dimus commented 2 years ago

Different cases of datasets that I encounter:

1. DwCA files at a stable URL
2. Files in various formats at a stable URL
3. Files in various formats at a stable URL behind some kind of a 'wall' that requires human effort
4. Data that require a manual effort (usually of a particular person) to get, with a stable format of some sort
5. Data that require a manual effort, with an unstable format
6. "Dead" data that now exists only in the database
7. Data received by walking an API
8. Data received by scraping a web page

I imagine that the scope would involve 1 and 2, with all other formats 'normalized' to 2
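
As a rough sketch of that scoping rule, the eight cases can be numbered in the order listed and mapped to the case they would end up as inside the corpus. The enum member names and the `corpus_target` function are hypothetical labels introduced here, not part of dwca_hunter or Nomer.

```python
from enum import IntEnum


class AccessCase(IntEnum):
    """The eight cases listed above, numbered in the order given."""
    DWCA_STABLE_URL = 1
    FILES_STABLE_URL = 2
    FILES_BEHIND_WALL = 3
    MANUAL_STABLE_FORMAT = 4
    MANUAL_UNSTABLE_FORMAT = 5
    DEAD_DATABASE_ONLY = 6
    API_WALK = 7
    WEB_SCRAPE = 8


def corpus_target(case: AccessCase) -> AccessCase:
    """Return the case a dataset should end up as inside the corpus.

    Cases 1 and 2 are in scope as they are; every other case is
    assumed to be republished as files at a stable URL (case 2).
    """
    if case in (AccessCase.DWCA_STABLE_URL, AccessCase.FILES_STABLE_URL):
        return case
    return AccessCase.FILES_STABLE_URL
```

The actual normalization work (snapshotting an API walk or a scrape into files, then publishing them) is the hard part and is not shown here.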

jhpoelen commented 2 years ago

Perhaps we can start with a shopping list of filesets with minimal descriptors.

E.g., see #49

name: Index Fungorum
files:
 -  https://uofi.box.com/shared/static/54l3b7h4q4pwqq4fgqvx42h3d328fl1c.csv

jhpoelen commented 2 years ago

> I imagine that the scope would involve 1 and 2, with all other formats 'normalized' to 2

@dimus yes!

jhpoelen commented 2 years ago

Here's a list I keep for nomer:

$ nomer properties
nomer.append.schema.output.example.taxon.rank.order=[{"column":0,"type":"path.order.id"},{"column": 1,"type":"path.order.name"},{"column": 2,"type":"path.order"}]
nomer.append.schema.output=
nomer.cache.dir=./.nomer
nomer.doi.cache.url=
nomer.doi.min.match.score=100
nomer.eol.taxon=gz:https://zenodo.org/record/3834881/files/taxon.tab.gz!/taxon.tab
nomer.gbif.ids=gz:https://zenodo.org/record/5222044/files/gbif-backbone-by-id.tsv.gz!/gbif-backbone-by-id.tsv
nomer.gbif.names=gz:https://zenodo.org/record/5222044/files/gbif-backbone-by-name.tsv.gz!/gbif-backbone-by-name.tsv
nomer.indexfungorum.export=https://uofi.box.com/shared/static/54l3b7h4q4pwqq4fgqvx42h3d328fl1c.csv
nomer.itis.synonym_links=gz:https://zenodo.org/record/3833105/files/synonym_links.gz!/synonym_links
nomer.itis.taxon_unit_types=gz:https://zenodo.org/record/3833105/files/taxon_unit_types.gz!/taxon_unit_types
nomer.itis.taxonomic_units=gz:https://zenodo.org/record/3833105/files/taxonomic_units.gz!/taxonomic_units
nomer.ncbi.merged=tar:gz:https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz!/taxdump.tar!/merged.dmp
nomer.ncbi.names=tar:gz:https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz!/taxdump.tar!/names.dmp
nomer.ncbi.nodes=tar:gz:https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz!/taxdump.tar!/nodes.dmp
nomer.nodc.url=tar:gz:https://www.nodc.noaa.gov/cgi-bin/OAS/prd/download/50418.1.1.tar.gz!/50418.1.1.tar!/0050418/1.1/data/0-data/NODC_TaxonomicCode_V8_CD-ROM/TAXBRIEF.DAT
nomer.plazi.treatments.archive=https://github.com/plazi/treatments-rdf/archive/master.zip
nomer.pmid2doi.cache.url=ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz
nomer.schema.input=[{"column":0,"type":"externalId"},{"column": 1,"type":"name"}]
nomer.schema.output=[{"column":0,"type":"externalId"},{"column": 1,"type":"name"}]
nomer.taxon.name.correction.url=https://github.com/globalbioticinteractions/globi-taxon-names/raw/main/taxon-name-mapping.csv
nomer.taxon.name.stopword.url=https://github.com/globalbioticinteractions/globi-taxon-names/raw/main/non-taxon-words.txt
nomer.taxon.rank.cache.url=
nomer.taxon.rank.map.url=
nomer.term.cache.url=https://zenodo.org/record/5526782/files/taxonCache.tsv.gz
nomer.term.map.maxLinksPerTerm=125
nomer.term.map.url=https://zenodo.org/record/5526782/files/taxonMap.tsv.gz
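
Several values in this listing stack prefixes such as `tar:gz:` in front of a URL and use `!/` to point at a file inside an archive. The sketch below reads the listing into a dictionary and splits such a value into its parts. The interpretation of the prefixes and of `!/` is an assumption based only on how the values look; how Nomer itself resolves them is not shown in this thread, and both function names are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_properties(text: str) -> Dict[str, str]:
    """Parse `key=value` lines, splitting on the first '=' only."""
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("$") or "=" not in line:
            continue  # skip blank lines and the shell prompt line
        key, value = line.split("=", 1)
        properties[key] = value
    return properties


def split_layered_url(value: str) -> Tuple[List[str], str, List[str]]:
    """Split a value into (wrappers, base_url, inner_paths).

    Assumption: only the prefixes seen in the listing ('gz', 'tar')
    act as wrappers, and '!/' separates the base URL from paths
    inside the wrapped archive.
    """
    wrappers = []
    rest = value
    while True:
        prefix, separator, tail = rest.partition(":")
        if separator and prefix in ("gz", "tar"):
            wrappers.append(prefix)
            rest = tail
        else:
            break
    base_url, *inner_paths = rest.split("!/")
    return wrappers, base_url, inner_paths


# Value copied from the listing above.
example = (
    "tar:gz:https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz"
    "!/taxdump.tar!/names.dmp"
)
print(split_layered_url(example))
```

A plain URL without prefixes or `!/` comes back with an empty wrapper list and no inner paths, so the same function can be applied to every value in the listing.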

jhpoelen commented 2 years ago

@dimus @mjy @seltmann @mielliot I am happy to announce:

Poelen, Jorrit H. (2021). Nomer Corpus of Taxonomic Resources hash://sha256/bb6dac6461b66212c5b1826447d7765529ff5cbadeac1915f7c3be9748eda991 (0.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5639794

I am working to integrate this with Nomer, so that even online resources can be accessed in a location-independent manner.
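
One way to read "location independent" is that a resource is requested by the hash of its content rather than by the URL it happens to live at. The sketch below shows that idea with an identifier in the same `hash://sha256/...` form as the citation above. The `ContentStore` class is a hypothetical in-memory stand-in; it is an assumption for illustration and does not describe how Nomer stores or resolves content.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Build a hash://sha256/ identifier from the bytes themselves."""
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()


class ContentStore:
    """Toy in-memory store keyed by content identifier (hypothetical)."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        """Store the bytes and return their content identifier."""
        key = content_id(data)
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        """Return the bytes for a key, verifying they still match it."""
        data = self._blobs[key]
        if content_id(data) != key:
            raise ValueError("content does not match its identifier")
        return data


store = ContentStore()
key = store.put(b"example bytes standing in for a taxonomy file")
print(key)
```

Since the identifier is derived from the bytes, any copy of the file, wherever it is hosted, can be checked against the identifier before use.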

There's a bunch of taxonomies missing still but hey . . . you gotta start somewhere.

Please see https://github.com/globalbioticinteractions/nomer-corpus-builder for the Makefile that helped create this publication.

jhpoelen commented 2 years ago

Corpus now in active use by Nomer v0.2.8+

globalbioticinteractions / nomer

create an edited corpus of all known taxonomies in a single citable digital data volume: e.g., All Names of the World Vol. 2 #48