ropensci / taxizedb

Tools for Working with Taxonomic SQL Databases

EOL data source #39

Closed sckott closed 3 years ago

sckott commented 6 years ago

this should be the latest https://opendata.eol.org/dataset/tram-580-581/resource/bac4e11c-28ab-4038-9947-02d9f1b0329f

sckott commented 6 years ago

trying to use the file from https://opendata.eol.org/dataset/tram-580-581/resource/bac4e11c-28ab-4038-9947-02d9f1b0329f in branch eol - having a hard time reading the file

sckott commented 6 years ago

@cboettig FYI working on this. Have emailed EOL folks about above, waiting on response

KatjaSchulz commented 6 years ago

Sorry, this file is very much a working version. We weren't expecting it to be used by anyone but us for now. There will be an updated version available in a few weeks.

> meta.xml says it's \t separated, okay good, there are some quotes, which I assume we can ignore

Yes, the quotes should all be part of the source text.

> some lines end with \n, while others are terminated with ;\n

The lines should all end in \n. You may sometimes see something like ;\t\n if there is an entry in EOLidAnnotations and Landmark is empty. But the ; should never be at the very end of the line.

> there's at least one line that has two lines in it: 1623030

Yes, that was an error. It's fixed in the latest version.

> starting at line 2724677 there is a totally different set of columns with what looks like 8 columns, e.g.: syn1 -1275399 Ishmaridae family synonym Ishmaridae

That's a preliminary list of synonyms. It's the same set of columns, but there are values for different columns, e.g., there wouldn't be a value for parentNameUsageID, because synonyms don't have parent taxa. Many of the other columns don't have values either.
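The format points above (tab-separated fields, quotes that belong to the source text, synonym rows with empty columns) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the `COLUMNS` list and the sample rows are hypothetical stand-ins, not the real header or data of the EOL file, and `read_rows` is a made-up helper name.

```python
import csv
import io

# Hypothetical column list for illustration; the real file defines its own
# header (see its meta.xml).
COLUMNS = ["taxonID", "parentNameUsageID", "scientificName",
           "EOLidAnnotations", "Landmark"]

def read_rows(handle, columns):
    """Yield one dict per line of a tab-separated file.

    QUOTE_NONE keeps quote characters as literal text, since the quotes are
    part of the source text rather than field delimiters.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        # Rows such as synonyms may leave trailing columns empty or absent;
        # pad so every row has a value (possibly "") for each column.
        fields = fields + [""] * (len(columns) - len(fields))
        yield dict(zip(columns, fields))

# Made-up sample: one accepted taxon with an annotation and an empty
# Landmark, and one synonym row with no parentNameUsageID.
SAMPLE = (
    '1\t0\tCetacea "whales"\tnote;\t\n'
    'syn1\t\tIshmaridae\n'
)

rows = list(read_rows(io.StringIO(SAMPLE), COLUMNS))
```

When reading from disk, opening the file with `newline=""` lets the `csv` module handle line endings itself.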

sckott commented 6 years ago

thanks @KatjaSchulz for the feedback on this.

Am I correct in using this file? Or is there a different one I should use? I am going for all of EOL taxonomy data.

KatjaSchulz commented 6 years ago

This file is not "all of EOL taxonomy data." It's just a draft for a hierarchy we're working on to support navigation and clade-based queries in the upcoming new version of EOL. We currently don't have a file with all of our taxonomy data, but this will become available once the new site and APIs have launched.

sckott commented 6 years ago

@KatjaSchulz okay, thx for clarification. What proportion of the taxonomy is it if you happen to know?

KatjaSchulz commented 6 years ago

It's a single comprehensive hierarchy. But EOL manages hundreds of alternative hierarchies from many different providers, including Catalogue of Life, GBIF, WoRMS, NCBI, etc. You can get most of these hierarchies directly from the source.

sckott commented 6 years ago

thanks, but I'm not sure I understand. I'm not sure how to square

> It's a single comprehensive hierarchy.

with

> This file is not "all of EOL taxonomy data."

??

KatjaSchulz commented 6 years ago

EOL does not have a single "taxonomy." Instead, we manage hundreds of partially overlapping, sometimes contradictory hierarchies. So "all of EOL taxonomy data" would include all of these hierarchies. The dynamic working hierarchy merges data from several of our source hierarchies to create one hierarchy with broad coverage, but it does not include all taxa known to EOL.

To get a better idea of our taxonomy management, have a look at a Names tab on the current EOL web site, e.g. the one for the sperm whale: http://eol.org/pages/328547/names?all=1 This page provides an overview of all the different taxonomic hierarchies provided by our content partners for this species.

cboettig commented 6 years ago

Hi @KatjaSchulz ,

Thanks so much for the detailed replies and your advice on these issues. Yes, we're familiar with a good number of the different authorities providing taxonomic data and the subsequent challenges involved in mapping between the names. EOL has done great work on this, and we'd like to be able to build on the efforts of EOL as much as possible rather than reinventing the wheel.

For instance, the data displayed in your example, http://eol.org/pages/328547 , is a nice illustration of that synthesis you've done, and also information that could be quite naturally described in JSON-LD where it would be considerably easier to mobilize for scientific purposes than in HTML. Is that the plan with the upcoming release?

Like you say, it is straightforward enough to access data from at least some of these providers (though as you know, many provide only API access, and many, like WoRMS, are encumbered by agreements that discourage general-purpose requests). Such issues aside, as far as I know, there isn't a particularly obvious way to map across the different identifiers used by each of the authorities. Pages like your example suggest that EOL has at least an implicit mapping that species EOL:328547 is the same as ITIS:180488 and the same as NCBI:txid9755, yes? That's not a trivial mapping to make by accessing only ITIS and NCBI dumps separately (e.g. an NCBI species name may resolve to a name that ITIS lists only as invalid/synonym).

Also, while I understand your point about EOL not being "a taxonomy", it does seem there is an underlying 'dynamic taxonomy' in EOL that might be a fusion of existing taxonomies but isn't simply a 1:1 port of them. I think this results in EOL being treated like its own taxonomy, whether or not that was the intent. For example, you will find many EOL identifiers being used to describe species in the Global Biotic Interactions database, GloBI. E.g. here's a query in which we ask "what does organism EOL:328547 (i.e. sperm whales) eat?": https://www.globalbioticinteractions.org/?interactionType=interactsWith&sourceTaxon=EOL%3A328547
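To make the identifier-mapping point concrete, here is a minimal Python sketch of a crosswalk that groups identifiers declared equivalent. The three identifiers are the ones quoted in this comment; the pairs in `LINKS` are hand-written assumptions based on that example, not data pulled from any EOL dump, and `build_crosswalk` and `same_taxon` are hypothetical helper names.

```python
def build_crosswalk(links):
    """Group identifiers that any pair in `links` declares equivalent.

    Returns a dict mapping each identifier to the set of all identifiers
    in its group (including itself).
    """
    group_of = {}
    for a, b in links:
        # Merge whatever groups a and b already belong to.
        merged = group_of.get(a, {a}) | group_of.get(b, {b})
        # Re-point every member at the merged group so links stay transitive.
        for ident in merged:
            group_of[ident] = merged
    return group_of

def same_taxon(group_of, a, b):
    """True if a and b ended up in the same equivalence group."""
    return b in group_of.get(a, {a})

# Assumed equivalences, taken from the sperm whale example above.
LINKS = [
    ("EOL:328547", "ITIS:180488"),
    ("EOL:328547", "NCBI:txid9755"),
]

xwalk = build_crosswalk(LINKS)
itis_matches_ncbi = same_taxon(xwalk, "ITIS:180488", "NCBI:txid9755")
```

The ITIS and NCBI identifiers are never linked directly here; they only match because both are linked to the EOL page, which is the kind of implicit mapping that is hard to rebuild from the separate dumps.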

sckott commented 6 years ago

thanks @KatjaSchulz makes more sense now

KatjaSchulz commented 6 years ago

@cboettig Some of that information is also exposed in our current pages API, e.g.: http://eol.org/api/pages/1.0.json?id=328547&synonyms=true&taxonomy=true But we can definitely do better and are planning to expose all of the taxon mappings, with more detailed metadata in the next version of EOL.

cboettig commented 6 years ago

@KatjaSchulz Very cool! That's a super nice walk of synonyms.

Yeah, exposing more of this would be awesome. Also from the perspective of a scientist trying to make use of this data, it would be great if the information could be available as some form of database or text file dump rather than only through the API. The API is great when you want information on a handful of species, but less ideal for more meta-analysis and synthesis work. Having researchers write loops to have some server ping an API 1.7 million times is not really in the best interests of anyone involved :-)

The format ITIS and NCBI use to declare synonyms in their current database dumps is reasonably convenient, but munging this out of a bunch of compressed JSON would still be preferable to hammering EOL servers with requests.
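As a rough sketch of what working from a dump rather than the API could look like, the Python below streams records from gzip-compressed, newline-delimited JSON. Everything about the format is an assumption for illustration: no such EOL dump exists in this thread, the `id` and `synonyms` field names are hypothetical, and the sample records are made up.

```python
import gzip
import io
import json

def iter_records(stream):
    """Yield one decoded JSON object per non-empty line of a gzip stream."""
    with gzip.open(stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                yield json.loads(line)

# In-memory stand-in for a downloaded dump file (hypothetical records).
SAMPLE = gzip.compress(
    b'{"id": 328547, "synonyms": ["example synonym"]}\n'
    b'{"id": 1, "synonyms": []}\n'
)

# One pass over the local file replaces one API request per taxon.
synonyms_by_id = {
    rec["id"]: rec["synonyms"] for rec in iter_records(io.BytesIO(SAMPLE))
}
```

Because the records are read line by line, the whole dump never has to fit in memory at once.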

KatjaSchulz commented 6 years ago

Noted. We're definitely planning on making data dumps available. We often hear from people that they want all of our trait data or all of our taxonomy data. There will also be a query interface with the option to download results.