Closed James-Thorson-NOAA closed 2 years ago
Hey @James-Thorson-NOAA , always nice to hear from you. Yes, apologies for that.
For taxonomic tables, there is still load_taxa(), which merely joins the underlying taxonomy tables of the database for you (see https://github.com/ropensci/rfishbase/blob/master/R/load_taxa.R, working around a few oddities and differences between sealifebase and fishbase taxa).
Actually I think fishbase as a function for the taxa tables should have been deprecated earlier; it was merely a 'pre-computed' load_taxa() call.
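For anyone migrating from the old fishbase object, a minimal sketch of a drop-in lookup follows. The helper name taxonomy_for is hypothetical (it is not part of rfishbase), and the column names in the toy table are assumed to match what load_taxa() returns.

```r
# Hypothetical helper (not part of rfishbase): return taxonomy rows for the
# requested species names.
# `taxa` defaults to rfishbase::load_taxa(); R evaluates default arguments
# lazily, so nothing is downloaded when a table is passed in explicitly.
taxonomy_for <- function(species, taxa = rfishbase::load_taxa()) {
  taxa <- as.data.frame(taxa)
  taxa[taxa$Species %in% species, , drop = FALSE]
}

# Tiny stand-in table; the column names are assumed to mirror load_taxa().
toy_taxa <- data.frame(
  Species = c("Gadus morhua", "Salmo salar"),
  Genus   = c("Gadus", "Salmo"),
  Family  = c("Gadidae", "Salmonidae"),
  stringsAsFactors = FALSE
)

cod <- taxonomy_for("Gadus morhua", taxa = toy_taxa)
print(cod)
```

With rfishbase installed, calling taxonomy_for("Gadus morhua") without the taxa argument would fall back to load_taxa() and so use whichever snapshot rfishbase resolves.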
I'm still not particularly happy with the representation of taxonomy or synonyms. rfishbase 4.0 really aims at exposing the underlying tables as truthfully and efficiently as possible, warts and all (no fault to the FishBase team -- that's what happens when you have to continue to extend a SQL database with more and new types of data for nearly three decades and who knows how many versions of database software...). Really one day we ought to have more overlay functions that can smooth over these things, but I have never gotten around to it. I'd like taxonomy and synonym handling closer to what we have in taxadb following Darwin Core, though as you know programmatic approaches to taxonomy are inherently a dicey business!
Thanks for explaining all of this! I have my fix for FishLife, which just involves installing the most recent rfishbase release that still contained fishbase. I'm closing the issue, and will aim to update to using load_taxa() in future updates.
So that I fully understand, does load_taxa use the API of the current FishBase version, so that load_taxa will always give an up-to-date version of their taxa tables? That seems very helpful, but I ask so that I can know whether to keep a static copy in FishLife (which currently just does a static download of data and then runs models on that output, which I update every couple years).
FishBase doesn't have an API; they just send us snapshot MySQL dumps roughly semi-annually. Historically @sckott maintained a Ruby-based API at https://fishbase.ropensci.org/ as a frontend to those dumps, but it was never strictly "current". Since these were already static snapshots to begin with, I dispensed with the API middleman in rfishbase 3.0, which just downloaded the tables in tab-separated-value format directly. Once downloaded, rfishbase cached those (in the directory set by FISHBASE_HOME or otherwise the default cache). The 4.0 version keeps this pattern with a few minor tweaks: data are formatted in parquet, which preserves things like data type (logical/integer/character) from the original dump, and do not need to be imported into a relational database in order to query. All snapshots are versioned, so you can specify which dump you want to pull (with the default always going to the latest).
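As a rough sketch of the caching point, the cache location can be redirected by setting the FISHBASE_HOME environment variable from R. The directory used here is a hypothetical stand-in under tempdir(); a real project would more likely use a persistent path, and the variable presumably needs to be set before rfishbase first downloads anything.

```r
# Hypothetical cache location; tempdir() is used only so the sketch is
# self-contained. Substitute a persistent project directory in practice.
cache_dir <- file.path(tempdir(), "fishbase-cache")
dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)

# Set the environment variable for this R session; assumed to take effect
# for tables that rfishbase downloads afterwards.
Sys.setenv(FISHBASE_HOME = cache_dir)

print(Sys.getenv("FISHBASE_HOME"))
```

Keeping the cache inside a project directory is one way to keep a static copy of a snapshot alongside the analysis that used it.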
rfishbase has always had logic so that it can get the "latest" snapshot without necessarily updating to the latest release of rfishbase. (Previously it would check the GitHub releases tab; in 4.0 it checks the online data provenance log instead, with a fallback to the packaged copy.)
The deprecated function fishbase was the exception to this. It wasn't accessing the raw dump files that are external to the package, but instead accessed a precomputed copy that was actually bundled in the rfishbase R package as an internal data object. This meant that it would eventually lag behind the "latest" snapshot unless it was updated. (There was also the chance for error where I would forget to update the packaged version.) Since load_taxa() should be sufficiently fast on repeated use (let me know otherwise), I figured this old function was just asking for trouble and dropped it. In rfishbase 4.0, all data should correspond directly to the snapshots we get from the official FishBase team.
It should also be relatively easy now to actually bypass rfishbase and access any specific snapshots of any specific tables directly. Because the snapshots are just provided as middleware for rfishbase, I don't feel I can officially deposit them in a formal DOI-granting archive like Zenodo, so they are merely cached here on GitHub and also in the Software Heritage archive. Instead of DOIs, rfishbase uses content identifiers to resolve each table. For example, the parquet-serialized species table has identifier hash://sha256/11284f8036fdb3599ebeb503c6e32dab6642ffbc1f5be1083c1590eb962a188b. We can read this file in R by just resolving the identifier and then reading in the file:
species.parquet <- contentid::resolve("hash://sha256/11284f8036fdb3599ebeb503c6e32dab6642ffbc1f5be1083c1590eb962a188b", store=TRUE)
arrow::read_parquet(species.parquet)
Using store=TRUE on the identifier resolution preserves a copy of the file locally, so it won't be downloaded a second time if you were to re-run the same code. Otherwise it is downloaded to a temporary file. Using sha256 hashes ensures precise version identification and file integrity, and means we can positively identify the data file from any number of locations, rather than trusting to a single central authority such as a data repository.
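To illustrate the integrity point, here is a minimal sketch of checking a local file against a hash://sha256/ identifier. It assumes the digest package is installed; the helper name matches_content_id is hypothetical and is not part of contentid or rfishbase. The demonstration uses a small temporary file containing the bytes "abc", whose SHA-256 is a standard test vector, rather than a real FishBase table.

```r
library(digest)

# Hypothetical helper: TRUE if the SHA-256 of the file at `path` equals the
# hex digest embedded in a hash://sha256/<hex> identifier.
matches_content_id <- function(path, id) {
  expected <- tolower(sub("^hash://sha256/", "", id))
  actual <- digest::digest(path, algo = "sha256", file = TRUE)
  identical(expected, actual)
}

# Self-contained demonstration: write the three bytes "abc" to a temp file.
demo_file <- tempfile()
writeBin(charToRaw("abc"), demo_file)

# SHA-256 of "abc" (standard test vector), written as a content identifier.
demo_id <- paste0(
  "hash://sha256/",
  "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
)

ok <- matches_content_id(demo_file, demo_id)
print(ok)
```

The same check could be applied to a downloaded parquet file and its identifier, whichever location the file came from.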
Ok that's probably more than you wanted...
Once again, I really appreciate all the careful thought and time you've put into this. I'll think through the parquet stuff, but as you say load_taxa() seems sufficient for future purposes.
First off, thanks Carl and all developers/maintainers for this amazing resource! I've used rfishbase as a dependency of my R package FishLife (code here) for several years, properly citing the dependency in publications in Ecol Appl and Fish and Fisheries.

I specifically use rfishbase to provide a data object fishbase containing taxonomy for all fishes in FishBase. This object was available in rfishbase release 3.0.1 through 3.0.4 (where the latter was used through R release 4.0.2). However, when upgrading to R release 4.1.0, it then installs rfishbase release 4.0.0, which appears to no longer include the data object fishbase.

Is the fishbase data object permanently removed, or has it been renamed? If so, I recommend also deleting the man file for fishbase, and would welcome a pointer to how the updated version handles taxonomic queries. If not, can you clarify where it was moved or renamed?