ropensci / rfishbase

R interface to the fishbase.org database
https://docs.ropensci.org/rfishbase

`fishbase` object removed between 3.0.1 and 4.0.0 #240

Closed James-Thorson-NOAA closed 2 years ago

James-Thorson-NOAA commented 2 years ago

First off, thanks Carl and all developers/maintainers for this amazing resource! I've used rfishbase as a dependency of my R package FishLife for several years, properly citing the dependency in publications in Ecol Appl and Fish and Fisheries.

I specifically use rfishbase to provide a data object, fishbase, containing taxonomy for all fishes in FishBase. This object was available in rfishbase releases 3.0.1 through 3.0.4 (the latter being the one I used through R release 4.0.2). However, upgrading to R release 4.1.0 installs rfishbase release 4.0.0, which appears to no longer include the fishbase data object.

Has the fishbase data object been permanently removed, or was it renamed or moved? If it was removed, I recommend also deleting the man file for fishbase, and I would welcome a pointer to how the updated version handles taxonomic queries. If it was renamed or moved, can you clarify where it went?

> install.packages("rfishbase")
--- Please select a CRAN mirror for use in this session ---
trying URL 'https://cloud.r-project.org/bin/windows/contrib/4.1/rfishbase_4.0.0.zip'
Content type 'application/zip' length 839631 bytes (819 KB)
downloaded 819 KB

package ‘rfishbase’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
        C:\Users\James.Thorson\AppData\Local\Temp\Rtmp4Gqam5\downloaded_packages
> library(rfishbase)
Warning message:
package ‘rfishbase’ was built under R version 4.1.2 
> packageVersion("rfishbase")
[1] ‘4.0.0’
> ?fishbase
starting httpd help server ... done
> fishbase
Error: object 'fishbase' not found
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rfishbase_4.0.0

loaded via a namespace (and not attached):
 [1] magrittr_2.0.1    hms_1.1.1         progress_1.2.2    tidyselect_1.1.1  R6_2.5.1          rlang_0.4.12      fastmap_1.1.0     fansi_0.5.0       stringr_1.4.0    
[10] httr_1.4.2        dplyr_1.0.7       tools_4.1.0       utf8_1.2.2        DBI_1.1.1         dbplyr_2.1.1      askpass_1.1       ellipsis_0.3.2    openssl_1.4.5    
[19] assertthat_0.2.1  tibble_3.1.6      lifecycle_1.0.1   crayon_1.4.2      tzdb_0.2.0        readr_2.1.1       purrr_0.3.4       vctrs_0.3.8       fs_1.5.2         
[28] contentid_0.0.15  curl_4.3.2        cachem_1.0.6      memoise_2.0.1     glue_1.6.0        stringi_1.7.6     compiler_4.1.0    pillar_1.6.4      prettyunits_1.1.1
[37] generics_0.1.1    pkgconfig_2.0.3  

cboettig commented 2 years ago

Hey @James-Thorson-NOAA, always nice to hear from you. Yes, apologies for that.

For taxonomic tables, there is still load_taxa(), which merely joins the underlying taxonomy tables of the database for you, working around a few oddities and differences between the sealifebase and fishbase taxa (see https://github.com/ropensci/rfishbase/blob/master/R/load_taxa.R).

Actually, I think fishbase as a function for the taxa tables should have been deprecated earlier; it was merely a 'pre-computed' load_taxa() call.
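For a package that has to run against both older and newer rfishbase releases, the switch could be wrapped in a small helper along these lines. This is only a sketch under stated assumptions: get_fish_taxa is a hypothetical name (not part of rfishbase), the old object is assumed to be reachable as rfishbase::fishbase in 3.x installs, and the columns of the two results may not match exactly, so downstream code should be checked before relying on it.

```r
# Hypothetical compatibility helper (not part of rfishbase).
get_fish_taxa <- function() {
  if (utils::packageVersion("rfishbase") >= "4.0.0") {
    # 4.0.0 and later: build the taxonomy table from the snapshot tables.
    rfishbase::load_taxa()
  } else {
    # 3.x: fall back to the data object bundled with the package
    # (assumed to be accessible this way).
    rfishbase::fishbase
  }
}
```

Calling get_fish_taxa() then returns a taxonomy table from whichever source the installed release provides.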

I'm still not particularly happy with the representation of taxonomy or synonyms. rfishbase 4.0 really aims at exposing the underlying tables as truthfully and efficiently as possible, warts and all (no fault to the FishBase team -- that's what happens when you have to continue to extend a SQL database with more and new types of data for nearly three decades and who knows how many versions of database software...). Really, one day we ought to have more overlay functions that can smooth over these things, but I have never gotten around to it. I'd like taxonomy and synonym handling closer to what we have in taxadb, following Darwin Core; though as you know, programmatic approaches to taxonomy are inherently a dicey business!

James-Thorson-NOAA commented 2 years ago

Thanks for explaining all of this! I have my fix for FishLife, which just involves installing the most recent rfishbase release that still contained fishbase. I'm closing the issue, and will aim to update to using load_taxa() in future updates.

So that I fully understand, does load_taxa use the API of the current FishBase version, so that load_taxa will always give an up-to-date version of their taxa tables? That seems very helpful, but I ask so that I can know whether to keep a static copy in FishLife (which currently just does a static download of data and then runs models on that output, which I update every couple of years).

cboettig commented 2 years ago

FishBase doesn't have an API; they just send us snapshot MySQL dumps roughly semi-annually. Historically @sckott maintained a Ruby-based API at https://fishbase.ropensci.org/ as a frontend to those dumps, but it was never strictly "current". Since these were already static snapshots to begin with, I dispensed with the API middleman in rfishbase 3.0, which just downloaded the tables in tab-separated-value format directly. Once downloaded, rfishbase cached those (in the directory set by FISHBASE_HOME or otherwise the default cache). The 4.0 version keeps this pattern with a few minor tweaks: the data are formatted in parquet, which preserves things like data type (logical/integer/character) from the original dump, and they do not need to be imported into a relational database in order to be queried. All snapshots are versioned, so you can specify which dump you want to pull (with the default always going to the latest).
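For the static-copy question, one option is to pin a snapshot version explicitly and save the result next to the analysis. The sketch below is hedged: snapshot_taxa is a hypothetical helper, and it assumes load_taxa() accepts a version argument as described above for rfishbase 4.0; the version strings themselves should be taken from the package rather than guessed.

```r
# Hypothetical helper (not part of rfishbase): fetch the taxonomy table for
# one pinned snapshot and save a static copy for reproducibility.
snapshot_taxa <- function(version, dir = ".") {
  # Assumes load_taxa() takes a `version` argument selecting the dump.
  taxa <- rfishbase::load_taxa(version = version)
  out <- file.path(dir, paste0("taxa_", version, ".rds"))
  saveRDS(taxa, out)
  # Return the path of the saved copy without printing it.
  invisible(out)
}
```

If the installed release provides rfishbase::available_releases(), its output can be used to pick the version to pass in; readRDS() on the returned path then reloads exactly the same table later, regardless of what "latest" has become.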

rfishbase has always had logic so that it can get the "latest" snapshot without necessarily updating to the latest release of rfishbase (previously it would check the GitHub releases tab; in 4.0 it checks the online data provenance log instead, with a fallback to the packaged copy).

The deprecated function fishbase was the exception to this. It wasn't accessing the raw dump files that are external to the package, but instead accessed a precomputed copy that was actually bundled in the rfishbase R package as an internal data object. This meant that it would eventually lag behind the "latest" snapshot unless it was updated. (There was also the chance for error where I would forget to update the packaged version.) Since load_taxa() should be sufficiently fast on repeated use (let me know otherwise), I figured this old function was just asking for trouble and dropped it. In rfishbase 4.0, all data should correspond directly to the snapshots we get from the official FishBase team.

It should also be relatively easy now to actually bypass rfishbase and access any specific snapshots of any specific tables directly. Because the snapshots are just provided as middleware for rfishbase, I don't feel I can officially deposit them in a formal DOI-granting archive like Zenodo, so they are merely cached here on GitHub and also in the Software Heritage archive. Instead of DOIs, rfishbase uses content identifiers to resolve each table. For example, the parquet-serialized species table has identifier hash://sha256/11284f8036fdb3599ebeb503c6e32dab6642ffbc1f5be1083c1590eb962a188b. We can read this file in R by just resolving the identifier and then reading in the file:

species.parquet <- contentid::resolve("hash://sha256/11284f8036fdb3599ebeb503c6e32dab6642ffbc1f5be1083c1590eb962a188b", store=TRUE)
arrow::read_parquet(species.parquet)

Using store=TRUE on the identifier resolution preserves a copy of the file locally, so it won't be downloaded a second time if you were to re-run the same code. Otherwise it is downloaded to a temporary file. Using sha256 hashes ensures precise version identification and file integrity, and means we can positively identify the data file from any number of locations, rather than trusting a single central authority such as a data repository.
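To make the integrity point concrete, a file obtained from any location can be checked against its identifier by recomputing the digest. The following is a minimal sketch, not how rfishbase or contentid do it internally: matches_content_id is a hypothetical helper, and it assumes the openssl package is installed. It is demonstrated on a small temporary file so that no download is needed.

```r
# Hypothetical helper: TRUE if the file's SHA-256 digest matches the
# hex string in a hash://sha256/ identifier.
matches_content_id <- function(path, id) {
  expected <- tolower(sub("^hash://sha256/", "", id))
  con <- file(path, open = "rb")
  on.exit(close(con))
  # openssl::sha256() on a connection returns the digest as raw bytes.
  digest_raw <- openssl::sha256(con)
  # Each raw byte converts to a two-character hex string.
  actual <- paste(as.character(unclass(digest_raw)), collapse = "")
  identical(actual, expected)
}

# Self-contained check: the bytes "abc" and their standard SHA-256 digest.
tmp <- tempfile()
writeBin(charToRaw("abc"), tmp)
matches_content_id(
  tmp,
  "hash://sha256/ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
)
```

The same call could be pointed at the path returned by contentid::resolve() together with the identifier that was resolved.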

Ok that's probably more than you wanted...

James-Thorson-NOAA commented 2 years ago

Once again, I really appreciate all the careful thought and time you've put into this. I'll think through the parquet stuff, but as you say load_taxa() seems sufficient for future purposes.