ropensci / c14bazAAR

R Package - Download and Prepare C14 Dates from Different Source Databases
https://docs.ropensci.org/c14bazAAR
GNU General Public License v2.0
30 stars 12 forks source link

Retain online lookup tables for backwards compatibility? #134

Closed joeroe closed 3 years ago

joeroe commented 3 years ago

128 removed the lookup tables (e.g. url_reference.csv) from this repository, which means previous versions of the package are no longer functional:

remotes::install_github("ropensci/c14bazAAR@1.3.0", force = TRUE, quiet = TRUE)
c14bazAAR::get_c14data("14sea")
#> Trying to download all dates from the requested databases...
#>   |                                                          |                                                  |   0%  |                                                          |++++++++++++++++++++++++++++++++++++++++++++++++++|  99%
#> Warning in c14bazAAR::get_c14data("14sea"): There were errors:
#> 
#> 14sea --> https://raw.githubusercontent.com/ropensci/c14bazAAR/master/data-raw/url_reference.csv is not available. No internet connection?
#> 
#> Not all data might have been downloaded accurately!
#> Error in c14bazAAR::get_c14data("14sea"): 
#> 
#> Download failed for all databases.

Created on 2021-04-05 by the reprex package (v1.0.0)

While hopefully most users will upgrade to 2.0.0, this is might be a problem for the reproducibility of analyses using previous versions? It's also not necessarily clear to users who encounter this error that they need to upgrade the package to fix it.

nevrome commented 3 years ago

Thanks for bringing this up explicitly :+1: - I take this very seriously.

128 introduced a breaking change (v.2.0.0) and I think it's the superior solution. The old implementation only existed because of the long maintenance cycles for CRAN packages, which should not be updated too often. Now that we abandoned the CRAN version we enjoy more freedom.

So what about the broken pre-2.0.0 versions? Of course we may compromise the reproducibility of some scripts that rely on a specific c14bazAAR version, but c14bazAAR is already from a conceptional point of view not perfectly suitable for long-term reproducibility. We download the data from the raw source databases which might change or go offline any time. We have no control over that. For radon, for example, we even download a daily dump of the database! No reproducibility here. You should always make your code reproducible in a way that does not rely on any foreign archive or data source, except this data source is explicitly committed to long-term storage.

So for c14bazAAR I suggest to use it to download the data, but then only work with a freeze. Just like I did here:

https://github.com/nevrome/cultrans.bronzeageburials.article2019/blob/e4ff688ffd0bdbc9a2be022260e0eaff61a7e0e3/analysis/code/01_data_preparation_scripts/02_prepare_c14_data.R#L14-L16

An easy fix for c14bazAAR now would be to add the old files again to keep the old versions functional. But at some point we have to deprecate them. How long should this cycle be? I don't want to keep things around without a clear deadline. Unfortunately there is no way to warn the users of old versions: We can not monkey patch code on people's devices.

I suggest to add a well visible warning in the README, but still stick with the hard stop for everything pre-2.0.0. I expect the number of users for c14bazAAR to be fairly small anyway and as the rest of the download API did not change, the handful of users who experience this should be able to update very easily.

nevrome commented 3 years ago

In #120 you write that you can not update to the latest c14bazAAR version now. Why that, @joeroe?

joeroe commented 3 years ago

That was just me being in a rush and switching between versions in the same session, sorry. It works fine now!

This is a strange situation because it's more than an ordinary breaking change and more like a "retrospective breaking change" (I don't know if there's a term for that). Of course you're right that relying on a live was always a bad idea from a reproducibility point of view, given the volatility of the underlying databases, and hopefully not many people do that, but this is still a rather radical change even for users using the package to download and cache data.

So would it be worth at least retaining the old versions of url_reference.csv etc. for a short period of time? Perhaps a few months? Just to avoid a sudden shock for people who only updated their installed packages occasionally.

nevrome commented 3 years ago

Alright - let's do this then. Doesn't cost much and maybe it helps somebody.