Closed by andrew-edwards 10 months ago
rnaturalearthhires has a download script in its data-raw folder (https://github.com/ropensci/rnaturalearthhires/blob/master/data-raw/data_download_script.r) which the package developer runs source() on, creating or overwriting the RDA files in the data folder, all of which is pushed to the GitHub repo. These files are large, and they are what sits on GitHub in the rnaturalearthhires repo. This would be done whenever the spatial data are updated. Those are binary files, so the git repo will get huge if you update it a lot (at once a year it will be a few years before it gets unmanageable), because all copies of binary files remain in the git history. In that case you might want a direct download from some data-hosting URL (Google Drive or whatever), with a versioning check comparing what's on the user's disk with the current version (at that URL), downloading if the local version is lower. That's the way I'd do it.
To avoid the issues you mention, you could have a prompt asking if they want to download and apply the new version. You could also have a function to get an old version back if they want, and a function to tell them which version they currently have - you would run the latter and paste its output into the message asking if they want to upgrade.
How I would proceed:
Create files on the server you want to download from:
a. The data file(s) you want to start with, with a unique version number in the name (use a single number, it is simplest)
b. A .csv or binary .rds file containing a data frame with three or more columns: the version number, one column for each file name belonging to that version, and a boolean for whether it is the newest version - there will only be one TRUE
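As a sketch, the lookup file from 1(b) might look like this in R (the column names and file names here are illustrative, not prescribed):

```r
# Illustrative lookup table for step 1(b); file names are placeholders.
lookup <- data.frame(
  version  = c(1, 2),
  filename = c("pacea_data_v1.rda", "pacea_data_v2.rda"),
  newest   = c(FALSE, TRUE)
)

# Write it out to host on the server alongside the data files:
write.csv(lookup, file.path(tempdir(), "data_versions.csv"), row.names = FALSE)

# The newest version number is the single row flagged TRUE:
newest_version <- lookup$version[lookup$newest]
```

Keeping exactly one TRUE in the newest column means clients never have to sort or compare version numbers server-side.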
Create these functions:
a. get_newest_version_num(): check the website for the current version number, from the file on there you set up in 1(b)
b. get_local_version_num(): read a file you have set up locally containing the currently-installed version number, and report that number
c. set_local_version_num(version_num): set the version number in the local file containing the currently-installed version number
d. update_content_toversion(version_num): download the RDA file associated with that version and store it as the current RDA file for the data (keep the filename the same - don't put the version in the local filename, as it would require more complex code to deal with), then store the current version number in the file mentioned in 2(b). The lookup file on the website, described in 1(b), is used to extract the download file name(s) so you can build the HTTP request string(s).
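A hedged sketch of what those four helpers could look like; base_url, the cache location, and the file layout are placeholders, not real endpoints:

```r
# Sketches of the four helpers; base_url and file names are hypothetical.
base_url <- "https://example.com/pacea-data"      # placeholder host
cache_dir <- tempdir()                            # or wherever you cache
local_version_file <- file.path(cache_dir, "version.txt")

get_newest_version_num <- function() {
  lookup <- read.csv(paste0(base_url, "/data_versions.csv"))
  lookup$version[lookup$newest][1]        # the single row flagged TRUE
}

get_local_version_num <- function() {
  if (!file.exists(local_version_file)) return(0)  # nothing installed yet
  as.integer(readLines(local_version_file, n = 1))
}

set_local_version_num <- function(version_num) {
  writeLines(as.character(version_num), local_version_file)
}

update_content_toversion <- function(version_num) {
  lookup <- read.csv(paste0(base_url, "/data_versions.csv"))
  remote_file <- lookup$filename[lookup$version == version_num][1]
  # Keep the local filename fixed so the rest of the package never
  # needs to know about versions.
  download.file(paste0(base_url, "/", remote_file),
                destfile = file.path(cache_dir, "pacea_data.rda"),
                mode = "wb")
  set_local_version_num(version_num)
}
```

Returning 0 from get_local_version_num() when no version file exists means a fresh install always compares as older than anything on the server.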
When the package is loaded (https://search.r-project.org/CRAN/refmans/rlang/html/on_load.html) run your code based on this pseudocode:
newest_version_num <- get_newest_version_num()
if(get_local_version_num() < newest_version_num){
  if(prompt("want to upgrade?")){
    update_content_toversion(newest_version_num)
  }
}
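One way the pseudocode above could be wired into package load, using base R's .onLoad hook rather than rlang::on_load (the message text is made up, and the whole check is wrapped in tryCatch so a failed connection can't break package loading):

```r
# Sketch of a load-time version check; assumes the helper functions from
# the steps above exist. Prompts only in interactive sessions.
.onLoad <- function(libname, pkgname) {
  tryCatch({
    newest_version_num <- get_newest_version_num()
    if (get_local_version_num() < newest_version_num) {
      if (interactive() &&
          isTRUE(utils::askYesNo("A newer data version is available. Upgrade?"))) {
        update_content_toversion(newest_version_num)
      }
    }
  }, error = function(e) invisible(NULL))  # never block loading on a bad connection
}
```

Guarding on interactive() matters because .onLoad also fires in scripts and R CMD check, where a prompt would hang.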
I would start by looking at the curl package, since it uses the dependable, old tried-and-true curl library on the OS for downloads (https://cran.r-project.org/web/packages/curl/vignettes/intro.html). Or the httr package, which is built on curl but simpler to use (https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html).
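A minimal httr sketch for fetching one file to disk (the URL is a placeholder; the helper name is made up here):

```r
# Hypothetical download helper using httr; the URL passed in is up to you.
library(httr)

download_with_httr <- function(file_url, dest) {
  resp <- GET(file_url, write_disk(dest, overwrite = TRUE))
  stop_for_status(resp)  # error out on 404s etc. rather than caching junk
  invisible(dest)
}

# Example call (placeholder URL):
# download_with_httr("https://example.com/pacea-data/pacea_data_v2.rda",
#                    "pacea_data.rda")
```

stop_for_status() is worth the extra line: without it a 404 error page can silently end up saved as your "data" file.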
By the way, the rnaturalearthhires RDA files each contain a single SpatialPolygonsDataFrame object (sp package).
Wow, this is more than I intended to write - I got carried away.
Thanks Chris for spelling this out. We vaguely had some of these kinds of ideas in mind, but your details are helpful. Am away right now.
A few things.
1. I'm writing a function get_data() to download data from a GitHub URL and store our data files (e.g. "...github...paceaData/data/somedatafile.rda"). Seems to work so far, and this allows users to choose which data they want to store locally. I think writing a function to then delete the data once they are done would also be helpful. There will also be 'shortcut' functions that call the original get_data() function but are named after whatever data layer the user wants (e.g. if they want roms_sst then the function roms_sst() would indirectly call get_data("roms_sst")). Once the data is downloaded, calling the function again will just return the dataset.
I based this loosely on the bc_maps package; one difference is that they have a huge online portal where all their data is stored and can be downloaded, and bc_maps sources the data directly from the website instead of storing the data in a package.
2. When this function is called and the user specifies the data object, it will download the data from GitHub and save it to a local cache directory. I'm using pacea_cache() to identify and create the local directory. Currently the pacea_cache() function creates the directory, e.g.:
C:\Users\TAIT\AppData\Local/pacea/Cache
I'm wondering whether we want it to be ".../Local/R/pacea/Cache"? Just need a small change in the function if so.
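For what it's worth, base R (>= 4.0) has tools::R_user_dir(), which already nests the cache under an R-specific directory, close to the ".../Local/R/..." layout wondered about above (the exact path differs by OS). A possible pacea_cache() built on it:

```r
# A hedged alternative pacea_cache() using base R's tools::R_user_dir();
# on Windows this lands under roughly .../AppData/Local/R/cache/R/pacea.
pacea_cache <- function() {
  dir <- tools::R_user_dir("pacea", which = "cache")
  if (!dir.exists(dir)) dir.create(dir, recursive = TRUE)
  dir
}
```

Using the base-R convention also avoids adding rappdirs (or similar) as a dependency just for one path.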
3. I like what Chris has suggested regarding rnaturalearthhires. It would also be a good way to incorporate into the code whether or not the data has been updated. Currently my function is interactive and asks the user whether they want to download the data to a local cache folder; perhaps the function could also have a TRUE/FALSE parameter for whether it should look in the GitHub paceaData repo for an updated version.
I did have a look through rnaturalearth and rnaturalearthhires. I think what happens is that if you want data from rnaturalearth, it checks whether you want normal-resolution data (rnaturalearthdata) or high-resolution data (rnaturalearthhires) and downloads that entire data package. I think we want to avoid having a user download all of paceaData if they are just interested in one variable.
So many different ways to do things!
So we now have ROMS outputs mapped onto a grid (2x2 km inshore, 10x10 km offshore), and Travis has been testing different file sizes for saving it.
Sean suggested the caching idea, and I set up a pacea_cache() function for that (it just returns a local directory for the user). But now we're thinking a separate paceaData package may be best, which is how the rnaturalearth(-ish) packages do it. So we have three options:
Option 1: Ideally just store everything in pacea, but this may not prove feasible. We're not quite sure yet, but testing should be finished soon.
Option 2: paceaData, which would be a standalone data package (the R Packages book does talk about this sort of idea). It would house the large spatiotemporal data sets like the ROMS and satellite SST data; simple temporal indices would remain in pacea. The ROMS output would be updated once a year, though the SST data could be updated every week if wanted (we probably wouldn't want that).
Option 3: Have some functions in pacea that get run to download outputs/data from somewhere (may just be GitHub anyway) and cache them locally on the user's computer (always in the pacea_cache() directory). This just seems a bit complicated - if the data get updated automatically then users' analyses may confusingly change without the user being aware of it (especially if we automate it all directly from an existing SST GitHub repository). Whereas the package would have to be manually updated by the user, so they will be aware that something has changed.
Tagging @seananderson @cgrandin for any thoughts? Thanks!
I think we're leaning towards Option 2 if Option 1 is infeasible. Seems a bit more straightforward.