pbs-assess / pacea

An R package to house Pacific Region ecosystem data to help facilitate an ecosystem approach to fisheries.

Storing large spatiotemporal ROMS (etc.) outputs - is making paceaData package best? #21

Closed andrew-edwards closed 10 months ago

andrew-edwards commented 1 year ago

So we now have ROMS outputs mapped onto a grid (2x2 km inshore, 10x10 km offshore), and Travis has been testing different file sizes for saving it.

Sean suggested the caching idea, and I set up a pacea_cache() function for that (it just returns a local directory for the user). But now we're thinking a separate paceaData package may be best, which is roughly how the rnaturalearth family of packages does it. So we have three options:

Option 1: Ideally just store everything in pacea, but this may not prove feasible. We're not quite sure yet, but testing should be finished soon.

Option 2: paceaData, a standalone data package (the R Packages book does discuss this sort of idea). It would house the large spatiotemporal data sets such as the ROMS and satellite SST data, while simple temporal indices would remain in pacea. The ROMS output would be updated once a year, though the SST data could be updated every week if wanted (we probably wouldn't want that).

Option 3: Have some functions in pacea that download outputs/data from somewhere (may just be GitHub anyway) and cache them locally on the user's computer (always in the pacea_cache() directory). This just seems a bit complicated - if the data get updated automatically then users' analyses may confusingly change without them being aware of it (especially if we automate it all directly from an existing SST GitHub repository). Whereas a package would have to be manually updated by the user, so they will be aware that something has changed.
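For concreteness, Option 3 could look something like this minimal sketch (the function name, repository URL, and file layout here are hypothetical, not pacea's actual API):

```r
# Hypothetical sketch of Option 3: download a named data object from a
# GitHub-hosted file into the pacea cache directory, re-using it if present.
get_pacea_data <- function(name,
                           base_url = "https://raw.githubusercontent.com/pbs-assess/paceaData/main/data/") {
  cache_dir <- pacea_cache()                  # local directory for cached files
  local_file <- file.path(cache_dir, paste0(name, ".rds"))
  if (!file.exists(local_file)) {
    download.file(paste0(base_url, name, ".rds"),
                  destfile = local_file, mode = "wb")
  }
  readRDS(local_file)
}
```

A first call downloads the file; later calls just read the cached copy, which is where the "silent update" concern comes in if the remote file changes.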

Tagging @seananderson @cgrandin for any thoughts? Thanks!

I think we're leaning towards Option 2 if Option 1 is infeasible. Seems a bit more straightforward.

cgrandin commented 1 year ago

rnaturalearthhires has a download script in its data-raw folder (https://github.com/ropensci/rnaturalearthhires/blob/master/data-raw/data_download_script.r) which the package developer runs source() on, creating or overwriting RDA files in the data folder, all of which is pushed to the GitHub repo. These files are large, and they are what sits on GitHub in the rnaturalearthhires repo. This would be done whenever the spatial data are updated.

Those are binary files, so the git repo will get huge if you update them a lot (at once a year it will be a few years before it gets unmanageable), because all copies of binary files remain in the git history. In that case you might want a direct download from some data-hosting URL (Google Drive or whatever), with a versioning check comparing what's on the user's disk with the current version (at that URL), downloading if the local version is lower. That's the way I'd do it.

To avoid the issues you mention, you could have a prompt asking if the user wants to download and apply the new version, plus a function to get an old version back if they want, and a function to tell them which version they currently have, whose output you would paste into the message asking if they want to upgrade.

How I would proceed:

  1. Create files on the server you want to download from:
     a. The data file(s) you want to start with, with a unique version number in the name (use a single digit, it is simplest).
     b. A .csv or binary .rds file containing a data frame with three or more columns: the version number, one column each for the file names for that version number, and a boolean for whether it is the newest version or not - there will only be one TRUE.
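The lookup file in 1(b) might look like this (column names and file names are illustrative):

```r
# Illustrative version-lookup table for the server, saved as versions.csv.
# Exactly one row has newest = TRUE.
versions <- data.frame(
  version  = c(1, 2),
  filename = c("roms_data_v1.rds", "roms_data_v2.rds"),
  newest   = c(FALSE, TRUE)
)
write.csv(versions, "versions.csv", row.names = FALSE)
```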

  2. Create these functions:
     a. get_newest_version_num(): check the website for the current version number, from the file you set up in 1(b).
     b. get_local_version_num(): read a file you have set up locally containing the currently-installed version number, and report that number.
     c. set_local_version_num(version_num): set the version number in the local file containing the currently-installed version number.
     d. update_content_toversion(version_num): download the RDA file associated with that version and store it as the current RDA file for the data (keep the filename the same - don't put the version in the local filename, as that would require more complex code to deal with), then store the version number in the file mentioned in 2(b). The lookup file on the website, as described in 1(b), is used to extract the download file name(s) so you can build the request HTTP string(s).
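A minimal sketch of functions 2(a)-2(d), assuming the server layout from step 1; the base URL and file names are placeholders:

```r
base_url <- "https://example.com/pacea-data/"   # placeholder data-hosting URL
local_version_file <- file.path(pacea_cache(), "version.txt")

get_newest_version_num <- function() {
  lookup <- read.csv(url(paste0(base_url, "versions.csv")))
  lookup$version[lookup$newest]
}

get_local_version_num <- function() {
  if (!file.exists(local_version_file)) return(0)  # nothing installed yet
  as.numeric(readLines(local_version_file, n = 1))
}

set_local_version_num <- function(version_num) {
  writeLines(as.character(version_num), local_version_file)
}

update_content_toversion <- function(version_num) {
  lookup <- read.csv(url(paste0(base_url, "versions.csv")))
  remote_file <- lookup$filename[lookup$version == version_num]
  # Keep the local filename constant, as suggested above.
  download.file(paste0(base_url, remote_file),
                destfile = file.path(pacea_cache(), "pacea_data.rds"),
                mode = "wb")
  set_local_version_num(version_num)
}
```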

When the package is loaded (https://search.r-project.org/CRAN/refmans/rlang/html/on_load.html), run code based on this pseudocode:

newest_version_num <- get_newest_version_num()
if (get_local_version_num() < newest_version_num) {
  if (prompt("want to upgrade?")) {
    update_content_toversion(newest_version_num)
  }
}

I would start by looking at the curl package, since it uses the dependable old tried-and-true curl library on the OS for downloads (https://cran.r-project.org/web/packages/curl/vignettes/intro.html). Or the httr package, which is based on curl but simpler to use (https://cran.r-project.org/web/packages/httr/vignettes/quickstart.html).
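As an example of the httr route (the URL is a placeholder; GET(), write_disk(), and stop_for_status() are standard httr functions):

```r
library(httr)

# Download a file straight to disk, failing loudly on HTTP errors.
resp <- GET("https://example.com/pacea-data/roms_data_v2.rds",
            write_disk(file.path(pacea_cache(), "pacea_data.rds"),
                       overwrite = TRUE))
stop_for_status(resp)
```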

By the way, the rnaturalearthhires RDA files contain single SpatialPolygonsDataFrame objects (sp package).

Wow, this is more than I intended to write - I got carried away.

andrew-edwards commented 1 year ago

Thanks Chris for spelling this out. We vaguely had some of these kinds of ideas in mind, but your details are helpful. Am away right now.

travistai2 commented 1 year ago

A few things.

1. I'm writing a function get_data() to download data from a GitHub URL where our data files would be stored (e.g. "...github...paceaData/data/somedatafile.rda"). It seems to work so far, and this allows users to choose which data they want to store locally. I think writing a function to then delete the data once they are done would also be helpful. There will also be 'shortcut' functions that call the original get_data() function but are named after whatever data layer the user wants (e.g. if they want roms_sst then the function roms_sst() would indirectly call get_data("roms_sst")). Once the data is downloaded, calling the function again will just return the dataset.
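The shortcut-function pattern described above could look roughly like this sketch (the repository URL and function names are illustrative, not the actual pacea code):

```r
# Generic downloader: fetch "<name>.rda" from the paceaData repo into the
# cache directory on first use, then load from the cache on later calls.
get_data <- function(name) {
  local_file <- file.path(pacea_cache(), paste0(name, ".rda"))
  if (!file.exists(local_file)) {
    url <- paste0("https://raw.githubusercontent.com/pbs-assess/paceaData/main/data/",
                  name, ".rda")
    download.file(url, destfile = local_file, mode = "wb")
  }
  e <- new.env()
  load(local_file, envir = e)   # .rda files restore the object under its name
  get(name, envir = e)
}

# 'Shortcut' wrapper named after the data layer it returns.
roms_sst <- function() get_data("roms_sst")
```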

I based this loosely on the bcmaps package; one difference is that they have a huge online portal where all their data is stored and can be downloaded, and bcmaps sources the data directly from that website instead of storing the data in a package.

2. When this function is called and the user specifies the data object, it will download the data from GitHub and save it to a local cache directory. I'm using pacea_cache() to identify and create the local directory. Currently the pacea_cache() function creates the directory, e.g. C:\Users\TAIT\AppData\Local/pacea/Cache. I'm wondering whether we want it to be ".../Local/R/pacea/Cache"? It would just need a small change in the function if so.
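For the directory question, one common pattern is to build the path with rappdirs::user_cache_dir(), whose appauthor argument inserts the extra "R" level on Windows; this is a sketch of that approach, not necessarily pacea's current implementation:

```r
pacea_cache <- function() {
  # On Windows, user_cache_dir("pacea", appauthor = "R") gives e.g.
  # C:/Users/<user>/AppData/Local/R/pacea/Cache.
  dir <- rappdirs::user_cache_dir("pacea", appauthor = "R")
  if (!dir.exists(dir)) dir.create(dir, recursive = TRUE)
  dir
}
```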

3. I like what Chris has suggested regarding rnaturalearthhires. It would also be a good way to incorporate into the code whether or not the data has been updated. Currently my function is interactive and asks the user whether they want to download the data to a local cache folder; perhaps the function could also have a TRUE/FALSE parameter controlling whether it looks in the GitHub paceaData repo for an updated version.

I did have a look through rnaturalearth and rnaturalearthhires. I think what happens is that if you want data from rnaturalearth, it checks whether you want normal-resolution data (rnaturalearthdata) or high-resolution data (rnaturalearthhires) and downloads that entire data package. I think we want to avoid having a user download all of paceaData if they are just interested in one variable.


So many different ways to do things!