pbs-assess / pacea

An R package to house Pacific Region ecosystem data to help facilitate an ecosystem approach to fisheries.

Finalise storing of data #9

Closed andrew-edwards closed 1 year ago

andrew-edwards commented 1 year ago

Want to avoid people needing to use spatial packages such as sf directly, and since we are saving all spatial data on a master grid we don't need to keep repeating the spatial information for each data set; i.e. the master grid gets defined once, and each cell is numbered.

  1. Need consistency in naming objects. Suggest keeping them simple and all lower case, such as pdo, enso, alpi, ...

  2. Need consistency in storing time. For non-spatial objects (e.g. PDO) it seems best to have a tibble with columns year month val low high

    where month is numeric 1:12, val is the actual value (e.g. PDO), and low and high are uncertainty bounds if they are available (won't be for all time series).

Each row is then a year-month combination. This is simpler than combining all, say, oceanographic indices into a single data object (with columns for each index), because they won't be defined for the same year-month combinations, and there'll be lots of wasteful NAs.
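
For illustration, a minimal sketch of what a non-spatial object such as pdo might look like under this scheme (the values are made up):

    library(tibble)

    # Sketch of a non-spatial index object: one row per year-month combination,
    # with low/high as uncertainty bounds (NA when not available).
    pdo <- tibble(
      year  = c(2006, 2006, 2006),
      month = c(1, 2, 3),          # numeric month, 1:12
      val   = c(0.2, -0.1, 0.5),   # the index value itself (made-up numbers)
      low   = NA_real_,            # lower uncertainty bound
      high  = NA_real_             # upper uncertainty bound
    )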

  3. For objects with space and time components. Columns will be year month 1 2 3 ... max_cell_number

where 1, 2, 3, ... max_cell_number are the labels for the master grid. All spatial data is stored in the package on the master grid - the fiddly conversions from original data to master grid are done by developers in the data-raw/ folder.

If there are uncertainty values available (e.g. lower confidence interval of SST), then save each of these as a new data object, something like sst_low.
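
A rough sketch of that wide spatio-temporal format (made-up values, only three master-grid cells shown):

    # Sketch of a spatio-temporal object on the master grid; columns "1", "2", ...
    # are master-grid cell numbers (values are made up).
    sst <- tibble::tibble(
      year  = c(2006, 2006),
      month = c(1, 2),
      `1`   = c(9.13, 9.05),
      `2`   = c(9.38, 9.21),
      `3`   = c(9.12, 8.98)
    )

    # If a lower confidence interval were available it would be a parallel object,
    # e.g. sst_low, with exactly the same year, month and cell columns.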

  4. For objects that are just spatial. Columns will probably be simply year month 1 2 3 ... max_cell_number, with NA for all the year and month values.

We can then have switches in plotting code, for example: if the years are all NA then it's obviously not a spatio-temporal object. Or give each object a class (see below).
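
Continuing the sst sketch above, the class idea could look something like this (the class name pacea_st is made up, not an actual pacea class):

    # Tag the object with a class so plotting code can dispatch on it, rather
    # than inferring the type from all-NA year/month columns.
    class(sst) <- c("pacea_st", class(sst))

    # A plot method for that class (body omitted; just showing the dispatch idea).
    plot.pacea_st <- function(x, ...) {
      # spatio-temporal plotting code would go here
      invisible(x)
    }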

  5. Save each data object separately, so we can easily query what it is using ?. Can lump things like oceanographic indices together in one help file, but still keep them separate. So ?pdo and ?alpi would link to the same help file, but each object will be its own tibble.

  6. Look at the current ERDDAP_DF that Joe made:

    > ERDDAP_DF
    # A tibble: 18,156 × 7
        Year Month Poly_ID Mean_sst Mean_chlorophyll Fraction_NA_sst Fraction_NA_c…¹
       <int> <int>   <int>    <dbl>            <dbl>           <dbl>           <dbl>
     1  2006     1       1     9.13            0.456          0.0927          0.0927
     2  2006     1       2     9.38            0.734          0               0
     3  2006     1       3     9.12            2.15           0.132           0.132
     4  2006     1       4     8.95            3.09           0.577           0.577
     5  2006     1       5     8.14            1.74           0.436           0.436
     6  2006     1       6     7.54            2.17           0.673           0.673
     7  2006     1       7     7.12            7.86           0.155           0.155
     8  2006     1       8     7.04            3.94           0.486           0.486
     9  2006     1       9   NaN             NaN              1               1
    10  2006     1      10     8.70            0.626          0               0
    # … with 18,146 more rows, and abbreviated variable name
    #   ¹​Fraction_NA_chlorophyll
    # - Use `print(n = ...)` to see more rows

He has a row for each cell-month-year combination, but also has a Fraction_NA_sst column that (I think) gives the fraction of the cell that is technically NA (for example, being on land or having no data), so this has to be made when building the data set in data-raw/. Think we may still need this; maybe it's better to just repeat each tibble as na_<dataset> to store these fractions, so they are available. Putting them into the data object itself may get cumbersome, unless we go with the long format that Joe started with as shown above.
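
For what it's worth, the conversion from Joe's long format to the wide master-grid format could be done in data-raw/ with something like this (a sketch using the column names from the tibble above; sst_wide and na_sst are hypothetical object names):

    library(dplyr)
    library(tidyr)

    # Reshape the long format (one row per year-month-cell) into the wide
    # master-grid format (one row per year-month, one column per cell).
    sst_wide <- ERDDAP_DF |>
      select(Year, Month, Poly_ID, Mean_sst) |>
      pivot_wider(names_from = Poly_ID, values_from = Mean_sst) |>
      rename(year = Year, month = Month)

    # Fraction_NA_sst could be reshaped the same way into a separate na_sst
    # object, keeping the fractions available without cluttering the main object.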

travistai2 commented 1 year ago

Raster data doesn't seem to store properly as an .rda file with usethis::use_data(). When the package is loaded, the raster object contains no information.

sf objects seem to work better as .rda files. I was able to directly convert the netCDF data from .nc to a SpatRaster using terra::rast(). The SpatRaster can then be assigned a CRS with terra::crs(), and then converted to an sf object using stars::st_as_stars() and sf::st_as_sf(). When loading the package and the sf data, you can use the plot() function directly without loading sf, and plotting the sf object still works.

So it might be good to store data as sf objects? Then the geometries are all available. Data frame format:

    lyr.1 lyr.2 ... geometry

where each lyr column is one month's data.
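
For reference, the conversion described above might look roughly like this (a sketch; the file name and CRS are placeholders):

    library(terra)
    library(stars)
    library(sf)

    # netCDF -> SpatRaster -> stars -> sf, as described above
    # ("sst.nc" and the EPSG code are placeholders).
    r <- terra::rast("sst.nc")         # read the netCDF file as a SpatRaster
    terra::crs(r) <- "EPSG:4326"       # assign a coordinate reference system
    sst_sf <- sf::st_as_sf(stars::st_as_stars(r))

    # plot(sst_sf) then works without the user attaching sf explicitly.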

seananderson commented 1 year ago

I'm not sure what's going on with the use_data(). Maybe raster needs to be loaded? Regardless, I'd vote for sf. It's an augmented data frame format that plays well with dplyr and ggplot.

andrew-edwards commented 1 year ago

Realised that another reason to try and keep file sizes of data objects small is that load_all() loads them in, and this might start taking a while if there are too many big ones (maybe it won't, but we might get big files quite quickly). I'm using load_all() all the time when developing (I tried attach = FALSE but that still loads in the data objects). Anyway, something to keep in mind.

seananderson commented 1 year ago

Keeping things as small as possible is definitely good. However, most users will be using library(), in which case the data can be lazy-loaded, assuming the DESCRIPTION file is set up for that. I.e., they won't appear in memory until used. Apparently you could also experiment with compression settings: https://r-pkgs.org/data.html
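
For example (a sketch, with pdo as a stand-in object name): with LazyData: true in DESCRIPTION, data objects are only read into memory when first used, and usethis::use_data() lets you pick the compression:

    # Save a data object with xz compression instead of the default bzip2.
    usethis::use_data(pdo, overwrite = TRUE, compress = "xz")

    # Report the size and compression of every .rda file in data/.
    tools::checkRdaFiles("data/")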

If your plan is to store the data on GitHub then I imagine that will be more limiting than waiting for anything to load in R. If you're getting into that size then it might be worth thinking about external storage, downloading, and caching. rOpenSci has some packages that are smart about that.

andrew-edwards commented 1 year ago

Thanks Sean - we're thinking we might have to do that sort of thing as the data files get big.

Conceptual question - is there a way of forcing R to always download a data set into a given directory, one that would work for all users? Say there's some spatial dataset that someone might want to use, but that isn't changing over time. If they don't have it, we'd have a function that downloads it automatically, but then won't re-download it next time they use R. Is it maybe best just to force them to specify a path? It would be nice to have something automatic.
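
One possible pattern (a minimal sketch, not pacea's actual implementation; the function name, file name and URL are hypothetical) is to cache into the per-user directory that tools::R_user_dir() provides (R >= 4.0), which persists between sessions:

    get_spatial_data <- function() {
      cache_dir <- tools::R_user_dir("pacea", which = "cache")
      dest <- file.path(cache_dir, "some_spatial_dataset.rds")
      if (!file.exists(dest)) {
        dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
        # hypothetical URL; the download only happens the first time
        download.file("https://example.org/some_spatial_dataset.rds", dest, mode = "wb")
      }
      readRDS(dest)
    }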

andrew-edwards commented 1 year ago

Thanks - super useful. Do you know all this stuff? Or just better at googling than me? :)

andrew-edwards commented 1 year ago

Closing this (and opening a related one), as we've moved on from the original idea. Thanks Sean - we are indeed going with being dependent on sf - hard to avoid that really. So I created a cache function, but we're not sure if we'll use it or not. Am creating a new issue...