hytest-org / hytest

https://hytest-org.github.io/hytest/
R access for CONUS404 #72

Open amsnyder opened 2 years ago

amsnyder commented 2 years ago

Creating a thread to discuss R access to the CONUS404 data subset on S3+Caldera. The data are in Zarr format, which Python reads easily, but the R community is still working out how to read it. We have many R users who will want access to this dataset, and we would like to be able to provide guidance to them. Some solutions that have been considered:

  1. Asking R users to download their data subset into a NetCDF file, which can then be read into R. This is possible, but it is not ideal to ask R users to install and set up a Python environment just for data access.
  2. @jesse-ross has done some initial exploration of the stars package in R, based on this blog. This seems to work, but it required specific R package versions, so a Docker container might be needed if this is our recommended approach.
  3. Use reticulate to run python code that reads zarr data in R. This approach has not been explored by our team, but Lauren Koenig may have previously done some work on this.
  4. @rsignell-usgs has suggested looking into the latest NetCDF library, which can also read Zarr. He thinks that's what GDAL is using to read Zarr, and that most of the "R-reading-Zarr" demos he has seen are based on it.
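
For reference, the Python side of option 1 could look roughly like the sketch below. It is only an illustration, not a tested recipe for CONUS404: the function name `subset_zarr_to_netcdf` is hypothetical, it assumes `xarray` and `zarr` plus a NetCDF backend (and `s3fs` for S3 URLs) are installed, and it assumes the store has a dimension named `time`. The store URL, variable names, and `storage_options` in the usage comment are placeholders.

```python
from pathlib import Path


def subset_zarr_to_netcdf(store, variables, time_range, out_path, storage_options=None):
    """Open a Zarr store, keep a few variables and a time window, write NetCDF.

    Hypothetical helper for illustration only. Assumes the dataset has a
    dimension called "time"; adjust the selection for other layouts.
    """
    # Imported here so the helper can be defined even where xarray is absent.
    import xarray as xr

    # Opening reads metadata only; data are fetched when the subset is written.
    ds = xr.open_zarr(store, storage_options=storage_options)

    # Keep the requested variables and the (start, end) time window.
    subset = ds[list(variables)].sel(time=slice(*time_range))

    out_path = Path(out_path)
    subset.to_netcdf(out_path)
    return out_path


# Example call (placeholder URL, variable name, and access options):
# subset_zarr_to_netcdf(
#     "s3://<bucket>/<path-to-store>.zarr",
#     variables=["T2"],
#     time_range=("2000-01-01", "2000-01-31"),
#     out_path="conus404_subset.nc",
#     storage_options={"anon": True},
# )
```

The resulting file could then be read in R with any NetCDF-capable package, which is the step this option relies on.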

Jesse will be leading this exploration, and we can use this thread to discuss and document our learnings along the way.

jesse-ross commented 2 years ago

For the stars/GDAL approach (2), I think a docker image will definitely be needed at present, because development versions of several geospatial packages are required (the blog post linked above is missing some details that are in its canonical version here). The image code.usgs.gov:5001/jross/zarr-in-r:latest has the necessary versions.

amsnyder commented 2 years ago

Jesse, here are some example notebooks you could try to replicate: https://github.com/hytest-org/hytest/tree/dev/dataset_access

I would start with the explore notebook. These notebooks will likely be updated in the coming weeks with additional instructional material, but they will give a sense of what we want to provide to our users.

amsnyder commented 1 year ago

@jesse-ross - have you looked into RNetCDF at all? I am not familiar with it, but Dave B. mentioned it in this issue about updating the geoknife package to work with zarr.

jesse-ross commented 1 year ago

Looks interesting, thanks! It looks like it has gained Zarr support. I will look into it as well when I get to this work.