mjwoods / RNetCDF

Read and write netcdf format in R

Get data from a variable based on a range in associated coordinate variables #143

lhmarsden opened this issue 7 months ago

lhmarsden commented 7 months ago

Firstly, apologies if what I am asking for is already possible...

When loading in data from a NetCDF file, it is often useful to be able to select only a subset of the data. Using RNetCDF, I can do this:

library(RNetCDF)

netcdf_file <- 'https://opendap1.nodc.no/opendap/physics/point/cruise/nansen_legacy-single_profile/NMDC_Nansen-Legacy_PR_CT_58US_2021708/CTD_station_P1_NLEG01-1_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc'
data <- open.nc(netcdf_file)

# read 3 values of TEMP, starting at the 5th element
var.get.nc(data, "TEMP", c(5), c(3))

This gives me 3 values from the TEMP variable, starting with the 5th value in the array.

However, often I want to be able to extract data based on some coordinate range. For example, working with the data shown above, I might want to retrieve the data between 10 and 20 metres depth.

In Python's xarray, it is possible to select a subset of the whole xarray object using sel: https://docs.xarray.dev/en/latest/generated/xarray.DataArray.sel.html

This has a number of useful features, and even allows you to select the nearest data to the coordinate that you have provided as an argument.

I think something similar would be very helpful in RNetCDF, and I haven't seen how to do this in the documentation - though I might have missed something.

I am currently writing a tutorial series showing people how to work with NetCDF files using R, and I would like to use RNetCDF throughout the course. This is a capability I would like to include.

For reference, here is my equivalent course to teach people how to work with NetCDF files in Python, predominantly using xarray: https://lhmarsden.github.io/NetCDF_in_Python_from_beginner_to_pro/intro.html

A course in R will follow the same structure with most of the same chapters.

mjwoods commented 7 months ago

Hi @lhmarsden , currently there is no easy way to specify array subsets using data values, because the start and count arguments of var.get.nc only support index numbers.

If a user knows the name of the coordinate variables (e.g. from print.nc), they can read the coordinate values and calculate the index range they need.

Otherwise, they can use var.inq.nc to get the dimids, then use dim.inq.nc to find the dimension names, then use var.get.nc to read the values of the matching coordinate variables.
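A minimal sketch of that manual workflow, assuming the file from the original post and a target range of 10 to 20 dbar along the PRES coordinate (the coordinate is assumed monotonic, so the matching indices are contiguous):

library(RNetCDF)

nc <- open.nc(netcdf_file)
pres <- var.get.nc(nc, "PRES")                # read the coordinate values
idx <- which(pres >= 10 & pres <= 20)         # indices inside the range
temp_sub <- var.get.nc(nc, "TEMP",
                       start = min(idx),
                       count = max(idx) - min(idx) + 1)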

As you can see, RNetCDF is still quite a low-level interface to netcdf. It would be possible to add your requested feature, although it may be more suitable for a higher level package like ncmeta, which extends RNetCDF. @mdsumner , what do you think?

mdsumner commented 7 months ago

there's only one dimension in this file, PRES

I think you could read PRES, then use it to work out the start and count ranges for TEMP, but you might as well read both in whole and do your own filtering in memory.

TEMP <- var.get.nc(data, "TEMP")
PRES <- var.get.nc(data, "PRES")
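Filtering in memory is then plain vector subsetting; for example, for a 10 to 20 dbar range (a sketch, assuming PRES is the only dimension of TEMP, as in this file):

TEMP_sub <- TEMP[PRES >= 10 & PRES <= 20]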

There are higher-level tools in tidync, which uses ncmeta (tidync is similar to, but nowhere near as powerful as, xarray). It's possible to explore what will be returned (via the start/min/max values in the printout) from hyper_filter, but note we only have one grid in this file - when there are more you need activate().

library(tidync)
library(dplyr)
nc <- tidync(netcdf_file)
## here we stop being lazy and pull data
hyper_tibble(hyper_filter(nc, PRES = between(PRES, 15, 30)), select_var = c("PRES", "TEMP"))

... attempting remote connection

Connection succeeded.
# A tibble: 16 × 2
    PRES  TEMP
   <dbl> <dbl>
 1    15  3.83
 2    16  3.81
 3    17  3.79
 4    18  3.79
 5    19  3.81
 6    20  3.85
 7    21  3.86
 8    22  3.83
 9    23  3.82
10    24  3.82
11    25  3.81
12    26  3.81
13    27  3.79
14    28  3.71
15    29  3.70
16    30  3.64
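For reference, the lazy exploration mentioned above is hyper_filter() on its own; printing the returned object shows the start/count that would be read for each dimension, without pulling any data:

hyper_filter(nc, PRES = between(PRES, 15, 30))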

That's quite convenient, but it doesn't take long until you run into limitations (compared to xarray). Also, it might be plagued by library or build variations on different operating systems ... it could do with a refactor, but frankly I'm probably more inclined to use xarray via reticulate these days :)

HTH

mjwoods commented 7 months ago

Thanks @mdsumner.

I have often considered adding higher-level capabilities to RNetCDF, but given limited time, my main focus has been on improving support for features of NetCDF datasets (e.g. nested user-defined types) while maintaining backward compatibility and portability.

I also doubt that RNetCDF is the right starting point for a user-friendly, high-level interface to NetCDF. Although most of the code was rewritten for version 2.0, the core API of this package is almost 20 years old this year! I suspect that it would be easier to provide a convenient API from a more modern package, running RNetCDF under the hood. The stars package provides an example of what is possible with additional layers on top of RNetCDF, with features including lazy evaluation across multiple datasets - see https://r-spatial.github.io/stars/articles/stars2.html .

Part of the challenge is that NetCDF is so flexible. It provides a variety of features, with few restrictions on how they are used. RNetCDF aims to support the low-level capabilities, no matter how they are used or abused. A higher-level API will often need to assume that NetCDF conventions have been followed, but there are several different conventions, they continue to evolve, and they are sometimes misused or misinterpreted. Where RNetCDF has strayed into supporting conventions, it has become messy - as in the numerous options for handling missing values. For this reason, a separation of concerns may be beneficial, with one or more packages implementing the NetCDF conventions, all leveraging the lower-level (and hopefully more stable) RNetCDF interface.

I would appreciate your thoughts on this issue. Should RNetCDF be expanded, or should it be enhanced by other packages?

mdsumner commented 7 months ago

I don't think it should expand, matching the API is good.

Big picture, it's a shame there are a few downstream high-level APIs, and a shame that work wasn't done at a lower level. But that's where we're at. xarray is still very young and very Python-focused so far, but I think its impact is here to stay and will deepen. stars is good, but it's not fleshed out enough for the RNetCDF-as-engine workflow; perhaps effort there would be most effective.

mjwoods commented 7 months ago

Food for thought Michael, thanks again.

mjwoods commented 7 months ago

I still haven't decided whether or not to add the requested feature. I think the changes involved are relatively simple, and they should apply to most netcdf datasets, although they would only be meaningful for numeric data types. One problem is that I am unlikely to work on the changes before @lhmarsden needs to run the tutorial series. If someone submits a PR with suitable changes, I would consider including them in the package.
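As a purely hypothetical sketch (var_get_range is not an RNetCDF function), the requested feature could be a thin wrapper that converts a coordinate-value range into start and count along one dimension:

# Hypothetical helper, not part of RNetCDF
var_get_range <- function(nc, variable, coord, range) {
  vals <- var.get.nc(nc, coord)                        # coordinate values
  idx <- which(vals >= range[1] & vals <= range[2])    # indices in range
  if (length(idx) == 0) stop("no values in the requested range")
  var.get.nc(nc, variable,
             start = min(idx), count = max(idx) - min(idx) + 1)
}

# e.g. var_get_range(data, "TEMP", "PRES", c(10, 20))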

lhmarsden commented 7 months ago

It was interesting to read through the discussion above and to learn about how these things develop. Thanks!

You may be interested in this video tutorial that I have now posted to YouTube. Use it however you like. In it I explain how to open a NetCDF file using RNetCDF, understand the contents and extract the data and metadata. I also discuss the CF and ACDD conventions.

https://youtu.be/Xer1XBm3sns?si=VfrDZIrkrzuKcy4F

This will be the first in a series of videos working with NetCDF files in R, including

  1. Plotting the data
  2. Converting to Excel/CSV
  3. Creating a NetCDF file.

Each video is accompanied by a chapter in this Jupyter book. You can get a sneak peek at the code and explanations for the next few chapters here: https://lhmarsden.github.io/NetCDF_in_R_from_beginner_to_pro/intro.html

pvanlaake commented 5 months ago

How about this:

> library(ncdfCF)

> netcdf_file <- 'https://opendap1.nodc.no/opendap/physics/point/cruise/nansen_legacy-single_profile/NMDC_Nansen-Legacy_PR_CT_58US_2021708/CTD_station_P1_NLEG01-1_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc'

> data <- open_ncdf(netcdf_file)

# Dimension PRES is in dbar. Assuming a dbar corresponds approximately to a meter of depth
> subset(data[["TEMP"]], list(PRES = c(10, 20)), rightmost.closed = TRUE)
   10    11    12    13    14    15    16    17    18    19    20 
3.738 3.783 3.833 3.838 3.837 3.830 3.806 3.792 3.790 3.814 3.846 
attr(,"axis")
PRES 
 "Z" 

Note that PRES = c(10, 20) refers to the pressure levels in the dimension, not the index positions along the dimension data (even though they are identical in this specific case).

ncdfCF is a new package available on CRAN to read NetCDF resources using RNetCDF and apply the CF Metadata Conventions (a few elements, to begin with). The whole philosophy of the package is to provide CF intelligence and that "user-friendly, high-level interface to NetCDF" that @mjwoods mentioned above, with familiar R constructs for selection, for instance. It builds on RNetCDF, exactly because it is an excellent interface to the API.

Note that ncdfCF currently lacks more features than it supports: no writing, no groups (netcdf4). But the plan is to develop the package vigorously over the coming months to make it a viable alternative to RNetCDF and ncdf4 for accessing NetCDF data, with expanding support for the CF conventions. One item on my TODO list: select variables using the standard_name attribute, to reference another open issue.

@mjwoods Also check out the CFtime package, with full support for all CF-defined calendars (climatology calendars coming soonish). ncmeta is already using CFtime as an extended attribute. Could it be an alternative to your use of udunits' partial calendar conversion support?

lhmarsden commented 5 months ago

This looks very interesting, @pvanlaake. I look forward to seeing how the project progresses!

mjwoods commented 5 months ago

Many thanks to you both, @lhmarsden and @pvanlaake . It is very satisfying to see the training materials and advanced packages building on RNetCDF.

mjwoods commented 5 months ago

I think RNetCDF contains the functionality needed by @pvanlaake to create ncdfCF, but if you discover gaps or inefficiencies, please let me know.

mjwoods commented 5 months ago

On the matter of time conversions, I have mainly continued supporting udunits time conversions for backward compatibility. I agree that this support is made redundant by CFtime and could be removed in a future major release of RNetCDF (e.g. 3.x).

pvanlaake commented 5 months ago

Hi Milton, package ncdfCF will use whatever netcdf functionality is exposed to R through RNetCDF. I'll reach out if I find any issues. Working on reading groups and UDTs now. Writing support will follow soon.

On time format conversions, CFtime can produce POSIXct and Dates from standard, gregorian and proleptic_gregorian calendars, just like RNetCDF does with udunits. Calendar conversions in the way that PCICt does are not supported and neither will they be in the future; that is just bad practice if you ask me. The CFtime package has a vignette explaining this in some detail.

Best, Patrick

dblodgett-usgs commented 5 months ago

Note that much of what is needed is baked into stars internals. I'm working on something akin to xarray (NetCDF and ZARR) for R over here: https://github.com/DOI-USGS/rnz . It will very likely end up containing utility functions like those being requested here, but maybe there's a path to adopt some of what you have in ncdfCF @pvanlaake -- glad to see that's happening.

pvanlaake commented 5 months ago

Hi Dave @dblodgett-usgs, the focus of ncdfCF is two-fold. The first is to provide easy access to NetCDF data using familiar R vocabulary, including basics like dim(), dimnames(), and [ and subset() filtering to read slabs of data from file. The second is to apply stated CF Conventions to the data. That latter part is still quite marginal but growing. Once the package is more complete and stable, it will be usable as a stand-alone tool for users who want to analyse NetCDF data with conventions applied, or as a data-access engine for other packages.
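A brief sketch of that vocabulary, reusing the file and the functions shown earlier in this thread (illustrative only; see the ncdfCF documentation for the current interface):

library(ncdfCF)

ds <- open_ncdf(netcdf_file)
temp <- ds[["TEMP"]]
dim(temp)                              # axis lengths
dimnames(temp)                         # coordinate values along each axis
subset(temp, list(PRES = c(10, 20)))   # slab selected by coordinate values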

Similar easy access to ZARR would be great, looking forward to seeing that coming to fruition.

dblodgett-usgs commented 5 months ago

Sounds great @pvanlaake ! Have you seen https://hypertidy.github.io/ncmeta/reference/index.html ? In particular, we put some handling for finding coordinate variables and grid mappings in ncmeta.
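For example (function name as listed in the linked ncmeta reference index; a sketch assuming the file from the original post):

library(ncmeta)

nc_coord_var(netcdf_file, "TEMP")   # report the coordinate variable(s) for TEMP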

My hope is that rnz can provide an API that is functionally equivalent to RNetCDF (essentially passing through to RNetCDF) but also allows access to ZARR stores using pizzarr, which is nearing an initial release state.

mdsumner commented 5 months ago

Append "#mode=zarr" for using RNetCDF

https://gist.github.com/mdsumner/492d2a98bffc6de5974a96f50a0b75f2?permalink_comment_id=5071910#gistcomment-5071910
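A sketch of the idea (untested here; the exact mode string and Zarr support depend on how the local netcdf-c library was built):

library(RNetCDF)

# zarr_source is a placeholder path or URL to a Zarr store
z <- open.nc(paste0(zarr_source, "#mode=zarr"))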

I don't know if that works for real remote sources, like the dodsC/ vs fileServer/ thing 🤔