lhmarsden opened 7 months ago
Hi @lhmarsden, currently there is no easy way to specify array subsets using data values, because the `start` and `count` arguments of `var.get.nc` only support index numbers.
If a user knows the names of the coordinate variables (e.g. from `print.nc`), they can read the coordinate values and calculate the index range they need. Otherwise, they can use `var.inq.nc` to get the dimids, then use `dim.inq.nc` to find the dimension names, then use `var.get.nc` to read the dimension values.
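The lookup described above might be sketched as follows. This is a non-authoritative sketch, not part of the original discussion: it assumes an open dataset `nc`, a variable named `"TEMP"`, that each dimension has a coordinate variable of the same name, and an illustrative 10-20 value range.

```r
library(RNetCDF)

info <- var.inq.nc(nc, "TEMP")           # variable metadata, including dimids
for (dimid in info$dimids) {
  dname  <- dim.inq.nc(nc, dimid)$name   # dimension name
  coords <- var.get.nc(nc, dname)        # coordinate values (assuming a same-named variable exists)
  idx    <- which(coords >= 10 & coords <= 20)
  # start/count to pass to var.get.nc along this dimension:
  start <- min(idx)
  count <- length(idx)
}
```

With `start` and `count` in hand, a subsequent `var.get.nc(nc, "TEMP", start = start, count = count)` reads only the selected slab.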
As you can see, RNetCDF is still quite a low-level interface to netcdf. It would be possible to add your requested feature, although it may be more suitable for a higher level package like ncmeta
, which extends RNetCDF. @mdsumner , what do you think?
There's only one dimension in this file, `PRES`. You could read `PRES` and use it to work out the `start` and `count` ranges for `TEMP`, but you might as well read both variables in whole and do your own filtering in memory:
```r
TEMP <- var.get.nc(data, "TEMP")
PRES <- var.get.nc(data, "PRES")
```
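The in-memory filtering might look like this. This is a sketch added for illustration, assuming `TEMP` and `PRES` are numeric vectors of equal length as read above; the 15-30 dbar range is illustrative.

```r
# Logical mask over the shared PRES dimension
keep <- PRES >= 15 & PRES <= 30

# Apply the same mask to both vectors
PRES_sub <- PRES[keep]
TEMP_sub <- TEMP[keep]
```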
There are higher-level tools in tidync, which uses ncmeta (tidync is similar to, but nowhere near as powerful as, xarray). It's possible to explore what will be returned (via start/min/max values in the printout) from `hyper_filter`, but note we only have one grid in this file - when there are more you need `activate()`.
```r
library(tidync)
library(dplyr)
nc <- tidync(netcdf_file)
## here we stop being lazy and pull data
hyper_tibble(hyper_filter(nc, PRES = between(PRES, 15, 30)), select_var = c("PRES", "TEMP"))
```

```
... attempting remote connection
Connection succeeded.
# A tibble: 16 × 2
    PRES  TEMP
   <dbl> <dbl>
 1    15  3.83
 2    16  3.81
 3    17  3.79
 4    18  3.79
 5    19  3.81
 6    20  3.85
 7    21  3.86
 8    22  3.83
 9    23  3.82
10    24  3.82
11    25  3.81
12    26  3.81
13    27  3.79
14    28  3.71
15    29  3.70
16    30  3.64
```
That's quite convenient, but it doesn't take long before you run into limitations (compared to xarray). It might also be plagued by library or build variations on different OSes ... it could do with a refactor, but frankly I'm probably more inclined to use xarray via reticulate these days :)
HTH
Thanks @mdsumner.
I have often considered adding higher-level capabilities to RNetCDF, but given limited time, my main focus has been on improving support for features of NetCDF datasets (e.g. nested user-defined types) while maintaining backward compatibility and portability.
I also doubt that RNetCDF is the right starting point for a user-friendly, high-level interface to NetCDF. Although most of the code was rewritten for version 2.0, the core API of this package is almost 20 years old this year! I suspect that it would be easier to provide a convenient API from a more modern package, running RNetCDF under the hood. The `stars` package provides an example of what is possible with additional layers on top of RNetCDF, with features including lazy evaluation across multiple datasets - see https://r-spatial.github.io/stars/articles/stars2.html .
Part of the challenge is that NetCDF is so flexible. It provides a variety of features, with few restrictions on how they are used. RNetCDF aims to support the low-level capabilities, no matter how they are used or abused. A higher-level API will often need to assume that NetCDF conventions have been followed, but there are several different conventions, they continue to evolve, and they are sometimes misused or misinterpreted. Where RNetCDF has strayed into supporting conventions, it has become messy - as in the numerous options for handling missing values. For this reason, a separation of concerns may be beneficial, with one or more packages implementing the NetCDF conventions, all leveraging the lower-level (and hopefully more stable) RNetCDF interface.
I would appreciate your thoughts on this issue. Should RNetCDF be expanded, or should it be enhanced by other packages?
I don't think it should expand, matching the API is good.
Big picture, it's a shame there are a few downstream high-level APIs, and a shame that work wasn't done at a lower level. But that's where we're at. xarray is still very young and very Python-focused so far, but I think its impact is here to stay and will deepen. stars is good, but it's not fleshed out enough for the RNetCDF-as-engine workflow; perhaps effort there would be most effective.
Food for thought Michael, thanks again.
I still haven't decided whether or not to add the requested feature. I think the changes involved are relatively simple, and they should apply to most netcdf datasets, although they would only be meaningful for numeric data types. One problem is that I am unlikely to work on the changes before @lhmarsden needs to run the tutorial series. If someone submits a PR with suitable changes, I would consider including them in the package.
It was interesting to read through the discussion above and to learn about how these things develop. Thanks!
You may be interested in this video tutorial that I have now posted to YouTube. Use it however you like. In it I explain how to open a NetCDF file using RNetCDF, understand the contents and extract the data and metadata. I also discuss the CF and ACDD conventions.
https://youtu.be/Xer1XBm3sns?si=VfrDZIrkrzuKcy4F
This will be the first in a series of videos on working with NetCDF files in R.
Each video is accompanied by a chapter in this Jupyter book. You can get a sneak peek at the code and explanations for the next few chapters here: https://lhmarsden.github.io/NetCDF_in_R_from_beginner_to_pro/intro.html
How about this:
```r
> library(ncdfCF)
> netcdf_file <- 'https://opendap1.nodc.no/opendap/physics/point/cruise/nansen_legacy-single_profile/NMDC_Nansen-Legacy_PR_CT_58US_2021708/CTD_station_P1_NLEG01-1_-_Nansen_Legacy_Cruise_-_2021_Joint_Cruise_2-1.nc'
> data <- open_ncdf(netcdf_file)
# Dimension PRES is in dbar. Assuming a dbar corresponds approximately to a meter of depth
> subset(data[["TEMP"]], list(PRES = c(10, 20)), rightmost.closed = TRUE)
   10    11    12    13    14    15    16    17    18    19    20
3.738 3.783 3.833 3.838 3.837 3.830 3.806 3.792 3.790 3.814 3.846
attr(,"axis")
PRES
 "Z"
```
Note that `PRES = c(10, 20)` refers to the pressure levels in the dimension, not the index positions along the dimension data (even though they are identical in this specific case).
`ncdfCF` is a new package available on CRAN to read NetCDF resources using `RNetCDF` and applying CF Metadata Conventions (a few elements, to begin with). The whole philosophy of the package is to provide CF intelligence and that "user-friendly, high-level interface to NetCDF" that @mjwoods mentioned above, with familiar R constructs for selection, for instance. It builds on `RNetCDF` exactly because it is an excellent interface to the API.
Note that `ncdfCF` currently lacks more features than it supports: no writing, no groups (netcdf4). But the plan is to develop the package vigorously over the coming months to make it a viable alternative to `RNetCDF` and `ncdf4` for accessing NetCDF data, with expanding support for CF conventions. One item on my TODO list: select variables using the `standard_name` attribute, to reference another open issue.
@mjwoods Also check out package `CFtime`, with full support for all CF-defined calendars (climatology calendars coming soonish). `ncmeta` is already using `CFtime` as an extended attribute. Perhaps an alternative to your use of `udunits` for partial calendar conversion support?
This looks very interesting, @pvanlaake. I look forward to seeing how the project progresses!
Many thanks to you both, @lhmarsden and @pvanlaake. It is very satisfying to see the training materials and advanced packages building on RNetCDF.
I think RNetCDF contains the functionality needed by @pvanlaake to create `ncdfCF`, but if you discover gaps or inefficiencies, please let me know.
On the matter of time conversions, I have mainly continued supporting `udunits` time conversions for backward compatibility. I agree that this support is made redundant by `CFtime` and could be removed in a future major release of RNetCDF (e.g. 3.x).
Hi Milton, package `ncdfCF` will use whatever netcdf functionality is exposed to R through `RNetCDF`. I'll reach out if I find any issues. Working on reading groups and UDTs now. Writing support will follow soon.
On time format conversions, `CFtime` can produce `POSIXct` and `Date`s from `standard`, `gregorian` and `proleptic_gregorian` calendars, just like `RNetCDF` does with `udunits`. Calendar conversions in the way that `PCICt` does them are not supported, and neither will they be in the future; that is just bad practice if you ask me. The `CFtime` package has a vignette explaining this in some detail.
Best, Patrick
Note that much of what is needed is baked into `stars` internals. I'm working on something akin to xarray (NetCDF and Zarr) for R over here: https://github.com/DOI-USGS/rnz . It will very likely end up containing utility functions like those being requested here, but maybe there's a path to adopt some of what you have in ncdfCF, @pvanlaake - glad to see that's happening.
Hi Dave @dblodgett-usgs, the focus of `ncdfCF` is two-fold. First is to provide easy access to NetCDF data using familiar R vocabulary, including basics like `dim()`, `dimnames()`, and `[` and `subset()` filtering to read slabs of data from file. Second is to apply stated CF Conventions to the data. That latter part is still quite marginal but growing. Once the package is more complete and stable, it will be usable as a stand-alone tool for users who want to analyse NetCDF data with conventions applied, or as a data access engine for other packages.
Similar easy access to ZARR would be great, looking forward to seeing that coming to fruition.
Sounds great @pvanlaake! Have you seen https://hypertidy.github.io/ncmeta/reference/index.html ? In particular, we put some handling for finding coordinate variables and grid mappings in `ncmeta`.
My hope is that `rnz` can provide an API that is functionally equivalent to RNetCDF (essentially passing through to RNetCDF) but also allows access to Zarr stores using pizzarr, which is nearing an initial release state.
Append "#mode=zarr" to the URL to use RNetCDF.
I don't know if that works for real remote sources, like the dodsC/ vs fileServer/ thing 🤔
Firstly, apologies if what I am asking for is already possible...
When loading in data from a NetCDF file, it is often useful to be able to select only a subset of the data. Using RNetCDF, I can do this:
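The original snippet is not reproduced above. A call along the following lines, sketched here from the description in the next sentence, would produce the stated result; it assumes the file has already been opened with `open.nc`.

```r
library(RNetCDF)

# Open the dataset (path or URL, as elsewhere in this thread)
data <- open.nc(netcdf_file)

# Read 3 values of TEMP, starting with the 5th element of the array
TEMP_sub <- var.get.nc(data, "TEMP", start = 5, count = 3)
```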
This gives me 3 values from the TEMP variable, starting with the 5th value in the array.
However, often I want to be able to extract data based on some coordinate range. For example, working with the data shown above, I might want to retrieve the data between 10 and 20 metres depth.
In Python's xarray, it is possible to select a subset of the whole xarray object using `sel`: https://docs.xarray.dev/en/latest/generated/xarray.DataArray.sel.html
This has a number of useful features, and even allows you to select the nearest data to the coordinate that you have provided as an argument.
I think something similar would be very helpful in RNetCDF, and I haven't seen how to do this in the documentation - though I might have missed something.
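For what it's worth, the "nearest" behaviour of xarray's `sel` can be emulated in base R. This is a self-contained illustrative sketch, not an existing RNetCDF feature; the vector values are made up.

```r
# Index of the coordinate value closest to the requested value
nearest_index <- function(coords, value) which.min(abs(coords - value))

PRES <- c(5, 10, 15, 20, 25)
i <- nearest_index(PRES, 12)   # -> 2, because PRES[2] == 10 is closest to 12
```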
I am currently writing a tutorial series showing people how to work with NetCDF files using R, and I would like to use RNetCDF throughout the course. Coordinate-based selection is something that I would like to include.
For reference, here is my equivalent course to teach people how to work with NetCDF files in Python, predominantly using xarray: https://lhmarsden.github.io/NetCDF_in_Python_from_beginner_to_pro/intro.html
A course in R will follow the same structure with most of the same chapters.