Closed mkoohafkan closed 5 years ago
It should already work, it does in rgdal too. On windows and macos it's built-in to the CRAN binaries, on Linux it depends on your system when GDAL is installed
It's my understanding that netCDF files are a special case of HDF5 and are created using the HDF5 libraries.
Only netCDF4 files, FYI ;)
@mdsumner thanks for the info! I see now that there is R support for accessing HDF5 files from rgdal
. I'm having a bit of a hard time working with HDF5 files where the datasets are nested in groups. If I know the (flattened) index of the dataset beforehand, read_stars
works great with the sub
argument:
read_stars("path/to/file.hdf" sub = 44)
The problem is that I'm really struggling to get dataset listings using rgdal
and/or gdalUtils
when the datasets are located in nested groups. gdalUtils::get_subdatasets
fails and I can't figure out how to access the datasets using the lower-level rgdal
functions GDAL.open
and readGDAL
. Can you provide any guidance on how to list the datasets from an HDF5 file using stars
or rgdal
, or decribe how stars
flattens the dataset indices from an HDF5 file that contains nested groups?
Both rgdal
and stars
will only read hdf5 files through the GDAL drivers, so I'd advice to start reading here.
@mkoohafkan I feel your pain! read_ncdf
and ncmeta::nc_vars
on your file may be helpful, can you try, and share a file too? I don't know one clear way to get at this info, and probably needs particular efforts for stars. ( I use rhdf5 for files with compound types in groups, it's very low level and bespoke - I don't have much experience overall).
OK, I think I have a working method now---the key is to use sf::gdal_utils
which lists the dataset names and the flattened numeric index that read_stars
needs. For some reason I had trouble doing the equivalent using rgdal::GDALinfo
, gdalUtils::gdalinfo
, or even gdalinfo
directly. @mdsumner, I have a couple of example HEC-RAS datasets here that we can use for examples:
library(sf)
library(stars)
hdfile = "SampleQuasiUnsteady.hdf"
info = gdal_utils("info", hdfile) # returns a long character vector, same as printed to console
info.lines = strsplit(info, "\n")[[1]] # split by line
sds = info.lines[grep("SUBDATASET.*NAME", info.lines)] # extract the subdataset labels
sds.index = grep("Flow", sds) # get the index of dataset "Flow"
read_stars(hdfile, sub = sds.index)
The output of sf::gdal_utils
is a bit messy but the required string manipulation and matching is straightforward. I still need to do some sleuthing to see if stars
/sf
/rgdal
can handle compound datasets (which HEC-RAS uses to store some metadata on the numerical grid cells/cross sections), but this is a major milestone (for me at least).
It would probably be nice to add an HDF5 example to the stars
documentation.
The Sentinel 5P example uses the HDF5 GDAL driver; I believe this file is intended to be netcdf, and there's sth wrong either with the format or with the GDAL driver, that's why this isn't advertised.
@edzer if the Sentinel 5P example data is a netCDF4 format that makes sense, since as @adrfantini pointed out netCDF4 files are actually HDF5 files. The HDF5 driver seems to work just fine, provided you can point to the specific subdataset you want (otherwise it tries to aggregate datasets in some way and fails).
Reading the documentation for gdalinfo
a bit more carefully, it looks like there is no support for reading compound datasets or even scalar string datasets. This is a fairly large limitation for working with HEC-RAS data specifically, but outside the scope of stars
, sf
and rgdal
. I doubt there is intention for gdal
to provide more robust support for HDF5 files, but I might try to contact the developers just in case. What I do find interesting is that gdalinfo
can read string attributes just fine...
Thanks everyone for your assistance working through this.
(I didn't realize HEC RAS was HDF5 ...)
That's right about compound types and scalars, pretty sure you can't get numeric scalars either. I asked a while ago in GDAL about a particular set of compound type but I think it's just so open-ended it's better to use more generic tools: https://trac.osgeo.org/gdal/ticket/6551
I've used rhdf5 to great effect for the L3bin ocean colour, they are really just a pair of tables - presented as a ragged array, a structural row-index in the compound types. If you explore your files from a stars perspective perhaps that's a good way to go?
ncdf4
can open your file and does list detailed contents, so maybe that's enough to get going?
I just also tried RNetCDF which doesn't see any data in your file, but might in version 2.0 if it can delve into the groups.
There's also MDAL, which promises to be a shared library rather than a QGIS-specific tool and is more aligned to how stars is conceived.
@mdsumner thanks for the info. stars
is great for pulling the bulk of the data---the only hiccup is the row/column identifiers which are stored separately in compound or string scalars. These identifiers are typically time stamps and cell IDs or cross section IDs , with the XY coordinates of these features stored in yet another table. It's not really a problem to use multiple packages to get what I need---the biggest issue I had was that hdf5r
is really slow. I'll check out ncdf4
and see if that is faster. Using stars
to get the data matrices is incredibly fast, so I'm already better off than I was before.
I opened an issue at OSGeo/gdal/issues/1348 to start a conversation about the HDF5 drivers and some level of support for scalars and compound dataset support. I agree the problem is open-ended, and I don't want to encourage developing specific support for specific data structures. Some general utilities for working with compound datasets would be helpful though---for example, GDAL support for extracting a single column from a compound table would allow me to sequentially pull the identifiers I need and use them in the stars
framework.
OK. closing here?
Yes, thank you @edzer.
Are there plans for adding HDF5 support to
stars
? HDF5 is commonly for used by models to record spatiotemporal data (for example, CoRTAD, EOS, and HEC-RAS use HDF5 format). It's my understanding that netCDF files are a special case of HDF5 and are created using the HDF5 libraries. Furthermore, the recent release ofhdfql
has out-of-the-box support for reading/writing HDF5 from R.There seems to be some support for HDF5 in GDAL but I don't think it has made its way into rgdal. There is currently the package
hdf5r
available on CRAN, but there some performance issues. I started developinghdfqlr
, an R front-end for HDFql, but I have not had time to work on it (I moved to a new job where i no longer was working regularly with HDF5 files). HDF5 support instars
would likely supercede both of these packages.