r-spatial / stars

Spatiotemporal Arrays, Raster and Vector Data Cubes
https://r-spatial.github.io/stars/
Apache License 2.0
563 stars 94 forks source link

HDF5 support? #137

Closed mkoohafkan closed 5 years ago

mkoohafkan commented 5 years ago

Are there plans for adding HDF5 support to stars? HDF5 is commonly for used by models to record spatiotemporal data (for example, CoRTAD, EOS, and HEC-RAS use HDF5 format). It's my understanding that netCDF files are a special case of HDF5 and are created using the HDF5 libraries. Furthermore, the recent release of hdfql has out-of-the-box support for reading/writing HDF5 from R.

There seems to be some support for HDF5 in GDAL but I don't think it has made its way into rgdal. There is currently the package hdf5r available on CRAN, but there some performance issues. I started developing hdfqlr, an R front-end for HDFql, but I have not had time to work on it (I moved to a new job where i no longer was working regularly with HDF5 files). HDF5 support in stars would likely supercede both of these packages.

mdsumner commented 5 years ago

It should already work, it does in rgdal too. On windows and macos it's built-in to the CRAN binaries, on Linux it depends on your system when GDAL is installed

adrfantini commented 5 years ago

It's my understanding that netCDF files are a special case of HDF5 and are created using the HDF5 libraries.

Only netCDF4 files, FYI ;)

mkoohafkan commented 5 years ago

@mdsumner thanks for the info! I see now that there is R support for accessing HDF5 files from rgdal. I'm having a bit of a hard time working with HDF5 files where the datasets are nested in groups. If I know the (flattened) index of the dataset beforehand, read_stars works great with the sub argument:

read_stars("path/to/file.hdf" sub = 44)

The problem is that I'm really struggling to get dataset listings using rgdal and/or gdalUtils when the datasets are located in nested groups. gdalUtils::get_subdatasets fails and I can't figure out how to access the datasets using the lower-level rgdal functions GDAL.open and readGDAL. Can you provide any guidance on how to list the datasets from an HDF5 file using stars or rgdal, or decribe how stars flattens the dataset indices from an HDF5 file that contains nested groups?

edzer commented 5 years ago

Both rgdal and stars will only read hdf5 files through the GDAL drivers, so I'd advice to start reading here.

mdsumner commented 5 years ago

@mkoohafkan I feel your pain! read_ncdf and ncmeta::nc_vars on your file may be helpful, can you try, and share a file too? I don't know one clear way to get at this info, and probably needs particular efforts for stars. ( I use rhdf5 for files with compound types in groups, it's very low level and bespoke - I don't have much experience overall).

mkoohafkan commented 5 years ago

OK, I think I have a working method now---the key is to use sf::gdal_utils which lists the dataset names and the flattened numeric index that read_stars needs. For some reason I had trouble doing the equivalent using rgdal::GDALinfo, gdalUtils::gdalinfo, or even gdalinfo directly. @mdsumner, I have a couple of example HEC-RAS datasets here that we can use for examples:

library(sf)
library(stars)
hdfile = "SampleQuasiUnsteady.hdf"
info = gdal_utils("info", hdfile) # returns a long character vector, same as printed to console
info.lines = strsplit(info, "\n")[[1]] # split by line
sds = info.lines[grep("SUBDATASET.*NAME", info.lines)] # extract the subdataset labels
sds.index = grep("Flow", sds) # get the index of dataset "Flow"
read_stars(hdfile, sub = sds.index)

The output of sf::gdal_utils is a bit messy but the required string manipulation and matching is straightforward. I still need to do some sleuthing to see if stars/sf/rgdal can handle compound datasets (which HEC-RAS uses to store some metadata on the numerical grid cells/cross sections), but this is a major milestone (for me at least).

It would probably be nice to add an HDF5 example to the stars documentation.

edzer commented 5 years ago

The Sentinel 5P example uses the HDF5 GDAL driver; I believe this file is intended to be netcdf, and there's sth wrong either with the format or with the GDAL driver, that's why this isn't advertised.

mkoohafkan commented 5 years ago

@edzer if the Sentinel 5P example data is a netCDF4 format that makes sense, since as @adrfantini pointed out netCDF4 files are actually HDF5 files. The HDF5 driver seems to work just fine, provided you can point to the specific subdataset you want (otherwise it tries to aggregate datasets in some way and fails).

Reading the documentation for gdalinfo a bit more carefully, it looks like there is no support for reading compound datasets or even scalar string datasets. This is a fairly large limitation for working with HEC-RAS data specifically, but outside the scope of stars, sf and rgdal. I doubt there is intention for gdal to provide more robust support for HDF5 files, but I might try to contact the developers just in case. What I do find interesting is that gdalinfo can read string attributes just fine...

Thanks everyone for your assistance working through this.

mdsumner commented 5 years ago

(I didn't realize HEC RAS was HDF5 ...)

That's right about compound types and scalars, pretty sure you can't get numeric scalars either. I asked a while ago in GDAL about a particular set of compound type but I think it's just so open-ended it's better to use more generic tools: https://trac.osgeo.org/gdal/ticket/6551

I've used rhdf5 to great effect for the L3bin ocean colour, they are really just a pair of tables - presented as a ragged array, a structural row-index in the compound types. If you explore your files from a stars perspective perhaps that's a good way to go?

ncdf4 can open your file and does list detailed contents, so maybe that's enough to get going?

I just also tried RNetCDF which doesn't see any data in your file, but might in version 2.0 if it can delve into the groups.

There's also MDAL, which promises to be a shared library rather than a QGIS-specific tool and is more aligned to how stars is conceived.

mkoohafkan commented 5 years ago

@mdsumner thanks for the info. stars is great for pulling the bulk of the data---the only hiccup is the row/column identifiers which are stored separately in compound or string scalars. These identifiers are typically time stamps and cell IDs or cross section IDs , with the XY coordinates of these features stored in yet another table. It's not really a problem to use multiple packages to get what I need---the biggest issue I had was that hdf5r is really slow. I'll check out ncdf4 and see if that is faster. Using stars to get the data matrices is incredibly fast, so I'm already better off than I was before.

I opened an issue at OSGeo/gdal/issues/1348 to start a conversation about the HDF5 drivers and some level of support for scalars and compound dataset support. I agree the problem is open-ended, and I don't want to encourage developing specific support for specific data structures. Some general utilities for working with compound datasets would be helpful though---for example, GDAL support for extracting a single column from a compound table would allow me to sequentially pull the identifiers I need and use them in the stars framework.

edzer commented 5 years ago

OK. closing here?

mkoohafkan commented 5 years ago

Yes, thank you @edzer.