Closed LTLA closed 3 years ago
Actually, that wasn't hard at all:
library(rhdf5)
h5createFile("ex_hdf5file.h5")
# write a matrix
B = matrix(runif(100), 10, 10)
dimnames(B) <- list(LETTERS[1:10], letters[10:19])
h5write(B, "ex_hdf5file.h5","B")
h5read("ex_hdf5file.h5", "B", start=c(1,2), stride=c(4, 3), block=c(3,2), count=c(2, 3))
## l m o p r s
## B 0.2358985 0.66352289 0.39027587 0.9880186 0.5638421 0.8282238
## C 0.8724017 0.06961246 0.85280121 0.1921349 0.6555160 0.2661366
## D 0.8151867 0.43678885 0.67171752 0.9481734 0.1130484 0.2923864
## F 0.1451874 0.46678574 0.45811005 0.4645230 0.9669605 0.6421477
## G 0.1414784 0.50892343 0.01013866 0.7655203 0.1363534 0.9254932
## H 0.8191585 0.36855542 0.51680822 0.9040001 0.7627248 0.5818725
A mild concern is the fact that HDF5 attributes are not compressed, so for thin arrays, the dimnames may actually take up more space than the data. The "correct" solution would be to store HDF5 object references (https://support.hdfgroup.org/HDF5/doc/H5.user/References.html) in the attributes and to have the dimnames live as datasets in their own right, but this is tricky to manage as you now need to keep two different datasets in sync. Maybe it's not a big deal.
Having thought about it, it's probably not a big deal. If people are concerned about space efficiency, they should store the dimnames separately as datasets in a manner that is comfortable for them. At least rhdf5 now provides some sensible default behavior.
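For anyone who does want the dimnames kept out of the attributes, a minimal sketch of the "separate datasets" approach might look like the following. The dataset names B_rownames and B_colnames are made up for illustration; rhdf5 does not prescribe any naming scheme, and nothing here links the name datasets to B other than convention.

```r
library(rhdf5)

# Use a temporary file so the sketch does not clobber anything.
path <- tempfile(fileext = ".h5")
h5createFile(path)

B <- matrix(runif(100), 10, 10,
            dimnames = list(LETTERS[1:10], letters[10:19]))

# Write the bare data, then the names as ordinary character datasets.
# "B_rownames" and "B_colnames" are hypothetical names chosen for this example.
h5write(unname(B), path, "B")
h5write(rownames(B), path, "B_rownames")
h5write(colnames(B), path, "B_colnames")

# Read everything back and reattach the names by hand.
B2 <- h5read(path, "B")
dimnames(B2) <- list(as.character(h5read(path, "B_rownames")),
                     as.character(h5read(path, "B_colnames")))
```

The obvious downside is the one already mentioned: the caller is responsible for keeping the three datasets in sync.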
Thanks for the pull request, it looks like a useful addition. I'm doing a bit of an overhaul of a number of different things, so I'll add checking this over to the list and hopefully merge it in a couple of days.
HDF5 has the concept of Dimension Scale datasets, which is the natural thing to use for storing datasets associated with the dimensions of a given dataset, like its dimnames. This is what HDF5Array::h5writeDimnames() and HDF5Array::h5readDimnames() use.
It would be good to coordinate on this; otherwise the dimnames written with HDF5Array::h5writeDimnames() won't be found by rhdf5::h5read(), and those written by rhdf5::h5write() won't be found by HDF5Array::h5readDimnames().
Is there any update on this? This would be great to have.
We should probably go with @hpages's solution.
@nuno-agostinho h5writeDimnames() and h5readDimnames() are available in the HDF5Array package. I might factor out the low-level hdf5 manipulation facilities that I implemented in HDF5Array (e.g. h5mread(), h5writeDimnames(), h5readDimnames(), etc.) to their own package at some point.
Packs array dimnames into the attributes as character vectors in h5writeDataset.array and pulls them out again in h5read. This should be entirely backwards compatible: if there are no dimnames attributes, the function will just shrug and move on. I haven't yet bothered to try to get the start/count/block/stride stuff right.
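For the start/count/block/stride part, one possible approach is to expand the hyperslab arguments into explicit indices per dimension and subset the stored dimnames with them. A rough base-R sketch, assuming 1-based start values as in the h5read() call earlier in the thread; hyperslab_index() is a hypothetical helper and not part of rhdf5:

```r
# Expand one dimension's hyperslab parameters into explicit 1-based indices.
# Each of the 'count' blocks begins 'stride' elements after the previous one
# and covers 'block' consecutive elements.
hyperslab_index <- function(start, stride, block, count) {
  offsets <- start + (seq_len(count) - 1L) * stride
  as.integer(unlist(lapply(offsets, function(o) o + seq_len(block) - 1L)))
}

# Parameters from the example call:
# start=c(1,2), stride=c(4,3), block=c(3,2), count=c(2,3)
rows <- hyperslab_index(start = 1, stride = 4, block = 3, count = 2)
cols <- hyperslab_index(start = 2, stride = 3, block = 2, count = 3)

# Subset the full dimnames with the expanded indices.
sel_rows <- LETTERS[1:10][rows]   # "A" "B" "C" "E" "F" "G"
sel_cols <- letters[10:19][cols]  # "k" "l" "n" "o" "q" "r"
```

Note that the labels in the example output earlier in the thread (B, C, D, F, G, H and l, m, o, p, r, s) are shifted by one relative to this, which may reflect the unfinished subsetting mentioned above.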