grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5

Added support for storing and retrieving array dimnames. #65

Closed: LTLA closed this 3 years ago

LTLA commented 4 years ago

Packs array dimnames into the attributes as character vectors in h5writeDataset.array and pulls them out again in h5read. This should be entirely backwards compatible: if there are no dimnames attributes, the function will just shrug and move on.

library(rhdf5)
h5createFile("ex_hdf5file.h5")

# write a 3-D array with dimnames
B = array(seq(0.1,2.0,by=0.1),dim=c(5,2,2))
dimnames(B) <- list(1:5, LETTERS[1:2], letters[20:21])
h5write(B, "ex_hdf5file.h5","B")

h5read("ex_hdf5file.h5","B")
## , , t
## 
##     A   B
## 1 0.1 0.6
## 2 0.2 0.7
## 3 0.3 0.8
## 4 0.4 0.9
## 5 0.5 1.0
## 
## , , u
## 
##     A   B
## 1 1.1 1.6
## 2 1.2 1.7
## 3 1.3 1.8
## 4 1.4 1.9
## 5 1.5 2.0

I haven't yet bothered to try to get the start/count/block/stride stuff right.
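
The packing described above can be sketched by hand with calls that already exist in rhdf5 (`H5Fopen`, `H5Dopen`, `h5writeAttribute`, `h5readAttributes`). This is only an illustration of the idea, not the code in this pull request: the attribute names `dimnames_1` and `dimnames_2` are hypothetical, since the thread does not show the names the patch actually uses.

```r
library(rhdf5)

path <- tempfile(fileext = ".h5")
h5createFile(path)

B <- matrix(1:6, 2, 3, dimnames = list(c("r1", "r2"), c("a", "b", "c")))
h5write(unname(B), path, "B")

# Attach one character-vector attribute per dimension to the dataset.
# "dimnames_1" / "dimnames_2" are made-up names for this sketch.
fid <- H5Fopen(path)
did <- H5Dopen(fid, "B")
h5writeAttribute(rownames(B), did, "dimnames_1")
h5writeAttribute(colnames(B), did, "dimnames_2")
H5Dclose(did)
H5Fclose(fid)

# Read the data back, then restore the dimnames from the attributes.
out <- h5read(path, "B")
attrs <- h5readAttributes(path, "B")
dimnames(out) <- list(as.vector(attrs$dimnames_1),
                      as.vector(attrs$dimnames_2))
```

A file without these attributes would simply yield `NULL` for both lookups, which is the backwards-compatible case mentioned above.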

LTLA commented 4 years ago

Actually, that wasn't hard at all:

library(rhdf5)
h5createFile("ex_hdf5file.h5")

# write a matrix
B = matrix(runif(100), 10, 10)
dimnames(B) <- list(LETTERS[1:10], letters[10:19])
h5write(B, "ex_hdf5file.h5","B")

h5read("ex_hdf5file.h5", "B", start=c(1,2), stride=c(4, 3), block=c(3,2), count=c(2, 3))
##           l          m          o         p         r         s
## B 0.2358985 0.66352289 0.39027587 0.9880186 0.5638421 0.8282238
## C 0.8724017 0.06961246 0.85280121 0.1921349 0.6555160 0.2661366
## D 0.8151867 0.43678885 0.67171752 0.9481734 0.1130484 0.2923864
## F 0.1451874 0.46678574 0.45811005 0.4645230 0.9669605 0.6421477
## G 0.1414784 0.50892343 0.01013866 0.7655203 0.1363534 0.9254932
## H 0.8191585 0.36855542 0.51680822 0.9040001 0.7627248 0.5818725
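
For readers unfamiliar with hyperslab selection, here is a base-R sketch of how start/stride/block/count can be turned into index vectors and applied to dimnames. It assumes the HDF5 convention that each of the `count` blocks begins `stride` elements after the previous one and spans `block` elements, with a 1-based `start` as in rhdf5. The helper `hyperslab_index()` and the small matrix are hypothetical and are not taken from the patch.

```r
# Hypothetical helper: expand one dimension's hyperslab into 1-based indices.
hyperslab_index <- function(start, stride, block, count) {
  # First element of each block along this dimension.
  block_starts <- start + (seq_len(count) - 1L) * stride
  # Offsets within a block, added to every block start (column-major order
  # keeps the elements of each block together).
  as.vector(outer(seq_len(block) - 1L, block_starts, `+`))
}

M <- matrix(seq_len(24), 6, 4,
            dimnames = list(letters[1:6], LETTERS[1:4]))

rows <- hyperslab_index(start = 2, stride = 3, block = 2, count = 2)
cols <- hyperslab_index(start = 1, stride = 2, block = 1, count = 2)

# Subsetting with these indices carries the matching dimnames along.
sub <- M[rows, cols, drop = FALSE]
rownames(sub)  # "b" "c" "e" "f"
colnames(sub)  # "A" "C"
```

The same index vectors can be applied to stored dimnames after a partial read, so the names stay aligned with the selected rows and columns.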

LTLA commented 4 years ago

A mild concern is the fact that HDF5 attributes are not compressed, so for thin arrays, the dimnames may actually take up more space than the data. The "correct" solution would be to store HDF5 object references (https://support.hdfgroup.org/HDF5/doc/H5.user/References.html) in the attributes and to have the dimnames live as datasets in their own right, but this is tricky to manage as you now need to keep two different datasets in sync. Maybe it's not a big deal.

LTLA commented 4 years ago

> Maybe it's not a big deal.

Having thought about it, it's probably not a big deal. If people are concerned about space efficiency, they should store the dimnames separately as datasets in a manner that is comfortable for them. At least rhdf5 now provides some sensible default behavior.

grimbough commented 4 years ago

Thanks for the pull request, it looks like a useful addition. I'm doing a bit of an overhaul of a number of different things, so I'll add checking this over to the list and hopefully merge it in a couple of days.

hpages commented 4 years ago

HDF5 has the concept of Dimension Scale datasets, which are the natural thing to use for storing data associated with the dimensions of a given dataset, such as its dimnames. This is what HDF5Array::h5writeDimnames() and HDF5Array::h5readDimnames() use.

It would be good to coordinate on this; otherwise the dimnames written with HDF5Array::h5writeDimnames() won't be found by rhdf5::h5read(), and those written by rhdf5::h5write() won't be found by HDF5Array::h5readDimnames().
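
As a rough illustration of the Dimension Scale route, the two HDF5Array functions named above could be used as follows. This sketch assumes the argument order `h5writeDimnames(dimnames, filepath, name)` and `h5readDimnames(filepath, name)`; signatures may differ between HDF5Array versions, so check `?h5writeDimnames` before relying on it.

```r
library(rhdf5)
library(HDF5Array)

path <- tempfile(fileext = ".h5")
h5createFile(path)

# Write the data without names, then attach the dimnames separately.
B <- matrix(runif(6), 2, 3)
dn <- list(c("r1", "r2"), c("a", "b", "c"))
h5write(B, path, "B")

# Assumed signature: dimnames list, file path, dataset name.
h5writeDimnames(dn, path, "B")

# Retrieve the dimnames that were attached to dataset "B".
dn_back <- h5readDimnames(path, "B")
```

Because the names live in their own datasets rather than in attributes, they can be compressed like any other dataset, which addresses the space concern raised earlier in the thread.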

nuno-agostinho commented 3 years ago

Is there any update on this? This would be great to have.

LTLA commented 3 years ago

We should probably go with @hpages's solution.

hpages commented 3 years ago

@nuno-agostinho h5writeDimnames() and h5readDimnames() are available in the HDF5Array package. I might factor out the low-level HDF5 manipulation facilities that I implemented in HDF5Array (e.g. h5mread(), h5writeDimnames(), h5readDimnames(), etc.) into their own package at some point.