grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5

implement reading/writing HDF5 object references in attribute #96

Closed ilia-kats closed 2 years ago

ilia-kats commented 3 years ago

h5writeAttribute(object_to_reference, object, attribute_name) will write a reference to object_to_reference into attribute_name of object. Similarly, if attr is an attribute containing an object reference, H5Aread(attr) will return an H5IdComponent of the referenced object.

codecov[bot] commented 2 years ago

Codecov Report

Merging #96 (d045d45) into master (996bbfb) will decrease coverage by 0.04%. The diff coverage is 75.00%.

@@            Coverage Diff             @@
##           master      #96      +/-   ##
==========================================
- Coverage   74.84%   74.80%   -0.05%     
==========================================
  Files          34       34              
  Lines        1805     1814       +9     
==========================================
+ Hits         1351     1357       +6     
- Misses        454      457       +3     
Impacted Files     Coverage Δ
R/h5writeAttr.R    88.23% <71.42%> (-4.87%)
R/H5A.R            73.68% <75.00%> (-0.29%)
R/h5create.R       85.98% <100.00%> (+0.08%)


grimbough commented 2 years ago

Thanks @ilia-kats for creating this pull request, and apologies for taking so long to get round to doing something with it.

I'm not against incorporating something like this in principle, but I wonder whether h5writeAttribute() is really the best place for this functionality. My general design principle is that the h5x() functions should be relatively simple wrappers around common operations that work with standard R datatypes, without too many options and arguments for users to interpret. If someone really wants to get into the deep HDF5 details, the H5X() functions map to the C API and should be used for lower-level operations, where you get the full range of esoteric stuff HDF5 allows. This feels like it falls into that latter category, but I'm happy to be persuaded otherwise.
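
As a rough illustration of that split (a sketch only, not taken from the package documentation): the two helpers below read the same dataset, once through the high-level h5read() and once through the low-level H5Fopen()/H5Dopen()/H5Dread() calls. The helper names read_high_level and read_low_level are hypothetical and exist only for this comparison; the rhdf5 package is assumed to be installed.

```r
## High-level h5x() style: one call, standard R object back,
## no handles for the caller to manage.
read_high_level <- function(file, name) {
  rhdf5::h5read(file = file, name = name)
}

## Low-level H5X() style: mirrors the C API, so every handle that is
## opened must also be closed by the caller.
read_low_level <- function(file, name) {
  fid <- rhdf5::H5Fopen(file)
  on.exit(rhdf5::H5Fclose(fid), add = TRUE)

  did <- rhdf5::H5Dopen(h5loc = fid, name = name)
  ## close the dataset before the file
  on.exit(rhdf5::H5Dclose(did), add = TRUE, after = FALSE)

  rhdf5::H5Dread(did)
}
```

Both helpers should return the same values for a plain integer or numeric dataset; the low-level version is only worth the extra bookkeeping when an operation has no h5x() equivalent.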

I've never used HDF5 object references before, so I'm curious what you're doing with them. Do you have any example code or schematics for the file type you're developing?

ilia-kats commented 2 years ago

Thanks for your reply. I'm working on a pure-R implementation of the AnnData format for single-cell omics data. AnnData uses object references to handle categorical (factor) columns in data frames: the HDF5 object for the column stores the integer codes along with a reference to another HDF5 object storing the labels (code). AnnData has been around for a while and there are tons of these files around, so I'm not really flexible regarding the format.
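
To make the codes-plus-labels layout concrete, here is a minimal base-R sketch of turning the two pieces back into a factor once both have been read. It does not touch HDF5 at all. The helper name codes_to_factor is hypothetical, the example values are invented, and the sketch assumes the codes are 0-based with -1 marking a missing value (the pandas convention).

```r
## Hypothetical helper: rebuild an R factor from integer codes and labels.
## Assumption: codes are 0-based and a negative code means "missing".
codes_to_factor <- function(codes, labels) {
  codes[codes < 0] <- NA                      # negative codes become NA
  factor(labels[codes + 1L], levels = labels) # shift to R's 1-based indexing
}

## Made-up example data standing in for the two HDF5 objects
codes  <- c(0L, 2L, 1L, -1L, 2L)                 # stored with the column
labels <- c("B cell", "T cell", "monocyte")      # stored in the referenced object

cell_type <- codes_to_factor(codes, labels)
```

Passing levels = labels keeps the level order from the file, including any label that no row happens to use.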

I briefly looked into a low-level wrapper around HDF5 references when I was writing this PR, but wrapping the entire API would take quite some time, which is why I chose to implement this directly in h5writeAttribute(). I can try to write a partial wrapper implementing only what is required to get this particular functionality to work, unless you have a better idea?

grimbough commented 2 years ago

I've tried to make the complete H5R API from HDF5 1.10 available in the object-references branch. Thanks a lot for the starting point; it was helpful to build on your code. This now supports dataset region references too, if you happen to need those at any point.

Having used the functions, I can see why some wrapper functions that do the dereferencing automatically would be nice, and I'll probably add those fairly soon, but I don't have time right now. However, this API should remain pretty stable if you want to work with it. I'll merge it into bioc-devel once I've written a few tests and the manual pages.

Hopefully the examples below are useful, but it looks like you know what you're doing. Let me know if anything is missing or doesn't behave as expected.

## create an example file with a group and a dataset
library(rhdf5)
file_name <- tempfile()
h5createFile(file_name)
h5createGroup(file = file_name, group = "/foo")
#> [1] TRUE
h5write(1:100, file=file_name, name="/foo/baa")

###################################################
## Writing references as an attribute #############
###################################################

## open file and create reference to /foo/baa dataset
fid <- H5Fopen(file_name)
ref_to_dataset <- H5Rcreate(fid, name = "/foo/baa")

## create an attribute to contain our object ref
sid <- H5Screate_simple( length(ref_to_dataset) )
tid <- H5Tcopy(dtype_id = "H5T_STD_REF_OBJ")
obj_ref_attr <- H5Acreate(fid, name = "object_refs", dtype_id = tid, h5space = sid)

## write our references to the attribute & close
H5Awrite(h5attribute = obj_ref_attr, buf = ref_to_dataset)
#> Object reference

## tidy up
H5Aclose(obj_ref_attr)
H5Sclose(sid)
H5Fclose(fid)

###################################################
## Reading reference & dereferencing dataset ######
###################################################

## open file and read attribute 
fid <- H5Fopen(file_name)
aid <- H5Aopen(h5obj = fid, name = 'object_refs')
references <- H5Aread(h5attribute = aid)
## this is an H5Ref object
references
#> HDF5 REFERENCE
#> Type: H5R_OBJECT 
#> Length: 1

## apply the ref to the file handle and receive a dataset identifier
dset_from_ref <- H5Rdereference(ref = references, h5loc = fid)
H5Dread(dset_from_ref)
#>   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
#>  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
#>  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
#>  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
#>  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
#>  [91]  91  92  93  94  95  96  97  98  99 100

## tidy up
H5Aclose(aid)
H5Dclose(dset_from_ref)
H5Fclose(fid)
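
The read-and-dereference steps above could be bundled into a small helper until proper wrapper functions exist. The following is only a sketch under stated assumptions: read_attr_reference is a hypothetical name and not part of rhdf5, the attribute is assumed to sit on the root of the file, and it is assumed to hold a single object reference that points at a dataset (not a group).

```r
## Hypothetical convenience wrapper, built only from the calls shown above.
## Assumptions: attribute lives on the file root; it holds one object
## reference; the referenced object is a dataset.
read_attr_reference <- function(file, attr_name) {
  fid <- rhdf5::H5Fopen(file)
  on.exit(rhdf5::H5Fclose(fid), add = TRUE)

  ## read the attribute; for a reference attribute this is an H5Ref object
  aid <- rhdf5::H5Aopen(h5obj = fid, name = attr_name)
  on.exit(rhdf5::H5Aclose(aid), add = TRUE, after = FALSE)
  ref <- rhdf5::H5Aread(h5attribute = aid)

  ## resolve the reference against the file handle and read the dataset
  did <- rhdf5::H5Rdereference(ref = ref, h5loc = fid)
  on.exit(rhdf5::H5Dclose(did), add = TRUE, after = FALSE)

  rhdf5::H5Dread(did)
}
```

With the file created in the example above, read_attr_reference(file_name, "object_refs") should return the same values as the H5Dread() call. Registering the close calls with on.exit() means the handles are released even if one of the intermediate steps fails.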