grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5

`H5Awrite` for strings has premature null termination #132

Closed by LTLA 8 months ago

LTLA commented 8 months ago
library(rhdf5)
tmp <- tempfile(fileext=".h5")
fhandle <- H5Fcreate(tmp, "H5F_ACC_TRUNC")
ghandle <- H5Gcreate(fhandle, "whee")

tid <- H5Tcopy("H5T_C_S1")
H5Tset_strpad(tid, strpad = "NULLPAD")
H5Tset_size(tid, 5L) # size of 5 bytes

ahandle <- H5Acreate(ghandle, "name", dtype_id=tid, h5space=H5Screate("H5S_SCALAR"))
H5Awrite(ahandle, "Aaron") # string of length 5

H5Aclose(ahandle)
H5Gclose(ghandle)
H5Fclose(fhandle)

One would expect my name to fit inside the attribute, but alas:

h5readAttributes(tmp, "whee")
## $name
## [1] "Aaro"

This seems to be caused by

https://github.com/grimbough/rhdf5/blob/cb102ba93d22a703de118e4e8e65ae73c4aaff0c/src/H5A.c#L457

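where the copy loop stops one byte early because of the `j < (stsize-1)` condition, which always leaves the last byte as a null. A NULLPAD string type does not need a trailing null when the string fills the full width, so the fix is probably quite simple: change the condition to `j < stsize`, which would cause the entire string to be written.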
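To make the off-by-one concrete, here is a minimal standalone sketch of the two loop conditions. It is not the rhdf5 source: `copy_fixed`, `copy_reserving_terminator`, and `copy_full_width` are hypothetical names, and the only detail taken from the linked line is the `j < (stsize-1)` bound.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not the rhdf5 code): copy at most `limit` bytes of
 * `src` into a fixed-width buffer of `stsize` bytes, then fill the
 * remainder with null bytes, as a NULLPAD string type expects. */
void copy_fixed(char *dst, const char *src, size_t stsize, size_t limit)
{
    assert(dst != NULL && src != NULL && limit <= stsize);
    size_t j = 0;
    for (; j < limit && src[j] != '\0'; j++) {
        dst[j] = src[j];          /* copy string bytes */
    }
    for (; j < stsize; j++) {
        dst[j] = '\0';            /* null-pad whatever is left */
    }
}

/* Mirrors the reported bound j < (stsize - 1): the last byte is never
 * written from the source, so a full-width string loses its final byte. */
void copy_reserving_terminator(char *dst, const char *src, size_t stsize)
{
    copy_fixed(dst, src, stsize, stsize > 0 ? stsize - 1 : 0);
}

/* Mirrors the proposed bound j < stsize: every byte of the buffer may
 * hold string data, and shorter strings are still null-padded. */
void copy_full_width(char *dst, const char *src, size_t stsize)
{
    copy_fixed(dst, src, stsize, stsize);
}
```

With a 5-byte buffer and the input `"Aaron"`, the first variant stores `Aaro` followed by a null, matching the truncated attribute above, while the second stores all five bytes. Shorter inputs behave the same under both variants.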

Incidentally, datasets do the right thing, so I don't see why attributes have this weird behavior.

library(rhdf5)
tmp <- tempfile(fileext=".h5")
fhandle <- H5Fcreate(tmp, "H5F_ACC_TRUNC")

tid <- H5Tcopy("H5T_C_S1")
H5Tset_strpad(tid, strpad = "NULLPAD")
H5Tset_size(tid, 5L) # size of 5 bytes

dhandle <- H5Dcreate(fhandle, "name", dtype_id=tid, h5space=H5Screate("H5S_SCALAR"))
H5Dwrite(dhandle, "Aaron") # string of length 5

H5Dclose(dhandle)
H5Fclose(fhandle)

h5read(tmp, "name")
## [1] "Aaron"
Session information

```
R Under development (unstable) (2023-11-10 r85507)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/trunk/lib/libRblas.so
LAPACK: /home/luna/Software/R/trunk/lib/libRlapack.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] rhdf5_2.47.0

loaded via a namespace (and not attached):
[1] compiler_4.4.0      rhdf5filters_1.15.1 Rhdf5lib_1.25.0
```