grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5

`h5write` truncates UTF-8 strings #111

Closed: LTLA closed this issue 2 years ago

LTLA commented 2 years ago
library(rhdf5)
h5createFile("ex_hdf5file.h5")
h5write("α ≤ 0.1", "ex_hdf5file.h5", "WHEE")
h5read("ex_hdf5file.h5", "WHEE")
## [1] "α ≤ "

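This is because `nchar` should have been called with `type = "bytes"` to get the length in bytes rather than the length in characters. For multi-byte UTF-8 strings the two differ, so sizing the stored string by character count cuts it short.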
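To make the mismatch concrete, here is a minimal base-R sketch (no rhdf5 involved). It assumes the truncation comes from keeping only as many bytes as the string has characters, which is my reading of the report rather than a description of the rhdf5 internals. The variable names are illustrative.

```r
# The string from the report, written with \u escapes so that it is
# UTF-8 regardless of the session locale.
x <- "\u03b1 \u2264 0.1"

n_chars <- nchar(x, type = "chars")  # number of characters
n_bytes <- nchar(x, type = "bytes")  # number of bytes in the UTF-8 encoding

# Simulate a buffer sized by character count: keep only the first
# n_chars bytes of the UTF-8 representation.
truncated <- rawToChar(charToRaw(x)[seq_len(n_chars)])
Encoding(truncated) <- "UTF-8"

print(c(chars = n_chars, bytes = n_bytes))
print(truncated)
```

Under that assumption, the byte-truncated string is the same `"α ≤ "` as the output reported above.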

While we're at it, if `encoding = NULL`, you could choose "UTF-8" based on the value of `Encoding(obj)`. This would be more faithful to the in-memory representation of the string in R.
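A rough sketch of what that suggestion could look like, using only base R. `pick_encoding` is a hypothetical helper invented for illustration, not an rhdf5 function, and the fallback to "ASCII" is a simplifying assumption.

```r
# Hypothetical helper: choose a storage encoding for a character vector.
# An explicit `encoding` always wins. Otherwise use "UTF-8" if any
# element is marked as UTF-8 in R, and fall back to "ASCII".
pick_encoding <- function(obj, encoding = NULL) {
  if (!is.null(encoding)) {
    return(encoding)
  }
  if (any(Encoding(obj) == "UTF-8")) "UTF-8" else "ASCII"
}

print(pick_encoding("\u03b1 \u2264 0.1"))           # string marked as UTF-8
print(pick_encoding("plain ascii"))                 # Encoding() is "unknown"
print(pick_encoding("plain ascii", encoding = "UTF-8"))
```

Note that `Encoding()` reports "unknown" for pure-ASCII strings and can also report "latin1", which this sketch does not handle separately.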

grimbough commented 2 years ago

This is hopefully fixed in version 2.41.1:

packageVersion("rhdf5")
#> [1] ‘2.41.1’
library(rhdf5)
h5createFile("ex_hdf5file.h5")
h5write("α ≤ 0.1", "ex_hdf5file.h5", "WHEE")
output <- h5read("ex_hdf5file.h5", "WHEE")
output
#> [1] "α ≤ 0.1"

It also sets the encoding to UTF-8 in R, if that's what the HDF5 dataset uses.

Encoding(output)
#> [1] "UTF-8"