grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5
61 stars 22 forks source link

Update h5writeDataFrame.c #50

Closed jhgronau closed 4 years ago

jhgronau commented 4 years ago

My understanding is that the current approach attempts to avoid strings of length zero. However, by doing so, it prevents R from creating single character columns in a compound dataset (as a single character column automatically becomes a string of length 2). This is causing a problem for the interoperability with languages which have single character data types and attempt to map an H5 compound data type to a struct containing such a data type.

codecov[bot] commented 4 years ago

Codecov Report

Merging #50 into master will not change coverage. The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master      #50   +/-   ##
=======================================
  Coverage   80.22%   80.22%           
=======================================
  Files          30       30           
  Lines        1512     1512           
=======================================
  Hits         1213     1213           
  Misses        299      299

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 1eda087...777ed8d. Read the comment docs.

grimbough commented 4 years ago

Hi, thanks for the pull request. Do you have an example file where this is happening?

I'm just wondering if this is actually related to https://github.com/grimbough/rhdf5/blob/1eda087a96dc1e9f1748144fbaf687d062e94e57/R/h5write.R#L198 which was recent change aimed to help deal with NA strings in R.

jhgronau commented 4 years ago

Hi, that looks like a similar problem that was addressed

You can test this easily by running below R code, once with rhdf5 master and once with my patch. When you compare the H5 files in HDFView, you will find that the column "type" has been saved as "String, length = 2" by rhdf5 master and as "String, length = 1" with my patch applied

example <- data.table(time = as.POSIXct(c("2019-02-01 01:00:00","2019-02-01 01:00:00","2019-02-01 02:00:00","2019-02-01 02:00:00")), type = c("A","B","A","B"), value = c(1.1,1.0,1.2,1.1)) h5fhandle <- H5Fcreate("someFile.h5") h5ghandle <- H5Gcreate(h5fhandle,"testGroup") h5writeDataset.data.frame(example, h5ghandle, "testDataset") H5Gclose(h5ghandle) H5Fflush(h5fhandle) H5Fclose(h5fhandle)

grimbough commented 4 years ago

Hi, I've finally got round to looking into this. The reason for the datatype size always being larger than the length of the string was to account for the null terminator following the guidelines in:

The size set for a string datatype should include space for the null-terminator character, otherwise it will not be stored on (or retrieved from) disk. https://portal.hdfgroup.org/display/HDF5/H5T_SET_SIZE

I think your patch could potentially lead to issues if the strings are supposed to be null terminated, but the terminator isn't actually written. However HDF5 also allows for 'Null Padded' strings, which I think will meet your needs. I've made a new pull request at #57 that will write character vectors as null padded by default, both in isolation and as columns in a compound dataset.