Closed jhgronau closed 4 years ago
Merging #50 into master will not change coverage. The diff coverage is
n/a
.
@@ Coverage Diff @@
## master #50 +/- ##
=======================================
Coverage 80.22% 80.22%
=======================================
Files 30 30
Lines 1512 1512
=======================================
Hits 1213 1213
Misses 299 299
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 1eda087...777ed8d. Read the comment docs.
Hi, thanks for the pull request. Do you have an example file where this is happening?
I'm just wondering if this is actually related to https://github.com/grimbough/rhdf5/blob/1eda087a96dc1e9f1748144fbaf687d062e94e57/R/h5write.R#L198 which was recent change aimed to help deal with NA
strings in R.
Hi, that looks like a similar problem that was addressed
You can test this easily by running below R code, once with rhdf5 master and once with my patch. When you compare the H5 files in HDFView, you will find that the column "type" has been saved as "String, length = 2" by rhdf5 master and as "String, length = 1" with my patch applied
example <- data.table(time = as.POSIXct(c("2019-02-01 01:00:00","2019-02-01 01:00:00","2019-02-01 02:00:00","2019-02-01 02:00:00")), type = c("A","B","A","B"), value = c(1.1,1.0,1.2,1.1))
h5fhandle <- H5Fcreate("someFile.h5")
h5ghandle <- H5Gcreate(h5fhandle,"testGroup")
h5writeDataset.data.frame(example, h5ghandle, "testDataset")
H5Gclose(h5ghandle)
H5Fflush(h5fhandle)
H5Fclose(h5fhandle)
Hi, I've finally got round to looking into this. The reason for the datatype size always being larger than the length of the string was to account for the null terminator following the guidelines in:
The size set for a string datatype should include space for the null-terminator character, otherwise it will not be stored on (or retrieved from) disk. https://portal.hdfgroup.org/display/HDF5/H5T_SET_SIZE
I think your patch could potentially lead to issues if the strings are supposed to be null terminated, but the terminator isn't actually written. However HDF5 also allows for 'Null Padded' strings, which I think will meet your needs. I've made a new pull request at #57 that will write character vectors as null padded by default, both in isolation and as columns in a compound dataset.
My understanding is that the current approach attempts to avoid strings of length zero. However, by doing so, it prevents R from creating single character columns in a compound dataset (as a single character column automatically becomes a string of length 2). This is causing a problem for the interoperability with languages which have single character data types and attempt to map an H5 compound data type to a struct containing such a data type.