grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5
60 stars 21 forks source link

Changing default string padding #57

Closed grimbough closed 4 years ago

grimbough commented 4 years ago

This is a response to #50. That pull request was made requesting that single character strings be represented by HDF5 string datatype with length 1, rather than length 2 as has always been the case in rhdf5. The additional length is to allow for the null terminator based on:

The size set for a string datatype should include space for the null-terminator character, otherwise it will not be stored on (or retrieved from) disk. https://portal.hdfgroup.org/display/HDF5/H5T_SET_SIZE

However HDF5 can also use 'Null Padded' strings, where the null terminator is omitted. As long as the datatype is aware that it's now null padded (via a call to H5Tset_strpad()), reading should be fine. This is the default setting used by other projects like h5py, and hopefully solves the issue in #50. I can't see any reason to keep the null terminated version as default if this solves a problem, and all existing tests for reading/writing character vectors seem to stay the same.

@hpages & @LTLA you're the biggest users of rhdf5, so I just wanted to give you a heads up about this change. If you have any obvious reasons why changing this default is a bad idea let me know, otherwise I'll merge it in a few days and we'll see if anything breaks.

codecov[bot] commented 4 years ago

Codecov Report

Merging #57 into master will increase coverage by 0.02%. The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #57      +/-   ##
==========================================
+ Coverage   80.53%   80.56%   +0.02%     
==========================================
  Files          30       30              
  Lines        1526     1533       +7     
==========================================
+ Hits         1229     1235       +6     
- Misses        297      298       +1
Impacted Files Coverage Δ
R/H5P.R 66.66% <100%> (+1.62%) :arrow_up:
R/h5create.R 91.17% <100%> (+0.06%) :arrow_up:
R/h5write.R 86.52% <0%> (-0.71%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 4986a53...448e2d3. Read the comment docs.

hpages commented 4 years ago

Right, using H5T_STR_NULLPAD over H5T_STR_NULLTERM makes a lot of sense. I can't even think of a good reason why the HDF5 format would support the latter.

Nice improvement! Everything looks fine on the HDF5Array side (HDF5Array:::h5mread() uses mkCharLen() to turn strings into CHARSXP so doesn't care whether the strings are NULL-terminated or NULL-padded).

Thanks Mike!