grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5

Support variable-length, UTF-8 encoded string datasets. #88

Closed LTLA closed 3 years ago

LTLA commented 3 years ago

Following in the footsteps of #80:

library(rhdf5)
unlink("ex_hdf5file.h5")
h5createFile("ex_hdf5file.h5")

# write character vectors as fixed- or variable-length strings, with ASCII or UTF-8 encoding
h5write(c("Aaron", "was", "here", "1"), "ex_hdf5file.h5","A", variableLengthString=TRUE)
h5write(c("Aaron", "was", "here", "2"), "ex_hdf5file.h5","B", variableLengthString=FALSE)
h5write(c("Aaron", "was", "here", "3"), "ex_hdf5file.h5","C", variableLengthString=TRUE, encoding="UTF8")
h5write(c("Aaron", "was", "here", "4"), "ex_hdf5file.h5","D", variableLengthString=FALSE, encoding="UTF8")

h5read("ex_hdf5file.h5", "A")
## [1] "Aaron" "was"   "here"  "1"    
h5read("ex_hdf5file.h5", "B")
## [1] "Aaron" "was"   "here"  "2"    
h5read("ex_hdf5file.h5", "C")
## [1] "Aaron" "was"   "here"  "3"    
h5read("ex_hdf5file.h5", "D")
## [1] "Aaron" "was"   "here"  "4"    

Looking at the h5dump ex_hdf5file.h5:

HDF5 "ex_hdf5file.h5" {
GROUP "/" {
   DATASET "A" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLPAD;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
      DATA {
      (0): "Aaron", "was", "here", "1"
      }
      ATTRIBUTE "rhdf5-NA.OK" {
         DATATYPE  H5T_STD_I32LE
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
         DATA {
         (0): 1
         }
      }
   }
   DATASET "B" {
      DATATYPE  H5T_STRING {
         STRSIZE 5;
         STRPAD H5T_STR_NULLPAD;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
      DATA {
      (0): "Aaron", "was\000\000", "here\000", "2\000\000\000\000"
      }
      ATTRIBUTE "rhdf5-NA.OK" {
         DATATYPE  H5T_STD_I32LE
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
         DATA {
         (0): 1
         }
      }
   }
   DATASET "C" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLPAD;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
      DATA {
      (0): "Aaron", "was", "here", "3"
      }
      ATTRIBUTE "rhdf5-NA.OK" {
         DATATYPE  H5T_STD_I32LE
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
         DATA {
         (0): 1
         }
      }
   }
   DATASET "D" {
      DATATYPE  H5T_STRING {
         STRSIZE 5;
         STRPAD H5T_STR_NULLPAD;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
      DATA {
      (0): "Aaron", "was\000\000", "here\000", "4\000\000\000\000"
      }
      ATTRIBUTE "rhdf5-NA.OK" {
         DATATYPE  H5T_STD_I32LE
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
         DATA {
         (0): 1
         }
      }
   }
}
}

... which verifies that A and C are, indeed, being stored as variable-length strings (STRSIZE H5T_VARIABLE), while B and D are fixed-length (STRSIZE 5).

Note that variable length strings do not seem to be compressed, so YMMV with respect to actual size savings.

Regenerating the roxygen documentation and adding unit tests are at your discretion.

grimbough commented 3 years ago

Thanks @LTLA, I've merged this with some added examples etc. What's your use case for the variable length strings? In my limited testing, the compression available with the fixed length datatype results in much smaller files for the same input compared to using the variable length version.
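For anyone wanting to repeat this kind of comparison, a minimal sketch follows. It assumes a version of rhdf5 that includes the `variableLengthString` argument shown at the top of this thread; the input strings, dataset name and temporary file names are made up for illustration, and this is not the test that was actually run here.

```r
library(rhdf5)

# Made-up input: 1000 strings whose lengths vary between 1 and 200 characters.
set.seed(1)
x <- vapply(
  sample(1:200, 1000, replace = TRUE),
  function(n) paste(rep("a", n), collapse = ""),
  character(1)
)

fixed_file <- tempfile(fileext = ".h5")
vl_file <- tempfile(fileext = ".h5")

# Same data, written once as fixed-length and once as variable-length strings.
h5createFile(fixed_file)
h5write(x, fixed_file, "x", variableLengthString = FALSE)

h5createFile(vl_file)
h5write(x, vl_file, "x", variableLengthString = TRUE)

# Compare the on-disk sizes in bytes.
sizes <- c(fixed = file.size(fixed_file), variable = file.size(vl_file))
print(sizes)

# Check that both files give back the original strings.
roundtrip_ok <- identical(as.vector(h5read(fixed_file, "x")), x) &&
  identical(as.vector(h5read(vl_file, "x")), x)
```

The result will depend heavily on the input and on the rhdf5 defaults for chunking and compression, so the printed sizes are only indicative for real data.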

I can't find a definitive reference in the HDF5 docs, but I'm also not clear whether partial IO is available for variable length string datasets. If compression isn't available, is chunking?

LTLA commented 2 years ago

What's your use case for the variable length strings?

In the end, nothing.

I had some data frames that I wanted to save into HDF5. Each row of the DF corresponded to a gene, and whoever made the DF had decided to create a column containing comma-separated identifiers for all gene sets to which that gene belonged. This meant that the width of the string was highly variable, from zero to several thousand characters.

Fixed length strings didn't handle this well. I had hoped that VL strings would be able to do better, but that turned out not to be the case. Regardless, I had already written the code, hence the PR.
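To put rough numbers on why fixed-length storage struggles with this kind of column, here is a small base R sketch. The gene-set identifiers and counts are hypothetical, and the calculation ignores compression and any per-string overhead; it only shows how much of an uncompressed fixed-length dataset would be padding.

```r
# Hypothetical gene-set column: most genes belong to few sets, one to many.
gene_sets <- c(
  "",                                             # gene in no sets
  "SET1",
  "SET1,SET2,SET3",
  paste(sprintf("SET%d", 1:500), collapse = ",")  # gene in 500 sets
)

widths <- nchar(gene_sets, type = "bytes")

# Fixed-length storage pads every element to the width of the longest string.
fixed_bytes <- max(widths) * length(gene_sets)

# The character payload itself is just the sum of the individual widths.
payload_bytes <- sum(widths)

# Fraction of the uncompressed fixed-length dataset that is padding.
padding_fraction <- 1 - payload_bytes / fixed_bytes
print(c(fixed = fixed_bytes, payload = payload_bytes))
print(padding_fraction)
```

Because a single long string sets the width for every row, the padding grows with the number of rows. The padding bytes are highly repetitive, which is consistent with the observation above that compressed fixed-length datasets come out small.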

If compression isn't available, is chunking?

Compression, chunking and partial IO are "available" for VL strings... but not in the way that one might expect. From what I understand, these things are only applied to the pointers to the VL strings. Which is not very helpful, as the VL strings themselves are stored in an uncompressed space somewhere. Which is also why it didn't solve my problem.