grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5

Handling NA strings #42

Open LTLA opened 5 years ago

LTLA commented 5 years ago
```r
library(rhdf5)
h5createFile("whee.h5")
h5write(LETTERS, "whee.h5", "stuff") # okay
h5write(c(NA_character_, LETTERS), "whee.h5", "more_stuff")
## Error in if (chunk_size > 2^32 - 1) { :
##   missing value where TRUE/FALSE needed
```

I don't have a good idea of how to represent an NA string. Maybe we could add a character at the end of the fixed-length array (after the null terminator) whose only purpose is to tell us whether the entry is NA or not? Yeah, a bit wasteful, but it's the least of all evils.
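To make the flag idea concrete, here is a minimal base R sketch that works on raw vectors rather than an actual HDF5 file. `encode_fixed()` and `decode_fixed()` are hypothetical helpers invented for illustration, not part of rhdf5, and the layout (payload, null terminator, one flag byte) is only one way to read the proposal.

```r
# Hypothetical helper: encode each string into a fixed-width raw buffer laid
# out as [payload bytes, NUL padding/terminator, NA flag byte].
encode_fixed <- function(x) {
  stopifnot(is.character(x))
  lens <- ifelse(is.na(x), 0L, nchar(x, type = "bytes"))
  width <- max(lens, 0L)
  lapply(seq_along(x), function(i) {
    buf <- raw(width + 2L)                 # zero-filled buffer
    if (is.na(x[i])) {
      buf[width + 2L] <- as.raw(1L)        # flag byte marks a missing value
    } else if (lens[i] > 0L) {
      buf[seq_len(lens[i])] <- charToRaw(x[i])
    }
    buf
  })
}

# Hypothetical helper: decode the buffers, checking the flag byte first.
decode_fixed <- function(bufs) {
  vapply(bufs, function(buf) {
    n <- length(buf)
    if (buf[n] == as.raw(1L)) return(NA_character_)
    payload <- buf[seq_len(n - 2L)]
    rawToChar(payload[payload != as.raw(0L)])  # drop NUL padding
  }, character(1))
}

x <- c(NA_character_, "NA", "A", "")
bufs <- encode_fixed(x)
roundtrip <- decode_fixed(bufs)
print(roundtrip)
```

With this layout a missing value and the literal string "NA" stay distinguishable, at the cost of one extra byte per element.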

Session information
```
R version 3.6.0 Patched (2019-05-10 r76483)
Platform: x86_64-apple-darwin17.7.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /Users/luna/Software/R/R-3-6-branch-dev/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/R-3-6-branch-dev/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] rhdf5_2.29.0

loaded via a namespace (and not attached):
[1] compiler_3.6.0 Rhdf5lib_1.7.4
```

grimbough commented 5 years ago

If nothing else, it's not a helpful error message. Why's a chunk size check failing here? I'll take a look at what's happening and try to come up with a solution for NA strings.

grimbough commented 5 years ago

The error was being thrown because the code tried to find the length of the longest string, got NA back, and then used that value to determine the chunk size. That's now fixed.
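The failure mode can be reproduced in base R without rhdf5. This is a minimal sketch, assuming the chunk size is derived from something like `max(nchar(x))`; the variable names are illustrative and not taken from the rhdf5 source.

```r
x <- c(NA_character_, LETTERS)

# nchar() returns NA for a missing string (default type = "chars"),
# so the maximum over the whole vector is NA as well.
longest <- max(nchar(x))

# Using that NA in an if() condition reproduces the reported error message.
msg <- tryCatch(
  if (longest > 2^32 - 1) "too large" else "ok",
  error = function(e) conditionMessage(e)
)
print(msg)

# One possible guard (an assumption, not necessarily the fix that was applied):
# ignore missing values when measuring the longest string.
longest_ok <- max(nchar(x), na.rm = TRUE)
print(longest_ok)
```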

It now writes an NA_character_ to file as a literal "NA" and sets an attribute if this has occurred. It should then coerce them back to NA_character_ if it's read with h5read().

```r
library(rhdf5)
h5createFile("whee.h5")
h5write(c(NA_character_, LETTERS), "whee.h5", "more_stuff")
h5read("whee.h5", "more_stuff")
## [1] NA  "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
```

If you happen to write "NA" on its own then that will be preserved, but this will cause an issue if you try h5write(c(NA_character_, "NA")), as they'll both be converted to NA_character_ when read back. Hopefully that's not something that occurs too often, and it will throw a warning if it's detected.
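The sentinel scheme and its collision case can be sketched in base R. `encode_na_strings()` and `decode_na_strings()` are hypothetical helpers, and the flag is kept in a plain list rather than in an HDF5 attribute, so this illustrates the behaviour described here rather than the rhdf5 implementation.

```r
# Hypothetical writer side: replace missing strings with the literal "NA"
# and record whether any replacement happened.
encode_na_strings <- function(x) {
  stopifnot(is.character(x))
  has_na <- anyNA(x)
  if (has_na && any(x == "NA", na.rm = TRUE)) {
    warning("input holds both NA_character_ and the literal \"NA\"; ",
            "they cannot be told apart after writing")
  }
  x[is.na(x)] <- "NA"
  list(values = x, na_flag = has_na)
}

# Hypothetical reader side: convert "NA" back only when the flag was set.
decode_na_strings <- function(stored) {
  x <- stored$values
  if (isTRUE(stored$na_flag)) x[x == "NA"] <- NA_character_
  x
}

# Missing value only: round-trips correctly.
ok <- decode_na_strings(encode_na_strings(c(NA_character_, "A")))

# Literal "NA" only: the flag is not set, so the string is preserved.
plain <- decode_na_strings(encode_na_strings(c("NA", "A")))

# Both present: a warning is raised and both elements come back as NA.
clash <- suppressWarnings(
  decode_na_strings(encode_na_strings(c(NA_character_, "NA")))
)
print(list(ok = ok, plain = plain, clash = clash))
```

The third case is the fragile one: once both values are written as "NA", a single flag is not enough to recover which element was originally missing.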

Let me know if I've missed an obvious reader. For the moment this only works in h5read().

LTLA commented 5 years ago

Good enough for the time being, but it seems a bit fragile... better hope no one's working with NAs on neuraminidase.