grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5
61 stars 22 forks source link

Logical NAs converted to FALSE with recent rhdf5 #58

Closed hpages closed 4 years ago

hpages commented 4 years ago

Hi Mike,

This was originally reported here.

In release (rhdf5 2.30.1):

library(rhdf5)
m <- matrix(c(FALSE, TRUE, NA, FALSE, TRUE, NA), nrow=2)
h5write(m, "test.h5", "M1")

h5ls("test.h5", all=TRUE)
#   group name         ltype corder_valid corder cset       otype num_attrs
# 0     /   M1 H5L_TYPE_HARD        FALSE      0    0 H5I_DATASET         1
#    dclass         dtype  stype rank   dim maxdim
# 0 INTEGER H5T_STD_I32LE SIMPLE    2 2 x 3  2 x 3

h5read("test.h5", "M1")
#       [,1]  [,2] [,3]
# [1,] FALSE    NA TRUE
# [2,]  TRUE FALSE   NA
# Warning message:
# In H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,  :
#   integer value -2^63 replaced NA. See the section 'Large integer data types' in the 'rhdf5' vignette for more details.

In devel (rhdf5 2.31.6):

library(rhdf5)

m <- matrix(c(FALSE, TRUE, NA, FALSE, TRUE, NA), nrow=2)

h5write(m, "test.h5", "M1")

h5ls("test.h5", all=TRUE)
#   group name         ltype corder_valid corder cset       otype num_attrs
# 0     /   M1 H5L_TYPE_HARD        FALSE      0    0 H5I_DATASET         1
#    dclass        dtype  stype rank   dim maxdim
# 0 INTEGER H5T_STD_U8LE SIMPLE    2 2 x 3  2 x 3

h5read("test.h5", "M1")
#       [,1]  [,2]  [,3]
# [1,] FALSE FALSE  TRUE
# [2,]  TRUE FALSE FALSE

I see that in devel you're now using H5 type H5T_STD_U8LE instead of H5T_STD_I32LE to store logical data (probably related to that discussion) and it makes a lot of sense to use an 8-bit type instead of a 32-bit type here. However the fact that H5T_STD_U8LE is an unsigned type is a problem because AFAIK the HDF5 library replaces negative values in the input (the NAs) with zeroes to make them "fit" in the unsigned type. This causes NAs to be stored as FALSEs.

I believe that if H5T_STD_I8LE was used instead of H5T_STD_U8LE then the HDF5 library would replace NAs with the most negative value in the H5T_STD_I8LE range (-128 ?) so I guess that would work. h5read() would still need to make sure to map -128 values back to INT_MIN (the value R uses to represent integer and logical NAs).

Thanks, H.

hpages commented 4 years ago

Mike, any thought on this? Thanks!

grimbough commented 4 years ago

Sorry, been a bit swamped with other things coming up to release time. I'll bump this up the to-do list

grimbough commented 4 years ago

This is actually a little trickier than I expect, as the 8bit H5T_STD_I8LE gets read into a RAWSXP and even -128 gets converted to 0 at that point. I'll think about a strategy for handling this more tomorrow.

grimbough commented 4 years ago

This should be fixed in version 2.33.3:

library(rhdf5)

m <- matrix(c(FALSE, TRUE, NA, FALSE, TRUE, NA), nrow=2)
file <- tempfile(fileext = '.h5')

h5write(m, file, "M1")
h5ls(file, all=TRUE)
#>   group name         ltype corder_valid corder cset       otype num_attrs
#> 0     /   M1 H5L_TYPE_HARD        FALSE      0    0 H5I_DATASET         1
#>    dclass        dtype  stype rank   dim maxdim
#> 0 INTEGER H5T_STD_I8LE SIMPLE    2 2 x 3  2 x 3
h5read(file, "M1")
#>       [,1]  [,2] [,3]
#> [1,] FALSE    NA TRUE
#> [2,]  TRUE FALSE   NA

It no longer prints the warning about converting a number to NA. For logicals you can only end up in the -128 → NA mapping if there's also an attribute with storage.mode == logical so I feel comfortable this isn't going to catch out many users that didn't create the original file in R.

This is a bug, so I'll give it a few days to make sure nothing funky happens in other packages, and then also patch the release branch.

hpages commented 4 years ago

Yep, works as expected now.

Not sure why you have to read into a RAWSXP first. Wouldn't it be possible to alloc a LGLSXP on detection of the storage.mode == logical attribute and read directly into that?

hpages commented 4 years ago

@grimbough I was wondering if you have any plans to port the "NA handling" stuff to the RELEASE_3_11 branch. In particular, the RELEASE_3_11 branch still uses H5T_STD_U8LE to store logical data so the issue of NAs getting stored as FALSEs remains there. Thanks!

grimbough commented 4 years ago

Hi @hpages I think I've pushed the required changes to RELEASE_3_11 (rhdf5 version 2.32.2).

hpages commented 4 years ago

Great. Thanks Mike!

hpages commented 4 years ago

Yep, I can confirm that rhdf5 2.32.2 addresses the original issue. Thanks!