grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5

Warning when storing NA (integer value -2^63 replaced NA) #61

Closed jonocarroll closed 4 years ago

jonocarroll commented 4 years ago

I'd like to clarify the warning I get when storing NA_integer_ ...

library(rhdf5)
m <- matrix(c(0L, 1L, NA_integer_, 0L, 1L, NA_integer_), nrow=2)
h5write(m, "test.h5", "M1")
h5read("test.h5", "M1")
#      [,1] [,2] [,3]
# [1,]    0   NA    1
# [2,]    1    0   NA
# Warning message:
# In H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,  :
#   integer value -2^63 replaced NA. See the section 'Large integer data types' in the 'rhdf5' vignette for more details.

(rhdf5 2.30.1)

I see #58 has this (the issue there is more severe) and another discussion in #42 but is this still an expected warning? I spent quite a while trying to figure out why I apparently had large negative ints in my input data when I really only had some NA.

Is it possible to identify when NA is being used and avoid this warning? I couldn't immediately find where this was documented (if it is). The aforementioned section does not seem to appear in this vignette https://bioconductor.org/packages/release/bioc/vignettes/rhdf5/inst/doc/rhdf5.html

grimbough commented 4 years ago

Thanks for the report. I've set aside the next couple of days to look at the outstanding rhdf5 issues, hopefully I'll get round to addressing this by the end of the week.

grimbough commented 4 years ago

This message was intended to warn someone that had created an HDF5 file outside R that any instance of the "smallest integer" had been replaced by NA in the resulting R object. It looks like for 64-bit integers there's a typo and it should read "integer value -2^63 replaced by NA" - which might have made the intention a little clearer.

The side effect of this is that any R object that contains NA and is written to HDF5 will then trigger the warning when read back, because the original NA values will be stored as "smallest int" in the file. I guess this probably happens at least as frequently as someone reading a file generated outside R.
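
To illustrate why NA and "smallest int" coincide, here is a minimal base R sketch. It does not touch rhdf5 internals; it only shows that R's 32-bit integer type has no usable value at -2^31, since that bit pattern is reserved for NA. The variable names are purely illustrative.

```r
# R integers are 32-bit; the largest representable value is 2^31 - 1.
int_max <- .Machine$integer.max
int_max == 2^31 - 1

# -2^31 cannot be held as an ordinary R integer: coercing it yields NA
# (R emits a warning here, which is suppressed for the demonstration).
smallest <- suppressWarnings(as.integer(-2^31))
is.na(smallest)

# One above that value is a perfectly ordinary integer.
as.integer(-2^31 + 1)
```

The same reasoning applies to the -2^63 value mentioned in the warning for 64-bit integers.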

The warning is annoying, but if you're writing and reading things containing NA with rhdf5 they should be preserved despite the message.

I propose to add an attribute to anything written with rhdf5 containing NA values and use this to suppress the warning. Then it should only show up for someone encountering the original use case.

grimbough commented 4 years ago

As of rhdf5 v. 2.33.3 you shouldn't get this warning if the original file was created with rhdf5.

library(rhdf5)
m <- matrix(c(0L, 1L, NA_integer_, 0L, 1L, NA_integer_), nrow=2)
file <- tempfile(fileext = '.h5')
h5write(m, file, "M1")
h5read(file, "M1")
#>      [,1] [,2] [,3]
#> [1,]    0   NA    1
#> [2,]    1    0   NA

For a dataset not generated with rhdf5 the information is still printed, but downgraded to a message since there's nothing a user can do about R using those values to represent NA.

## This code removes the 'rhdf5-NA.OK' attribute to simulate data not written by rhdf5
fid <- H5Fopen(name = file)
did <- H5Dopen(fid, name = "M1")
H5Adelete(did, "rhdf5-NA.OK")
H5Dclose(did)
H5Fclose(fid)

h5read(file, "M1")
#> The value -2^31 was detected in the dataset.
#> This has been converted to NA within R.
#>      [,1] [,2] [,3]
#> [1,]    0   NA    1
#> [2,]    1    0   NA
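
If you want to know in advance whether a dataset will trigger the message, one option is to look for the attribute directly. The following is a rough sketch, assuming rhdf5 2.33.3 or later and that `H5Aexists()` is a suitable way to test for the 'rhdf5-NA.OK' attribute named above. Whether it returns TRUE depends on the rhdf5 version that wrote the file, so no expected output is shown.

```r
library(rhdf5)

# Write a small integer matrix containing NA to a temporary file.
m <- matrix(c(0L, 1L, NA_integer_, 0L, 1L, NA_integer_), nrow = 2)
file <- tempfile(fileext = ".h5")
h5write(m, file, "M1")

# Open the dataset and ask whether the 'rhdf5-NA.OK' attribute is attached.
fid <- H5Fopen(name = file)
did <- H5Dopen(fid, name = "M1")
has_flag <- H5Aexists(did, "rhdf5-NA.OK")
H5Dclose(did)
H5Fclose(fid)

has_flag
```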

jonocarroll commented 4 years ago

Confirmed resolved in 2.33.7 - thank you!!!