grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5

rhdf5-NA.OK #84

Open WanxiangHuang opened 3 years ago

WanxiangHuang commented 3 years ago

Hi, I have a problem: whenever I use h5writeDataset(), an attribute called rhdf5-NA.OK is added to the dataset, as shown in the attached picture. Can someone help me with this? Thanks a lot.

grimbough commented 3 years ago

Does this cause an actual problem, or would you just rather it wasn't there?

The attribute is added so that rhdf5 knows the dataset originated in R. Then, when reading the dataset back into R, if it encounters the value corresponding to -2^31 it knows that value was originally an NA before the data were written to the file. If the attribute is missing, rhdf5 doesn't know the source of the dataset and will warn the user that R is unable to represent -2^31 and that the value has been converted to NA.
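
To make the -2^31 point concrete, here is a minimal base R sketch (no rhdf5 needed). It only illustrates that -2^31 lies just outside R's integer range, which is why that value cannot be told apart from NA when it is read back. The variable name min_int32 is purely illustrative.

```r
# Largest integer R can store is 2^31 - 1
.Machine$integer.max == 2^31 - 1

# -2^31 is outside R's integer range, so coercion yields NA
# (the coercion warning is suppressed here to keep the output quiet)
min_int32 <- suppressWarnings(as.integer(-2^31))
is.na(min_int32)
#> [1] TRUE

# One step inside the range is representable as an ordinary integer
as.integer(-2^31 + 1)
```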

My feeling was that adding this attribute was generally harmless. I'd assume most other software would ignore it, and adding it doesn't require much time or space. Please tell me if that isn't the case.

If you are sure that your dataset doesn't contain any NA values, it would be harmless to remove the attribute. However, checking this can take a long time for really large datasets, which is why I prefer to include the attribute on every dataset that is written.


If you want to get rid of the attribute, you can use the function h5deleteAttribute(), which is available in rhdf5 version 2.35.5. The example below demonstrates how to use it, as well as the warning message that is printed if the attribute isn't there and the dataset contains an NA.

library(rhdf5)

## create an HDF5 file with two datasets,
## dataset A contains an NA value
h5File <- tempfile(fileext = ".h5")
h5createFile(h5File)
h5write(c(1:10, NA), h5File, "A")
h5write(c(1:10), h5File, "B")

## both datasets are read silently
h5read(h5File, name = "A")
#>  [1]  1  2  3  4  5  6  7  8  9 10 NA
h5read(h5File, name = "B")
#>  [1]  1  2  3  4  5  6  7  8  9 10

## remove the rhdf5-NA.OK attribute
h5deleteAttribute(h5File, name = "A", attribute = "rhdf5-NA.OK")

## now a warning is presented for the dataset including NA
## no warning for dataset B where there is no NA
h5read(h5File, name = "A")
# The value -2^31 was detected in the dataset.
# This has been converted to NA within R.
#>  [1]  1  2  3  4  5  6  7  8  9 10 NA
h5read(h5File, name = "B")
#>  [1]  1  2  3  4  5  6  7  8  9 10
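
The advice to remove the attribute only when a dataset has no NA values could be sketched roughly as follows. The helper drop_na_ok_if_safe() and the variable exampleFile are hypothetical names introduced here, not part of rhdf5. The sketch assumes a version of rhdf5 that provides h5deleteAttribute(), and it reads the whole dataset into memory, so it carries the cost mentioned above for very large datasets.

```r
library(rhdf5)

# Hypothetical helper, not part of rhdf5: delete rhdf5-NA.OK only when the
# dataset, once read back into R, contains no NA values.
drop_na_ok_if_safe <- function(file, name) {
  values <- h5read(file, name = name)
  if (anyNA(values)) {
    return(invisible(FALSE))  # keep the attribute, NA values are present
  }
  h5deleteAttribute(file, name = name, attribute = "rhdf5-NA.OK")
  invisible(TRUE)
}

## same layout as the example above: A contains an NA, B does not
exampleFile <- tempfile(fileext = ".h5")
h5createFile(exampleFile)
h5write(c(1:10, NA), exampleFile, "A")
h5write(c(1:10), exampleFile, "B")

drop_na_ok_if_safe(exampleFile, "A")  # attribute is kept
drop_na_ok_if_safe(exampleFile, "B")  # attribute is removed
```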

The h5deleteAttribute() function is very new. If you are using an older version of rhdf5 you can delete the attribute with something like this:

## list attributes for dataset "B", just to confirm what is there
h5readAttributes(file = h5File, name = "B")
#> $`rhdf5-NA.OK`
#> [1] 1

## remove the attribute the long way
fid <- H5Fopen(h5File)
did <- H5Dopen(fid, name = "B")
H5Adelete(did, "rhdf5-NA.OK")
#> [1] 0
H5Dclose(did)
H5Fclose(fid)

## check again.  Now the NA.OK attribute is gone.
h5readAttributes(file = h5File, name = "B")
#> list()

mikej888 commented 3 years ago

I am a developer on https://github.com/riboviz/riboviz and this new attribute caused problems for our team too. We use h5diff to compare HDF5 files for equality within our regression tests, and the presence of this attribute causes h5diff comparisons to fail if HDF5 files produced in an environment with rhdf5 2.34.0 or above are compared to those produced in an environment with rhdf5 prior to 2.34.0 (even if all other attributes and data within the HDF5 files are identical).

An enhancement to rhdf5 would be to support an optional parameter that lets R code using rhdf5 specify that the rhdf5-NA.OK attribute should not be added to the HDF5 file. I do appreciate that longer term this will not be an issue as, in time, our users will all end up using rhdf5 2.34.0 or later anyway.

grimbough commented 3 years ago

Thanks @mikej888 for the feedback. I hadn't appreciated that the change would have this effect.

I wonder if an equivalent h5diff() function in rhdf5 with some more options than the standard h5diff would be useful.

mikej888 commented 3 years ago

@grimbough I think that an "equivalent h5diff() function in rhdf5 with some more options than the standard h5diff" would be very useful. We currently call h5diff via operating system calls, so being able to do such comparisons via in-code function calls would be great!
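
As a rough illustration of what such an in-R comparison might look like, the sketch below compares a single dataset across two files while ignoring the rhdf5-NA.OK attribute. It is not an existing rhdf5 function: same_dataset() and its ignore argument are hypothetical names, and it only handles one named dataset rather than whole files as h5diff does. It uses h5read(), h5readAttributes() and h5deleteAttribute(), so it assumes a version of rhdf5 that provides the latter.

```r
library(rhdf5)

# Hypothetical helper, not part of rhdf5: TRUE if dataset `name` holds the
# same values and attributes in both files, skipping attributes in `ignore`.
same_dataset <- function(file1, file2, name, ignore = "rhdf5-NA.OK") {
  attrs1 <- h5readAttributes(file1, name)
  attrs2 <- h5readAttributes(file2, name)
  keep1 <- setdiff(names(attrs1), ignore)
  keep2 <- setdiff(names(attrs2), ignore)
  same_attrs <- setequal(keep1, keep2) &&
    all(vapply(keep1,
               function(a) identical(attrs1[[a]], attrs2[[a]]),
               logical(1)))
  same_attrs && identical(h5read(file1, name), h5read(file2, name))
}

## two files with identical data; the attribute is removed from the second
## to mimic a file written by an older rhdf5
f1 <- tempfile(fileext = ".h5")
f2 <- tempfile(fileext = ".h5")
for (f in c(f1, f2)) {
  h5createFile(f)
  h5write(c(1:10), f, "B")
}
h5deleteAttribute(f2, name = "B", attribute = "rhdf5-NA.OK")

result <- same_dataset(f1, f2, "B")
result
```

Comparing attribute names and values separately, rather than calling identical() on the two attribute lists, avoids a spurious mismatch between an empty named list and an empty unnamed list.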