grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5
61 stars 22 forks source link

How to write values greater than 2^31 to a dataset of type INTEGER #21

Closed hpages closed 6 years ago

hpages commented 6 years ago

I know that HDF5 datasets of type INTEGER can store values greater than 2^31 and that I need to call rhdf5::h5read() with bit64conversion="double" to read those values into R.

However I don't know how to write them. rhdf5::h5write() silently truncates them to 2^31 - 1. For example:

library(rhdf5)
h5createFile("test.h5")
h5createDataset("test.h5", "A", dims=50, storage.mode="integer")
h5ls("test.h5")
#   group name       otype  dclass dim
# 0     /    A H5I_DATASET INTEGER  50
h5write(2e9, "test.h5", "A", start=1, count=1)
h5read("test.h5", "A", start=1, count=1)
# [1] 2000000000
h5write(3e9, "test.h5", "A", start=2, count=1)
h5read("test.h5", "A", start=1, count=2)
# [1] 2000000000 2147483647
h5read("test.h5", "A", start=1, count=2, bit64conversion="double")
# [1] 2000000000 2147483647

Any help would be appreciated. Thanks!

grimbough commented 6 years ago

I'm sure it's possible in the current form. I'll take a look at how to get this working once I've integrated HDF 1.10 with the package

hpages commented 6 years ago

Thanks Mike!

grimbough commented 6 years ago

I've added a new option to the storage.mode argument integer64. This allows you to create a dataset of type signed 64-bit integer and you can write your values >= 2^31 to this.

library(rhdf5)
h5createFile("test.h5")

h5createDataset("test.h5", "B", dims=2, storage.mode="integer64")

h5write(2e9, "test.h5", "B", start=1, count=1)
h5write(3e9, "test.h5", "B", start=2, count=1)

h5read("test.h5", "B", start=1, count=2)
# [1] 2000000000         NA
# warnings....
h5read("test.h5", "B", start=1, count=2, bit64conversion="double")
# [1] 2e+09 3e+09
h5read("test.h5", "B", start=1, count=2, bit64conversion="bit64")
# integer64
# [1] 2000000000 3000000000

The existing storage.mode="integer" only created a dataset with type signed 32-bit int. I think the possible values here were designed to correspond to the values returned by the R storage.mode() function, and this breaks that convention, but it feels like an acceptable change.

Alternatively, it is also possible to used other datatypes by utilising the H5type argument e.g for unsigned 32-bit integers

h5createDataset(h5File, "uint32", dims=50, H5type = "H5T_NATIVE_UINT32")
grimbough commented 6 years ago

I should also add that I'm not entirely sure why this works at all. We're passing R doubles and they are treated as H5T_NATIVE_DOUBLE in the internal H5Dwrite code, but we're writing to a dataset with type H5T_NATIVE_INT32 or H5T_NATIVE_INT64 and relying on HDF5 to do the conversion.

However the HDF5 User's Guide states:

he datatype of the source and destination must have the same datatype class ... integers can be converted to other integers, and floats to other floats, but integers cannot (yet) be converted to floats.

This might explain why there's no warning when values are going to be truncated since it's not clear to me this is actually supported.

hpages commented 6 years ago

Hi Mike,

Thanks for adding support for h5createDataset(..., storage.mode="integer64"). This sounds like a useful addition. Would be nice if one could pass an integer64 vector to h5write() so for example one could write as.integer64(1e18) + 1L to the dataset created with h5createDataset(..., storage.mode="integer64"). Right now, it seems impossible to write this value because of the loss of precision due to the usage of a double:

library(rhdf5)
h5createFile("test.h5")
h5createDataset("test.h5", "B", dims=2, storage.mode="integer64")
h5write(1e18 + 1L, "test.h5", "B", start=1, count=1)
h5read("test.h5", "B", start=1, count=1, bit64conversion="bit64")
# integer64
# [1] 1000000000000000000

Anyway, that's a different issue.

As far as my original issue is concerned, I realize now that all I needed to do was to use h5createDataset(..., H5type="H5T_NATIVE_UINT32"). Thanks for mentioning that!

Let me explain: when I wrote my original post, I didn't realilze that h5ls() would display INTEGER for many different possible integer representations. So when I saw that h5ls() displayed INTEGER for the dataset I was trying to write to, I naively assumed that this dataset should be able to represent 3e9 (because I've seen such big values stored in other datasets of type INTEGER). However I realize now that the dataset I was trying to write to was of type H5T_STD_I32LE (signed 32-bit integer). I can see this with h5ls(..., all=TRUE), which shows me the dtype column in addition to the dclass column. This is why writing 3e9 to it would truncate the value to 2^31-1. But if I create the dataset with h5createDataset(..., H5type="H5T_NATIVE_UINT32") like you suggested, now I can write 3e9 to it. My integer values are >= 0 and are not going to be bigger than 2^32 for now so using unsigned 32-bit integers is all I need. Problem solved for me!

Thanks again, H.