grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5

Support for writing raw datatype. #55

Closed rcastelo closed 4 years ago

rcastelo commented 4 years ago

Hi,

I opened an issue on the HDF5Array package asking for support for the 'raw' datatype. That support has been provided, with the caveat that the h5write() function from the rhdf5 package does not handle the 'raw' datatype; for that reason the HDF5Array update converts the data to 'integer' before calling h5write(). Ideally, h5write() should support the 'raw' datatype natively.

If I disable the coercion to integer described in that issue, I get the following error with rhdf5 2.31.5:

suppressPackageStartupMessages(library(HDF5Array))
set.seed(123)
len <- sample(1:10000, size=10000, replace=TRUE)
val <- sample(0:255, size=10000, replace=TRUE)
rawrle <- Rle(as.raw(rep(val, len)))
rawrlearr <- RleArray(rawrle, dim=length(rawrle))
rawrlearrh5 <- writeHDF5Array(rawrlearr, "rawrlearr.h5", "rlearr") 
Error in H5Dwrite(h5dataset, obj, h5spaceMem = h5spaceMem, h5spaceFile = h5spaceFile) : 
  Writing 'raw' not supported.

This pull request fixes that error (with the integer coercion I mentioned in HDF5Array commented out). However, the resulting HDF5 file still has the same size as with the coercion, which makes me think that my update, while it fixes the error, might not be sufficient to natively support the 'raw' datatype. I hope you can easily determine whether the pull request is correct and what else should be modified to fully enable support for the 'raw' datatype.

Thanks!

Robert.

grimbough commented 4 years ago

Hi @rcastelo thanks for the pull request. I'll take a look in the next few days.

Presumably you also want to be able to read the data back into raw? I think this is why Bernd didn't include support for raw originally: HDF5 doesn't have an easy way to determine whether an 8-bit integer should be treated as raw or integer when reading it in. Perhaps it would be sufficient to read all 8-bit integer datatypes as raw, or alternatively to include some sort of metadata as an attribute when writing from R.

rcastelo commented 4 years ago

Yes, I want to be able to read back into raw, but apparently the update by @hpages to HDF5Array is already doing that:

suppressPackageStartupMessages(library(HDF5Array))
set.seed(123)
len <- sample(1:10000, size=10000, replace=TRUE)
val <- sample(0:255, size=10000, replace=TRUE)
rawrle <- Rle(as.raw(rep(val, len)))
rawrlearr <- RleArray(rawrle, dim=length(rawrle))
rawrlearrh5 <- writeHDF5Array(rawrlearr, "rawrlearr.h5", "rlearr")
Rle(rawrlearrh5[1:10])
raw-Rle of length 10 with 1 run
  Lengths: 10
  Values : 11

Note that I'm already getting back a raw-Rle object. But I'm probably missing something here; unfortunately, I'm not familiar with the internals of HDF5Array and rhdf5.

hpages commented 4 years ago

@grimbough

Perhaps it would be sufficient to read all 8-bit integer data types as raw,

Sounds very reasonable to me. I would even suggest doing this for anything that is 8-bit:

> grep("8", h5const("H5T"), value=TRUE)
 [1] "H5T_STD_I8BE"           "H5T_STD_I8LE"           "H5T_STD_U8BE"          
 [4] "H5T_STD_U8LE"           "H5T_STD_B8BE"           "H5T_STD_B8LE"          
 [7] "H5T_NATIVE_B8"          "H5T_NATIVE_INT8"        "H5T_NATIVE_UINT8"      
[10] "H5T_NATIVE_INT_LEAST8"  "H5T_NATIVE_UINT_LEAST8" "H5T_NATIVE_INT_FAST8"  
[13] "H5T_NATIVE_UINT_FAST8" 

H5Tget_size() returns the size (in bytes) of a given H5 type, so it could be useful here.

or alternatively include some sort of meta data as an attribute when writing from R

I think that's what Bernd did to deal with the "logical" situation, i.e. he looks at the "storage.mode" attribute and, if it's set, it overrides the default (which would be to read the data as "integer"):

h5write(matrix(logical(6), ncol=2), "tt.h5", "A")

h5ls("tt.h5", all=TRUE)
#  group name         ltype corder_valid corder cset       otype num_attrs
# 0     /    A H5L_TYPE_HARD        FALSE      0    0 H5I_DATASET         1
#    dclass         dtype  stype rank   dim maxdim
# 0 INTEGER H5T_STD_I32LE SIMPLE    2 3 x 2  3 x 2

h5read("tt.h5", "A")
#       [,1]  [,2]
# [1,] FALSE FALSE
# [2,] FALSE FALSE
# [3,] FALSE FALSE

(mmh... looks like an 8-bit H5 type could be used instead of a 32-bit H5 type for this.)

The same mechanism could be used for the "raw" situation, except that, by default (i.e. if the "storage.mode" attribute is not set), 8-bit data would be read as "raw". Better to use the smallest R type by default; it's easy enough for the user to change the storage mode to "integer" (or "logical") once the data is in R if they want to.
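To illustrate the last point, changing the storage mode of a vector once it is in R is a one-liner in plain base R (no HDF5 involved; the data below is made up for the example):

```r
# Sketch (base R only, hypothetical data): coercing a vector that was
# read in as "raw" to "integer" after the fact, as suggested above.
x <- as.raw(c(0, 1, 255))
storage.mode(x)              # "raw"
storage.mode(x) <- "integer" # coerce in place to the wider R type
x                            # now an integer vector: 0 1 255
```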

@rcastelo : HDF5Array uses its own reading function (h5mread()) to bring H5 data into R, so what happens exactly with HDF5Array is a different story. Yesterday I modified h5mread() to make it read 8-bit H5 data as "raw" instead of "integer". The change is part of the following commit.

grimbough commented 4 years ago

Thanks for the insights. Is there any specific reason you've selected H5T_STD_U8LE as the datatype? Elsewhere I've picked the NATIVE versions, e.g. H5T_NATIVE_INT32, but I don't think that was necessarily an informed choice; it just sounded like the most universal option. I'm happy to be guided if there's a good reason for using something else; the HDF5 docs don't seem to offer much guidance.

hpages commented 4 years ago

My understanding is that the NATIVE types are aliases mapped to real storage types. When you choose H5T_NATIVE_INT32 or H5T_NATIVE_UINT8 to create a dataset, inspecting it with h5ls("myfile.h5", all=TRUE) reveals that it was actually created with H5T_STD_I32LE or H5T_STD_U8LE. I suspect this mapping is not hardcoded but depends on the endianness of your system. The world today is largely dominated by little-endian machines, so for 99.9% of our users H5T_NATIVE_INT32 will be mapped to H5T_STD_I32LE, but for the remaining 0.1% it will be mapped to H5T_STD_I32BE.

The reason I'm favoring the use of real types over NATIVE types is to have "integer" and "raw" data mapped to the same H5 type for everybody, even for users on big-endian systems. In other words, I like that the behavior of the writing functions is deterministic.

The small downside is that people on big-endian systems will pay a small performance price because the bytes will need to be flipped for them; they will pay this price again when they read the data back in. But in a context where the main purpose of dumping data into an HDF5 file is to share it with others (e.g. via ExperimentHub), I think it's preferable to have them produce something that is optimized for the rest of the world rather than for their own exotic hardware.
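The byte-flipping point can be demonstrated with base R alone (no HDF5 involved), by serializing the same 32-bit integer in both byte orders:

```r
# Sketch (base R only): the same 32-bit integer written with both byte
# orders, showing why a fixed on-disk type like H5T_STD_I32LE gives
# deterministic files while NATIVE types follow the machine's order.
con <- rawConnection(raw(0), "wb")
writeBin(1L, con, size = 4, endian = "little")
writeBin(1L, con, size = 4, endian = "big")
bytes <- rawConnectionValue(con)
close(con)
bytes[1:4]        # 01 00 00 00  (little-endian layout)
bytes[5:8]        # 00 00 00 01  (big-endian layout)
.Platform$endian  # the order a NATIVE type would follow on this machine
```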

grimbough commented 4 years ago

Reading and writing raw vectors should now be supported in rhdf5 2.31.6:

h5File <- tempfile(pattern = "raw_values", fileext = ".h5")
set.seed(1234)
raw_vals <- as.raw(sample(0:255, size = 10))
h5createFile(h5File)
h5write(obj = raw_vals, file = h5File, name = "raw")

Reading this back, we get a raw vector:

> h5read(h5File, name = "raw")
 [1] 1b 4f f9 95 64 eb 6e 88 84 a5

And we can check the on-disk datatype via:

> system2("h5dump", args = c("-H", h5File))
HDF5 "/tmp/RtmptUWvz8/raw_values2d6959559df1.h5" {
GROUP "/" {
   DATASET "raw" {
      DATATYPE  H5T_STD_U8LE
      DATASPACE  SIMPLE { ( 10 ) / ( 10 ) }
   }
}
}

Let me know if you encounter any issues.

hpages commented 4 years ago

Perfect, thanks! I just got rid of my no-longer-needed workarounds in HDF5Array. Also, thanks for the tip about using h5dump to check the type.

rcastelo commented 4 years ago

@grimbough @hpages you're fantastic!! Thank you very much for providing native support for the raw datatype in rhdf5 and HDF5Array in just a few days. From my end, you may close the issue.