Closed · rcastelo closed this issue 4 years ago
Hi @rcastelo thanks for the pull request. I'll take a look in the next few days.
Presumably you also want to be able to read back into raw? I think this is why Bernd didn't include support for raw originally, as HDF5 doesn't have an easy way to determine whether an 8-bit integer should be treated as raw or integer when reading it in. Perhaps it would be sufficient to read all 8-bit integer data types as raw, or alternatively to include some sort of metadata as an attribute when writing from R.
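That attribute-based approach might look something like the sketch below. This is a hypothetical illustration, not rhdf5's actual implementation: it reuses the "storage.mode" attribute name that rhdf5 already employs for logicals, and the `h5writeAttribute()` / `h5readAttributes()` calls are assumptions about how the plumbing could be wired up.

```r
## Hypothetical sketch of the metadata-attribute idea (not rhdf5's actual
## internals). Raw data is written as 8-bit-representable integers and the
## intended R type is recorded in a "storage.mode" attribute on the dataset.
library(rhdf5)

h5file <- tempfile(fileext = ".h5")
h5createFile(h5file)

raw_vals <- as.raw(1:10)
h5write(as.integer(raw_vals), h5file, "raw_data")

## Attach the attribute to the dataset.
fid <- H5Fopen(h5file)
did <- H5Dopen(fid, "raw_data")
h5writeAttribute("raw", did, name = "storage.mode")
H5Dclose(did)
H5Fclose(fid)

## A reader could then check the attribute and coerce back:
storage_mode <- h5readAttributes(h5file, "raw_data")$storage.mode
vals <- h5read(h5file, "raw_data")
if (identical(as.character(storage_mode), "raw"))
    vals <- as.raw(vals)
```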
Yes, I want to be able to read back into raw but, apparently, the update by @hpages to HDF5Array is already doing that:
suppressPackageStartupMessages(library(HDF5Array))
set.seed(123)
len <- sample(1:10000, size=10000, replace=TRUE)
val <- sample(0:255, size=10000, replace=TRUE)
rawrle <- Rle(as.raw(rep(val, len)))
rawrlearr <- RleArray(rawrle, dim=length(rawrle))
rawrlearrh5 <- writeHDF5Array(rawrlearr, "rawrlearr.h5", "rlearr")
Rle(rawrlearrh5[1:10])
raw-Rle of length 10 with 1 run
Lengths: 10
Values : 11
Note that I'm already getting back a raw-Rle object, but I'm probably missing something here; unfortunately I'm not familiar with the internals of HDF5Array and rhdf5.
@grimbough
Perhaps it would be sufficient to read all 8-bit integer data types as raw,
Sounds very reasonable to me. I would even suggest to do this for anything that is 8-bit:
> grep("8", h5const("H5T"), value=TRUE)
[1] "H5T_STD_I8BE" "H5T_STD_I8LE" "H5T_STD_U8BE"
[4] "H5T_STD_U8LE" "H5T_STD_B8BE" "H5T_STD_B8LE"
[7] "H5T_NATIVE_B8" "H5T_NATIVE_INT8" "H5T_NATIVE_UINT8"
[10] "H5T_NATIVE_INT_LEAST8" "H5T_NATIVE_UINT_LEAST8" "H5T_NATIVE_INT_FAST8"
[13] "H5T_NATIVE_UINT_FAST8"
H5Tget_size() returns the size (in bytes) of a given H5 type, so it can be useful here.
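For instance, the 8-bit members of that list could in principle be picked out by size rather than by name. This is a hedged sketch: it assumes H5Tget_size() accepts the datatype constants returned by h5const(), which may need adapting to the actual low-level API.

```r
library(rhdf5)

## Keep only the H5 types that occupy exactly one byte.
## Assumes H5Tget_size() can be called on the constants from h5const().
all_types <- h5const("H5T")
is_one_byte <- vapply(all_types,
                      function(tp) H5Tget_size(tp) == 1,
                      logical(1))
all_types[is_one_byte]
```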
or alternatively include some sort of metadata as an attribute when writing from R
I think that's what Bernd did to deal with the "logical" situation, i.e. he looks at the "storage.mode" attribute and, if it's set, this overrides the default (which would be to read the data as "integer"):
h5write(matrix(logical(6), ncol=2), "tt.h5", "A")
h5ls("tt.h5", all=TRUE)
# group name ltype corder_valid corder cset otype num_attrs
# 0 / A H5L_TYPE_HARD FALSE 0 0 H5I_DATASET 1
# dclass dtype stype rank dim maxdim
# 0 INTEGER H5T_STD_I32LE SIMPLE 2 3 x 2 3 x 2
h5read("tt.h5", "A")
# [,1] [,2]
# [1,] FALSE FALSE
# [2,] FALSE FALSE
# [3,] FALSE FALSE
(mmh... looks like an 8-bit H5 type could be used instead of a 32-bit H5 type for this.)
The same mechanism could be used for the "raw" situation, except that, by default (i.e. if the "storage.mode" attribute is not set), 8-bit data is read as "raw". Better to use the smallest R type by default. It's easy enough for the user to change the storage mode to "integer" (or "logical") once the data is in R if they want to.
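Coercing back really is a one-liner in base R once the data has come in as "raw", which supports reading the smallest type by default:

```r
## Base R coercions from a raw vector; no extra packages needed.
x <- as.raw(c(0, 1, 255))
as.integer(x)              # 0 1 255
as.logical(as.integer(x))  # FALSE TRUE TRUE
storage.mode(x)            # "raw"
```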
@rcastelo: HDF5Array uses its own reading function (h5mread()) to bring H5 data into R, so what happens exactly with HDF5Array is a different story. Yesterday I modified h5mread() to make it read 8-bit H5 data as "raw" instead of "integer". The change is part of the following commit.
Thanks for the insights. Is there any specific reason you've selected H5T_STD_U8LE as the datatype? Elsewhere I've picked the NATIVE versions, e.g. H5T_NATIVE_INT32, but I don't think that was necessarily an informed choice; it just sounded like the most universal option. I'm happy to be guided if there's a good reason for using something else; the HDF5 docs don't seem to offer much guidance.
My understanding is that the NATIVE types are mapped to real types. When you choose H5T_NATIVE_INT32 or H5T_NATIVE_UINT8 to create datasets, inspecting the datasets with h5ls("myfile.h5", all=TRUE) reveals that they were actually created with H5T_STD_I32LE or H5T_STD_U8LE. Now I suspect that this mapping is not hardcoded and depends on the endianness of your system. The world today is largely dominated by little-endian machines, so for 99.9% of our users H5T_NATIVE_INT32 will also be mapped to H5T_STD_I32LE. But for 0.1% of them it will be mapped to H5T_STD_I32BE.
The reason I'm favoring the use of real types over NATIVE types is to have "integer" and "raw" data mapped to the same H5 type for everybody, even for users who are on big-endian systems. Said otherwise, I like that the behavior of the writing functions is deterministic.
The small downside of this is that people on big-endian systems will pay a small performance price because the bytes will need to be flipped for them. They will actually pay this price again when they read the data in. But in a context where the main purpose of dumping data into an HDF5 file is to share with others (e.g. via ExperimentHub), I think it's preferable to have them produce something that is optimized for the rest of the world rather than for their own exotic hardware.
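The local byte order is easy to check from R, which makes it possible to see which side of that trade-off a given machine is on (a small illustrative sketch using only base R):

```r
## Base R reports the platform's endianness directly.
.Platform$endian   # "little" on the vast majority of modern machines

## Equivalently, inspect the bytes of a small integer written out raw:
writeBin(1L, raw(), size = 4)
## on a little-endian machine the low-order byte comes first: 01 00 00 00
```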
Reading and writing raw vectors should now be supported in rhdf5 2.31.6:
library(rhdf5)
h5File <- tempfile(pattern = "raw_values", fileext = ".h5")
set.seed(1234)
raw_vals <- as.raw(sample(0:255, size = 10))
h5createFile(h5File)
h5write(obj = raw_vals, file = h5File, name = "raw")
Reading this back, we get a raw vector:
> h5read(h5File, name = "raw")
[1] 1b 4f f9 95 64 eb 6e 88 84 a5
And we can check the on disk datatype via:
> system2("h5dump", args = c("-H", h5File))
HDF5 "/tmp/RtmptUWvz8/raw_values2d6959559df1.h5" {
GROUP "/" {
DATASET "raw" {
DATATYPE H5T_STD_U8LE
DATASPACE SIMPLE { ( 10 ) / ( 10 ) }
}
}
}
Let me know if you encounter any issues.
Perfect, thanks! Just got rid of my no-longer-needed workarounds in HDF5Array. Also thanks for the tip about using h5dump to check the type.
@grimbough @hpages you're fantastic!! Thank you very much for providing native support for the raw data type in rhdf5 and HDF5Array in just a few days. From my end, you may close the issue.
Hi,
I opened an issue on the HDF5Array package to ask for support for the 'raw' datatype. This has been provided, with the caveat that the h5write() function from the rhdf5 package does not support the 'raw' datatype, and for that reason the update to the HDF5Array package converts the data into 'integer' before calling h5write(). Ideally, h5write() should natively support the 'raw' datatype.

If I disable the coercion to integer described in that issue, then I get the following error with rhdf5 2.31.5:

This pull request fixes that error (having commented out the integer coercion I mentioned in HDF5Array). However, the resulting HDF5 file still has the same size as with the coercion, which makes me think that my update, while it fixes the error, might not be sufficient to natively support the 'raw' datatype. I hope you can easily find out whether the pull request is correct and what else should be modified to fully enable support for the 'raw' datatype.

Thanks!
robert.