grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5

Integer being read as raw #106

Closed joaocarr closed 1 year ago

joaocarr commented 2 years ago

Hi,

I'm new to rhdf5. When I read the following dataset with h5read:

7 algorithm_run_flag H5I_DATASET INTEGER 62823

I'm getting the data in raw data type instead of integer. I can amend that afterwards, but I was wondering if this is an issue or just a newbie mistake.

Thanks

joaocarr commented 2 years ago

Just an update...

I'm getting a hexadecimal output when in fact I should be getting an integer.

grimbough commented 2 years ago

Can you share an example file?

I wonder if the data are 8-bit integers, which I think rhdf5 will treat as raw. Anything larger than that will be converted to an R integer. That's because if you write an R raw value using rhdf5, it will be saved as an 8-bit integer, and I try to make it so that if you write a specific R type you get the same thing back.

joaocarr commented 2 years ago

Thanks! I'm including a link to an example file. Indeed, I'm having those issues with datasets that were written as 8-bit unsigned integers. I'm using as.numeric() to convert back to integer, but I'm not sure if I'm losing any data in the process.
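
On the question of data loss: an R raw value holds exactly one byte (0 to 255), so converting a raw vector with as.integer() or as.numeric() should preserve every value. A minimal base R sketch that checks this over all 256 possible byte values (no rhdf5 involved; the variable names are illustrative only):

```r
# Every possible byte value as a raw vector (prints as 00 ... ff)
bytes <- as.raw(0:255)

# Convert the way the thread describes
ints <- as.integer(bytes)
nums <- as.numeric(bytes)

# Both conversions recover the original unsigned values exactly
stopifnot(identical(ints, 0:255))
stopifnot(all(nums == 0:255))

# The hexadecimal display is only how raw values print
print(as.raw(c(0, 1, 2, 255)))
print(as.integer(as.raw(c(0, 1, 2, 255))))
```

as.integer() is probably the closer match for flag-like data, since as.numeric() returns doubles.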

https://drive.google.com/file/d/1-QNwEZQN2Fv-zPNGgkC-ttUcLl_i-a11/view?usp=sharing

joaocarr commented 2 years ago

... I believe I'm not retrieving the right values for some datasets that are UINT8. If you have the time, the dataset "BEAM0000/predictor_limit_flag" was written as UINT8 and can only be 0, 1 or 2. However, when read with h5read I'm getting a raw vector with only "ff" values - this is converted to 255 if I use as.numeric() or as.integer()

grimbough commented 2 years ago

To me it looks like the values in BEAM0000/predictor_limit_flag really are all 255 if I use the h5dump command line tool (so ignoring rhdf5 entirely). Here's a truncated output from that command.

-> % h5dump -d BEAM0000/predictor_limit_flag GEDI04_A_2020310172137_O10764_01_T08883_02_002_01_V002.h5 
HDF5 "GEDI04_A_2020310172137_O10764_01_T08883_02_002_01_V002.h5" {
DATASET "BEAM0000/predictor_limit_flag" {
   DATATYPE  H5T_STD_U8LE
   DATASPACE  SIMPLE { ( 20316 ) / ( H5S_UNLIMITED ) }
   DATA {
   (0): 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
...
joaocarr commented 1 year ago

Really sorry about this very late reply. You are absolutely right: it seems the data producers are not encoding that dataset correctly. I also get 255 throughout the HDF5 files I have. My workaround for UINT8 datasets being read as raw is to convert them with as.numeric() after reading them with h5read.
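
For anyone applying the same workaround to multi-dimensional datasets: as.integer() and as.numeric() drop the dim attribute, so a matrix would come back as a flat vector. Below is a minimal sketch of a wrapper that restores the dimensions. The name raw_to_int is hypothetical and not part of rhdf5, and the h5read call is shown only as a comment, with a placeholder file path.

```r
# Hypothetical helper (not part of rhdf5): convert a raw vector or array
# to integer while keeping its dimensions. Non-raw input is returned as is.
raw_to_int <- function(x) {
  if (!is.raw(x)) {
    return(x)
  }
  out <- as.integer(x)  # values 0..255; attributes such as dim are dropped
  dim(out) <- dim(x)    # restore dimensions (no effect for plain vectors)
  out
}

# Stand-in for data read from an 8-bit dataset
m <- matrix(as.raw(c(0, 1, 2, 255)), nrow = 2)
print(raw_to_int(m))

# Intended usage with rhdf5 (assumes the package is installed and the
# file exists; "example.h5" is a placeholder path):
# flags <- raw_to_int(rhdf5::h5read("example.h5", "BEAM0000/predictor_limit_flag"))
```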