grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5

h5ls doesn't output all of the data in the H5 file #107

Closed acope3 closed 2 years ago

acope3 commented 2 years ago

Hello,

I'm a developer for the riboviz 2 software. We make use of H5 files for storing data. I'm currently experiencing some issues with h5ls.

For example, I have a dataset with 3242 genes. If I run gene_names <- rhdf5::h5ls(h5.file,recursive=1)$names, I get a list of 3225 gene names along with 17 of the following warnings: In rhdf5::h5ls(h5.file, recursive = 1) : Identical objects found. However, if I look for the 17 genes not output by h5ls using h5read, those data appear to be in the H5 file. Any insights into what's going on? Every gene has a unique identifier. Let me know if you need more information or if you would like me to provide the H5 file I'm using.

Thank you.

grimbough commented 2 years ago

Thanks for the report. It's not something I've seen before, so it'd be great if you could share the file and I'll take a look at what's happening. I guess the code for identifying duplicate object names is overly zealous.

acope3 commented 2 years ago

MGCL2_1.h5.zip Sorry for the delay on this. Here is one of the files I was experiencing the issue with. For your reference, I was using rhdf5_2.36.0.

grimbough commented 2 years ago

Thanks for the file. So far I've only unpacked them, but it seems suspicious to me that you're getting 17 warnings when there are 17 linked H5 files. I haven't worked with files that use linking like this very often, so I expect the code is not well tested for it.

Will report back when I've done some more digging.

grimbough commented 2 years ago

This happened because h5ls() uses the address of a group inside a file to detect duplicated groups, which stops it getting stuck in an infinite loop visiting the same object over and over. However, if there are external links, as in this case, the address refers to a location within the target file. In your files the first group is always found at address 800, so h5ls() included only one instance. I've updated the code to also use the file number, which should prevent this from happening again.
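
As a rough illustration of the collision described above, here is a base-R sketch that does not call rhdf5 at all. The table of visited objects, its column names (file_no, address) and the gene names are all hypothetical; it only mimics the idea of keying duplicates on the address alone versus on the file number plus the address, and is not the actual rhdf5 implementation.

```r
# Toy records for objects seen during a listing. All values are made up:
# three groups live in three different target files but share address 800,
# and "geneA_link" is a genuine second route to the same object as "geneA".
visited <- data.frame(
  name    = c("geneA", "geneB", "geneC", "geneA_link"),
  file_no = c(1L, 2L, 3L, 1L),
  address = c(800, 800, 800, 800),
  stringsAsFactors = FALSE
)

# Key on the address only: groups in different files collide.
dup_by_address <- duplicated(visited$address)
kept_by_address <- visited$name[!dup_by_address]

# Key on (file number, address): only the true duplicate is dropped.
dup_by_pair <- duplicated(visited[, c("file_no", "address")])
kept_by_pair <- visited$name[!dup_by_pair]

print(kept_by_address)  # "geneA"
print(kept_by_pair)     # "geneA" "geneB" "geneC"
```

With the address-only key, every group after the first is treated as already visited, which matches the symptom of missing names. With the combined key, the real duplicate is still skipped, so the protection against revisiting the same object is kept.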

This will be available in rhdf5 2.38.1 and 2.39.6.

library(rhdf5)
packageVersion("rhdf5")
#> [1] '2.39.6'
groups <- h5ls("/tmp/MGCL2_1.h5/MGCL2_1.h5", recursive = 1)
dim(groups)
#> [1] 4242    5
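
A script that depends on this fix could check the installed version first. The helper below is a minimal sketch: has_h5ls_fix is a hypothetical name, and it assumes that every version after the two mentioned above also carries the fix.

```r
# Hypothetical helper: TRUE if a version string is at or beyond the versions
# named in this thread (2.38.1 on the release branch, 2.39.6 on devel).
# Assumption: all later versions also include the fix.
has_h5ls_fix <- function(v) {
  v <- package_version(v)
  if (v >= "2.39.0") {
    v >= "2.39.6"   # devel series and anything newer
  } else {
    v >= "2.38.1"   # release series
  }
}

has_h5ls_fix("2.36.0")  # the version used in the original report
has_h5ls_fix("2.38.1")
```

In practice the argument would come from as.character(packageVersion("rhdf5")).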

Please re-open the issue if it continues to be a problem.