grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5
61 stars 22 forks

MD5 check sums don't match up #51

Closed LTLA closed 4 years ago

LTLA commented 4 years ago

This is twilight zone stuff:

library(rhdf5)

set.seed(100)
X <- runif(20)

collected <- numeric(1000)
for (i in 1:1000) {
    h5write(file="a.h5", name="X", X)
    collected[i] <- digest::digest(file="a.h5")
    unlink("a.h5")
}

table(collected)
## collected
## 4820c2b9441af7608ef8c7cbf1868823 54bf8c102c775f3c8f902b3c7f177f7f
##                              256                              258
## a94dbc125823cb764bcbc8aedd7bd38e b905e312e56a6ca6910adff6b4b2be80
##                              248                              113
## ba9cd2f5cd1debaa5b161c81cf6ec877
##                              125
Session information
```
R version 3.6.1 Patched (2019-10-31 r77366)
Platform: x86_64-apple-darwin17.7.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /Users/luna/Software/R/R-3-6-branch/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/R-3-6-branch/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] rhdf5_2.30.1

loaded via a namespace (and not attached):
[1] compiler_3.6.1 digest_0.6.23 Rhdf5lib_1.8.0
```

This happens on my Linux cluster as well. The question is: is rhdf5 doing this, or is it a problem with the underlying HDF5 library? The actual content of the file seems to be the same across runs, but this fluctuation defeats the purpose of having these checksums.

Perhaps h5py/h5py#225 could be helpful.

grimbough commented 4 years ago

Eurgh, this sounds horrible! I'll put together something equivalent in C to try and identify whether it's rhdf5 or not. I can't think what it would be doing that might result in different structures on disk. Maybe the compression is non-deterministic?
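
One way to narrow this down before writing any C might be to compare two of the differing files byte by byte. The sketch below uses base R only; `diff_offsets` is a hypothetical helper (not part of rhdf5), and the two toy files merely stand in for a pair of `a.h5` files with different hashes. If only a handful of offsets at a fixed position differ, that would point to a metadata field rather than to the data or its compression.

```r
# Hypothetical helper: 1-based byte offsets at which two files differ.
diff_offsets <- function(path_a, path_b) {
    a <- readBin(path_a, what = "raw", n = file.size(path_a))
    b <- readBin(path_b, what = "raw", n = file.size(path_b))
    if (length(a) != length(b)) {
        stop("files differ in size: ", length(a), " vs ", length(b), " bytes")
    }
    which(a != b)
}

# Toy demonstration: two files with the same payload and a 4-byte header
# whose last byte differs (standing in for two HDF5 files).
f1 <- tempfile()
f2 <- tempfile()
payload <- as.raw(1:16)
writeBin(c(as.raw(c(0x5e, 0x0b, 0xe1, 0x00)), payload), f1)
writeBin(c(as.raw(c(0x5e, 0x0b, 0xe1, 0x01)), payload), f2)

differing <- diff_offsets(f1, f2)
differing
```

For real files one would keep two copies of `a.h5` with different digests instead of deleting them, and pass those two paths to the helper.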

grimbough commented 4 years ago

This definitely seems related to the time the file is created. Running your example I get 15 hashes, and identical hashes appear in consecutive runs within collected, e.g.

> length(unique(collected)) 
[1] 15
> rle(collected) 
Run Length Encoding   
lengths: int [1:15] 47 52 67 72 68 72 75 72 73 74 ...   
values : chr [1:15] "76c952c9a2fef57425b0e96de6dd8e14" "ddb7b3ccec3b850f629f9d8db5322101" "14fdcdd670da5da1a2b573e97cd90695" ... 

I'll take a look at where it might be adding timestamps.
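
The run-length pattern is what one would expect if something with whole-second resolution ends up inside the file. The following sketch mimics that effect in base R. `write_with_stamp` is a hypothetical stand-in for `h5write()`, the timestamps are simulated, and `tools::md5sum` is used in place of `digest::digest`; it illustrates the pattern only and makes no claim about what HDF5 actually writes.

```r
# Hypothetical stand-in for h5write(): writes the data preceded by a
# timestamp truncated to whole seconds.
write_with_stamp <- function(path, x, now = Sys.time()) {
    con <- file(path, "wb")
    on.exit(close(con))
    writeBin(as.integer(unclass(now)), con)  # whole seconds since the epoch
    writeBin(x, con)
}

set.seed(100)
X <- runif(20)
path <- tempfile(fileext = ".bin")

# Simulate 12 writes spread evenly over three consecutive seconds.
stamps <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC") + rep(0:2, each = 4)
hashes <- character(length(stamps))
for (i in seq_along(stamps)) {
    write_with_stamp(path, X, now = stamps[i])
    hashes[i] <- unname(tools::md5sum(path))
    unlink(path)
}

# One run of identical hashes per simulated second.
runs <- rle(hashes)
runs$lengths
```

Identical data written within the same second hashes identically, and the hash changes each time the clock ticks over, which matches the shape of the `rle()` output above.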

grimbough commented 4 years ago

I'm still going to look into this, but would it be sufficient to do this?

digest::digest(h5ls('a.h5'))

That gives me consistent results:

> table(collected)
collected
96936dcd6632a25f1d39a581c5e02683 
                            1000 
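
Note that `h5ls()` returns only the listing (group, name, object type, data class, dimensions), so a digest of it would not change if the values in `X` changed. Below is a minimal sketch of a content-based alternative, assuming it is acceptable to read every dataset into memory. `content_digest` is a hypothetical helper, not an rhdf5 function, and it ignores attributes.

```r
# Hypothetical helper: hash the values of every dataset in an HDF5 file,
# independent of any timestamps stored in the file itself.
# Requires the rhdf5 and digest packages to be installed.
content_digest <- function(path) {
    listing <- rhdf5::h5ls(path)
    is_dataset <- listing$otype == "H5I_DATASET"
    # Build full dataset paths such as "/X" or "/group/X".
    full_names <- paste0(sub("/$", "", listing$group[is_dataset]),
                         "/", listing$name[is_dataset])
    full_names <- sort(full_names)  # fixed order, so the digest is stable
    contents <- lapply(full_names, function(nm) rhdf5::h5read(path, nm))
    names(contents) <- full_names
    digest::digest(contents)
}

# Usage (not run here):
# collected[i] <- content_digest("a.h5")
```

This trades speed for sensitivity to the actual data, so for large files the `h5ls()` digest may still be the more practical check.
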

grimbough commented 4 years ago

Definitely timestamp related. Here's a version where I turn off recording the last time the dataset was touched:

h5write2 <- function() {
    h5File <- "a.h5"
    if(file.exists(h5File)) file.remove(h5File)
    fid <- H5Fcreate(name = h5File)
    dcpl <- H5Pcreate(type = 'H5P_DATASET_CREATE')
    rhdf5:::H5Pset_obj_track_times(dcpl, FALSE)
    sid <- H5Screate_simple(dims = 10)
    did <- H5Dcreate(fid, name = "X", dtype_id = "H5T_NATIVE_DOUBLE", h5space = sid, dcpl = dcpl)
    H5Dwrite(did, X)
    H5Dclose(did)
    H5Sclose(sid)
    H5Pclose(dcpl)
    H5Fclose(fid)
}

set.seed(100)
X <- runif(20)

collected <- numeric(1000)
for (i in 1:1000) {
    h5write2()
    collected[i] <- digest::digest(file="a.h5")
    unlink("a.h5")
}
> table(collected)
collected
a6690eaca69a9e9c36d6dbc7adcd972f 
                            1000 

I'm not sure exactly what to do with the knowledge, but at least this isolates the cause.

LTLA commented 4 years ago

Can't we just always turn off the time stamp? Or at least provide a high-level option to do so, rather than requiring users to reach deep inside rhdf5:::H5Pset_obj_track_times.

grimbough commented 4 years ago

I wanted to think about whether changing the default was sensible vs adding arguments. I prefer to keep the list of arguments short if I can, but I'm also in favour of staying close to default HDF5 settings.

Given that rhdf5 already makes quite a few choices for the user if they're employing h5write(), it doesn't seem too bad to turn off the timestamps. In fact, there's currently no way of accessing them via rhdf5, so there's a reasonable chance that nobody uses them.

So I've made it do this automatically in version 2.31.8 and you should now get:

set.seed(100)
X <- runif(20)

collected <- numeric(1000)
for (i in 1:1000) {
    h5write(file="a.h5", name="X", X)
    collected[i] <- digest::digest(file="a.h5")
    unlink("a.h5")
}

table(collected)
collected
234ab25049fd0aeea3cc4027998d6770 
                            1000
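
For anyone who wants to guard against a regression, a check along these lines could be used. This is a sketch under assumptions: `is_reproducible` is a hypothetical helper, the two toy writers stand in for `h5write()` so that the block runs in base R, and `tools::md5sum` replaces `digest::digest`.

```r
# Hypothetical helper: call writer(path) twice, more than one second apart,
# and report whether both files have the same MD5 checksum.
is_reproducible <- function(writer, delay = 1.1) {
    path <- tempfile()
    on.exit(unlink(path))
    writer(path)
    first <- unname(tools::md5sum(path))
    unlink(path)
    Sys.sleep(delay)  # cross at least one whole-second boundary
    writer(path)
    second <- unname(tools::md5sum(path))
    identical(first, second)
}

set.seed(100)
X <- runif(20)

# Toy writers: one embeds the current second in the file, one does not.
stamped <- function(path) {
    con <- file(path, "wb")
    on.exit(close(con))
    writeBin(as.integer(unclass(Sys.time())), con)
    writeBin(X, con)
}
plain <- function(path) writeBin(X, path)

stamped_ok <- is_reproducible(stamped)  # FALSE: the timestamp changes the bytes
plain_ok <- is_reproducible(plain)      # TRUE: same data, same bytes
```

With rhdf5 installed, the writer would be something like `function(path) rhdf5::h5write(X, file = path, name = "X")`; going by the comment above, that should return `TRUE` from version 2.31.8 onwards.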