LTLA closed 4 years ago
Eurgh, this sounds horrible! I'll put together something equivalent in C to try and identify whether it's rhdf5 or not. I can't think what it would be doing that might result in different structures on disk. Maybe the compression is non-deterministic?
This definitely seems related to the time the file is created. Running your example I get 15 distinct hashes, and they appear as consecutive runs in collected, e.g.
> length(unique(collected))
[1] 15
> rle(collected)
Run Length Encoding
lengths: int [1:15] 47 52 67 72 68 72 75 72 73 74 ...
values : chr [1:15] "76c952c9a2fef57425b0e96de6dd8e14" "ddb7b3ccec3b850f629f9d8db5322101" "14fdcdd670da5da1a2b573e97cd90695" ...
I'll take a look at where it might be adding timestamps.
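To illustrate why a per-second timestamp would give exactly this run pattern, here is a minimal base R simulation. It does not touch HDF5 at all: write_stamped() is a hypothetical stand-in for a writer that embeds the creation time (in whole seconds) ahead of otherwise identical data, and the timestamps are faked so the result is deterministic.

```r
# Hypothetical stand-in for a file writer that stores a creation time
# (whole seconds) in a header, followed by the payload.
write_stamped <- function(path, payload, stamp_seconds) {
  con <- file(path, open = "wb")
  on.exit(close(con))
  writeBin(as.integer(stamp_seconds), con)  # timestamp header
  writeBin(payload, con)                    # identical data every time
}

payload <- as.numeric(1:20)
path <- tempfile(fileext = ".bin")

# 30 simulated writes, 10 per second of (fake) wall-clock time.
ticks <- 0:29
hashes <- character(length(ticks))
for (i in seq_along(ticks)) {
  write_stamped(path, payload, ticks[i] %/% 10)
  hashes[i] <- unname(tools::md5sum(path))
  unlink(path)
}

runs <- rle(hashes)
length(runs$values)  # number of distinct hashes
runs$lengths         # how many consecutive writes share each hash
```

Each distinct hash corresponds to one simulated second, and the hashes never interleave, which matches the shape of the rle() output above.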
I'm still going to look into this, but would it be sufficient to do this?
digest::digest(h5ls('a.h5'))
That gives me consistent results:
> table(collected)
collected
96936dcd6632a25f1d39a581c5e02683
1000
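One caveat with hashing only the h5ls() listing is that it describes the file's structure (names, types, dimensions), so two files with the same layout but different values would presumably hash the same. A rough sketch that also folds in the dataset values is below. content_digest() is a hypothetical helper, not part of rhdf5; it assumes the rhdf5 and digest packages are installed, and it ignores attributes.

```r
library(rhdf5)

# Hypothetical helper: hash the listing plus the values of every dataset,
# rather than the raw bytes of the file.
content_digest <- function(path) {
  listing <- h5ls(path)
  is_data <- listing$otype == "H5I_DATASET"
  # Build full dataset paths, e.g. "/" + "X" -> "/X".
  full <- paste(listing$group[is_data], listing$name[is_data], sep = "/")
  full <- sub("^//", "/", full)
  values <- lapply(full, function(nm) h5read(path, nm))
  names(values) <- full
  digest::digest(list(listing = listing, values = values))
}

f1 <- tempfile(fileext = ".h5")
f2 <- tempfile(fileext = ".h5")
f3 <- tempfile(fileext = ".h5")
h5write(as.numeric(1:10), f1, "X")
h5write(as.numeric(1:10), f2, "X")   # same content as f1
h5write(as.numeric(10:1), f3, "X")   # same layout, different values

same_content <- identical(content_digest(f1), content_digest(f2))
changed_content <- !identical(content_digest(f1), content_digest(f3))
```

This reads every dataset into memory, so it is only practical for small files.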
Definitely timestamp related. Here's a version where I turn off recording the last time the dataset was touched:
library(rhdf5)

# Write X to a fresh a.h5 with object time tracking disabled.
h5write2 <- function() {
    h5File <- "a.h5"
    if (file.exists(h5File)) file.remove(h5File)
    fid <- H5Fcreate(name = h5File)
    # Dataset creation property list with timestamps turned off.
    dcpl <- H5Pcreate(type = 'H5P_DATASET_CREATE')
    rhdf5:::H5Pset_obj_track_times(dcpl, FALSE)
    sid <- H5Screate_simple(dims = 10)
    did <- H5Dcreate(fid, name = "X", dtype_id = "H5T_NATIVE_DOUBLE", h5space = sid, dcpl = dcpl)
    H5Dwrite(did, X)
    # Close handles in reverse order of creation.
    H5Dclose(did)
    H5Sclose(sid)
    H5Pclose(dcpl)
    H5Fclose(fid)
}

set.seed(100)
X <- runif(20)
collected <- character(1000)  # one file hash per iteration
for (i in 1:1000) {
    h5write2()
    collected[i] <- digest::digest(file="a.h5")
    unlink("a.h5")
}
> table(collected)
collected
a6690eaca69a9e9c36d6dbc7adcd972f
1000
I'm not sure exactly what to do with the knowledge, but at least this isolates the cause.
Can't we just always turn off the time stamp? Or at least provide a high-level option to do so, rather than requiring users to reach deep inside rhdf5:::H5Pset_obj_track_times.
I wanted to think about whether changing the default was sensible vs adding arguments. I prefer to keep the list of arguments short if I can, but I'm also in favour of staying close to the default HDF5 settings.
Given that rhdf5 already makes quite a few choices for the user if they're employing h5write(), it doesn't seem too bad to turn off the timestamps. In fact, there's actually no way of accessing them via rhdf5 at the moment, so there's a reasonable chance they aren't used by anyone ever.
So I've made it do this automatically in version 2.31.8 and you should now get:
set.seed(100)
X <- runif(20)
collected <- character(1000)  # one file hash per iteration
for (i in 1:1000) {
    h5write(file="a.h5", name="X", X)
    collected[i] <- digest::digest(file="a.h5")
    unlink("a.h5")
}
> table(collected)
collected
234ab25049fd0aeea3cc4027998d6770
1000
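Since the behaviour depends on the rhdf5 version, code that relies on stable file hashes may want to check for it. A small sketch follows; writes_without_timestamps() is a hypothetical helper name, and the only fact it encodes is the 2.31.8 version number mentioned above.

```r
# Hypothetical helper: TRUE when the given rhdf5 version string is at or
# after 2.31.8, the release where h5write() stops recording timestamps.
writes_without_timestamps <- function(version) {
  package_version(version) >= "2.31.8"
}

# The version listed in the session information in this thread.
old_ok <- writes_without_timestamps("2.30.1")
new_ok <- writes_without_timestamps("2.31.8")
```

With rhdf5 installed, the same check can be run against the local copy using writes_without_timestamps(as.character(utils::packageVersion("rhdf5"))).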
This is twilight zone stuff:
Session information
```
R version 3.6.1 Patched (2019-10-31 r77366)
Platform: x86_64-apple-darwin17.7.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /Users/luna/Software/R/R-3-6-branch/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/R-3-6-branch/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] rhdf5_2.30.1

loaded via a namespace (and not attached):
[1] compiler_3.6.1 digest_0.6.23  Rhdf5lib_1.8.0
```
Happens on my Linux cluster as well. The question is: is rhdf5 doing this? Or is it a problem with the underlying HDF5 library? The actual content of the file seems to be the same across runs, but this fluctuation defeats the purpose of having these checksums.
Perhaps h5py/h5py#225 could be helpful.