saalfeldlab / n5-hdf5

Best effort N5 implementation on HDF5 files (despite the irony).
BSD 2-Clause "Simplified" License

Datasets are created with scaleoffset filter #4

Open hanslovsky opened 5 years ago

hanslovsky commented 5 years ago

Create a dataset like this (Kotlin):

import net.imglib2.img.array.ArrayImgs
import org.janelia.saalfeldlab.n5.GzipCompression
import org.janelia.saalfeldlab.n5.hdf5.N5HDF5Writer
import org.janelia.saalfeldlab.n5.imglib2.N5Utils
import kotlin.random.Random

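fun main(args: Array<String>) {
    val filename = "/home/hanslovskyp/local/tmp/some-file.h5"
    // 10 x 20 x 30 image of unsigned 64-bit integers
    val rai = ArrayImgs.unsignedLongs(10, 20, 30)
    // Fixed seed so the random content is reproducible
    val rng = Random(100L)
    rai.forEach { it.set(rng.nextLong()) }
    // Write as "dataset" with block size 3 x 4 x 7; only gzip compression is requested
    N5Utils.save(rai, N5HDF5Writer(filename), "dataset", intArrayOf(3,4,7), GzipCompression())
}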

Look at dataset info using h5dump:

$ h5dump -H -p -d "dataset" /home/hanslovskyp/local/tmp/some-file.h5
HDF5 "/home/hanslovskyp/local/tmp/some-file.h5" {
DATASET "dataset" {
   DATATYPE  H5T_STD_U64LE
   DATASPACE  SIMPLE { ( 30, 20, 10 ) / ( H5S_UNLIMITED, H5S_UNLIMITED, H5S_UNLIMITED ) }
   STORAGE_LAYOUT {
      CHUNKED ( 7, 4, 3 )
      SIZE 52306 (0.918:1 COMPRESSION)
   }
   FILTERS {
      COMPRESSION SCALEOFFSET { MIN BITS 2 }
      COMPRESSION DEFLATE { LEVEL 6 }
   }
   FILLVALUE {
      FILL_TIME H5D_FILL_TIME_ALLOC
      VALUE  H5D_FILL_VALUE_DEFAULT
   }
   ALLOCATION_TIME {
      H5D_ALLOC_TIME_INCR
   }
}
}

Scale-offset is a lossy compression filter, and the HDF5 library seems to have a memory leak when the scale-offset parameter is set (see also https://github.com/h5py/h5py/issues/984).

hanslovsky commented 5 years ago

Could be a bug upstream, either in the Java bindings or in the HDF5 library itself.

axtimwalde commented 5 years ago

Citing http://svnsis.ethz.ch/doc/hdf5/current/ch/systemsx/cisd/hdf5/HDF5IntStorageFeatures.html#INT_AUTO_SCALING

Note that this compression is lossless if scalingFactor >= ceil(log2(max(values) - min(values) + 1)). This is made sure when using INT_AUTO_SCALING, thus INT_AUTO_SCALING is always lossless.

Nevertheless, I changed the code to use INT_AUTO_SCALING_UNSIGNED for unsigned types, which sounds better, although I do not understand the difference: https://github.com/saalfeldlab/n5-hdf5/commit/a7b3735da5bda559e33982e8e2c3309b44a1250f
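
For reference, the lossless condition cited above, ceil(log2(max(values) - min(values) + 1)), can be evaluated by hand. The following is only a minimal sketch of that arithmetic, not code from n5-hdf5 or JHDF5; the function name minBitsLossless is made up for illustration. It uses BigInteger because max - min + 1 can overflow a 64-bit long for unsigned 64-bit data.

```kotlin
import java.math.BigInteger

// Hypothetical helper (not part of n5-hdf5 or JHDF5): the smallest number of
// bits for which storing every value as an offset from min loses nothing,
// i.e. ceil(log2(max - min + 1)).
fun minBitsLossless(min: BigInteger, max: BigInteger): Int {
    require(max >= min) { "max must not be smaller than min" }
    // For n = max - min + 1 >= 1, ceil(log2(n)) equals the bit length of n - 1.
    return (max - min).bitLength()
}

fun main() {
    // 4 distinct values (0..3) fit in 2 bits.
    println(minBitsLossless(BigInteger.ZERO, BigInteger.valueOf(3)))           // 2
    // 256 distinct values (100..355) fit in 8 bits.
    println(minBitsLossless(BigInteger.valueOf(100), BigInteger.valueOf(355))) // 8
    // Values spanning the full unsigned 64-bit range need all 64 bits.
    val maxU64 = BigInteger.ONE.shiftLeft(64) - BigInteger.ONE
    println(minBitsLossless(BigInteger.ZERO, maxU64))                          // 64
}
```

If the data covers (nearly) the full unsigned 64-bit range, as uniformly random longs like those in the example at the top are likely to do, the condition requires 64 bits, so a lossless scale-offset stage would not be expected to save any space by itself.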