Closed FelipeMoser closed 3 months ago
Thanks for mentioning this, @FelipeMoser.
When you compare file sizes, are you comparing the actual number of bytes in each chunk file (e.g. test.zarr/0/0/0/0/0/0/0), or just the overall size of the whole output directory? The difference between level=1 (or clevel=1 for blosc) and level=9 (or clevel=9) can be pretty subtle and is highly data-dependent, so it may not be visible unless you're comparing the size in bytes of each file (and/or the checksum).
For example, with bioformats2raw 0.9.3 and an artificial 512x512 image, convert the same input data with 4 different compression settings:
$ bin/bioformats2raw test.fake test-level1-zlib.zarr --compression zlib --compression-properties "level=1"
$ bin/bioformats2raw test.fake test-level9-zlib.zarr --compression zlib --compression-properties "level=9"
$ bin/bioformats2raw test.fake test-clevel1-blosc.zarr --compression-properties "clevel=1"
$ bin/bioformats2raw test.fake test-clevel9-blosc.zarr --compression-properties "clevel=9"
List the size in bytes of every file in both sets of zlib output, followed by the summary size of both outputs:
$ find test-*zlib.zarr -type f -exec ls -lgG '{}' \;
-rw-rw-r-- 1 33 Aug 6 13:11 test-level1-zlib.zarr/.zattrs
-rw-rw-r-- 1 591 Aug 6 13:11 test-level1-zlib.zarr/OME/METADATA.ome.xml
-rw-rw-r-- 1 24 Aug 6 13:11 test-level1-zlib.zarr/OME/.zattrs
-rw-rw-r-- 1 23 Aug 6 13:11 test-level1-zlib.zarr/OME/.zgroup
-rw-rw-r-- 1 23 Aug 6 13:11 test-level1-zlib.zarr/.zgroup
-rw-rw-r-- 1 1210 Aug 6 13:11 test-level1-zlib.zarr/0/.zattrs
-rw-rw-r-- 1 23 Aug 6 13:11 test-level1-zlib.zarr/0/.zgroup
-rw-rw-r-- 1 260 Aug 6 13:11 test-level1-zlib.zarr/0/0/.zarray
-rw-rw-r-- 1 2461 Aug 6 13:11 test-level1-zlib.zarr/0/0/0/0/0/0/0
-rw-rw-r-- 1 260 Aug 6 13:11 test-level1-zlib.zarr/0/1/.zarray
-rw-rw-r-- 1 838 Aug 6 13:11 test-level1-zlib.zarr/0/1/0/0/0/0/0
-rw-rw-r-- 1 33 Aug 6 13:12 test-level9-zlib.zarr/.zattrs
-rw-rw-r-- 1 591 Aug 6 13:12 test-level9-zlib.zarr/OME/METADATA.ome.xml
-rw-rw-r-- 1 24 Aug 6 13:12 test-level9-zlib.zarr/OME/.zattrs
-rw-rw-r-- 1 23 Aug 6 13:12 test-level9-zlib.zarr/OME/.zgroup
-rw-rw-r-- 1 23 Aug 6 13:12 test-level9-zlib.zarr/.zgroup
-rw-rw-r-- 1 1210 Aug 6 13:12 test-level9-zlib.zarr/0/.zattrs
-rw-rw-r-- 1 23 Aug 6 13:12 test-level9-zlib.zarr/0/.zgroup
-rw-rw-r-- 1 260 Aug 6 13:12 test-level9-zlib.zarr/0/0/.zarray
-rw-rw-r-- 1 1368 Aug 6 13:12 test-level9-zlib.zarr/0/0/0/0/0/0/0
-rw-rw-r-- 1 260 Aug 6 13:12 test-level9-zlib.zarr/0/1/.zarray
-rw-rw-r-- 1 436 Aug 6 13:12 test-level9-zlib.zarr/0/1/0/0/0/0/0
$ du -hs test*zlib.zarr
96K test-level1-zlib.zarr
96K test-level9-zlib.zarr
List the size in bytes of every file in both sets of blosc output, followed by the summary size of both outputs:
$ find test-*blosc.zarr -type f -exec ls -lgG '{}' \;
-rw-rw-r-- 1 33 Aug 6 13:18 test-clevel1-blosc.zarr/.zattrs
-rw-rw-r-- 1 591 Aug 6 13:18 test-clevel1-blosc.zarr/OME/METADATA.ome.xml
-rw-rw-r-- 1 24 Aug 6 13:18 test-clevel1-blosc.zarr/OME/.zattrs
-rw-rw-r-- 1 23 Aug 6 13:18 test-clevel1-blosc.zarr/OME/.zgroup
-rw-rw-r-- 1 23 Aug 6 13:18 test-clevel1-blosc.zarr/.zgroup
-rw-rw-r-- 1 1210 Aug 6 13:18 test-clevel1-blosc.zarr/0/.zattrs
-rw-rw-r-- 1 23 Aug 6 13:18 test-clevel1-blosc.zarr/0/.zgroup
-rw-rw-r-- 1 323 Aug 6 13:18 test-clevel1-blosc.zarr/0/0/.zarray
-rw-rw-r-- 1 2364 Aug 6 13:18 test-clevel1-blosc.zarr/0/0/0/0/0/0/0
-rw-rw-r-- 1 323 Aug 6 13:18 test-clevel1-blosc.zarr/0/1/.zarray
-rw-rw-r-- 1 430 Aug 6 13:18 test-clevel1-blosc.zarr/0/1/0/0/0/0/0
-rw-rw-r-- 1 33 Aug 6 13:19 test-clevel9-blosc.zarr/.zattrs
-rw-rw-r-- 1 591 Aug 6 13:19 test-clevel9-blosc.zarr/OME/METADATA.ome.xml
-rw-rw-r-- 1 24 Aug 6 13:19 test-clevel9-blosc.zarr/OME/.zattrs
-rw-rw-r-- 1 23 Aug 6 13:19 test-clevel9-blosc.zarr/OME/.zgroup
-rw-rw-r-- 1 23 Aug 6 13:19 test-clevel9-blosc.zarr/.zgroup
-rw-rw-r-- 1 1210 Aug 6 13:19 test-clevel9-blosc.zarr/0/.zattrs
-rw-rw-r-- 1 23 Aug 6 13:19 test-clevel9-blosc.zarr/0/.zgroup
-rw-rw-r-- 1 323 Aug 6 13:19 test-clevel9-blosc.zarr/0/0/.zarray
-rw-rw-r-- 1 2150 Aug 6 13:19 test-clevel9-blosc.zarr/0/0/0/0/0/0/0
-rw-rw-r-- 1 323 Aug 6 13:19 test-clevel9-blosc.zarr/0/1/.zarray
-rw-rw-r-- 1 434 Aug 6 13:19 test-clevel9-blosc.zarr/0/1/0/0/0/0/0
$ du -hs test-*blosc.zarr
96K test-clevel1-blosc.zarr
96K test-clevel9-blosc.zarr
In particular, note that test-level1-zlib.zarr/0/0/0/0/0/0/0 is 1093 bytes larger than test-level9-zlib.zarr/0/0/0/0/0/0/0, and test-clevel1-blosc.zarr/0/0/0/0/0/0/0 is 214 bytes larger than test-clevel9-blosc.zarr/0/0/0/0/0/0/0.
You should be able to run that same test to verify, as test.fake is not a file on disk, but a shorthand for generating gradient test data (see https://bio-formats.readthedocs.io/en/stable/developers/generating-test-images.html).
Note too that bioformats2raw does not itself define what level and clevel mean; these values are simply passed through to the underlying zlib and blosc compressors.
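The per-file comparison described above can also be scripted. The following is a minimal sketch, not part of bioformats2raw: the function names file_table and compare_outputs are made up for illustration, and it assumes both output directories contain the same relative file layout.

```python
import hashlib
from pathlib import Path


def file_table(root):
    """Map each file's path (relative to root) to (size in bytes, SHA-256 digest)."""
    root = Path(root)
    table = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            data = path.read_bytes()
            rel = path.relative_to(root).as_posix()
            table[rel] = (len(data), hashlib.sha256(data).hexdigest())
    return table


def compare_outputs(root_a, root_b):
    """Return (relative path, size in a, size in b) for files whose bytes differ.

    Only files present in both directories are compared.
    """
    a, b = file_table(root_a), file_table(root_b)
    differing = []
    for rel in sorted(a.keys() & b.keys()):
        if a[rel][1] != b[rel][1]:  # compare checksums, not just sizes
            differing.append((rel, a[rel][0], b[rel][0]))
    return differing
```

Calling compare_outputs("test-level1-zlib.zarr", "test-level9-zlib.zarr") should then list the chunk files whose bytes differ, along with any metadata file whose content differs even when its size is unchanged.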
Thanks for the quick reply, @melissalinkert. I followed your advice and checked the actual byte sizes, and you are correct: there is a difference in the file sizes between level 1 and level 9. So --compression-properties is definitely having an effect.
However, in the example you showed, the overall size of the .zarr directories is roughly the same mainly because the image is so small relative to the rest of the files. There is still a substantial difference in the sizes of the chunk files (1368/2461=0.56, 2150/2364=0.91).
In my case, I'm working with images that are 5-10GB, so I expected the difference to be large, since the metadata files should have no significant impact on the final size. For example, for a microscopy image of size [ 1, 3, 1, 19600, 25708 ]:
1) Chunk size: [ 1, 1, 1, 512, 512 ]
$ du -d 1 ./Images_zarr
3079164 ./Images_zarr/zlib_lvl0.zarr
2023664 ./Images_zarr/zlib_lvl1.zarr
2011992 ./Images_zarr/zlib_lvl9.zarr
3079168 ./Images_zarr/blosc_lvl0.zarr
2690236 ./Images_zarr/blosc_lvl1.zarr
2679988 ./Images_zarr/blosc_lvl9.zarr
3055300 ./Images_zarr/null.zarr
2) Chunk size: [ 1, 1, 1, 1024, 1024 ]
$ du -d 1 ./Images_zarr
3201212 ./Images_zarr/zlib_lvl0.zarr
2031184 ./Images_zarr/zlib_lvl1.zarr
2021736 ./Images_zarr/zlib_lvl9.zarr
3201212 ./Images_zarr/blosc_lvl0.zarr
2677420 ./Images_zarr/blosc_lvl1.zarr
2668396 ./Images_zarr/blosc_lvl9.zarr
3194972 ./Images_zarr/null.zarr
3) Chunk size: [ 1, 1, 1, 5120, 5120 ]
$ du -d 1 ./Images_zarr
3687024 ./Images_zarr/zlib_lvl0.zarr
2063132 ./Images_zarr/zlib_lvl1.zarr
2061668 ./Images_zarr/zlib_lvl9.zarr
3686736 ./Images_zarr/blosc_lvl0.zarr
2676840 ./Images_zarr/blosc_lvl1.zarr
2670548 ./Images_zarr/blosc_lvl9.zarr
3686448 ./Images_zarr/null.zarr
As you can see, while there is a significant difference between lvl0 and lvl1, the differences between lvl1 and lvl9 are very small. For example, the relative difference between zlib_lvl1 and zlib_lvl9 is 0.58%, 0.47%, and 0.07% for chunk sizes 512, 1024, and 5120, respectively.
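For reference, the percentages above can be reproduced from the du listings with a few lines of Python. This is only a worked version of the arithmetic; the helper name relative_saving is made up for illustration, and the numbers are copied from the listings in du's block units.

```python
# Directory sizes as reported by `du -d 1`, keyed by chunk edge length.
sizes = {
    512: {"zlib_lvl1": 2023664, "zlib_lvl9": 2011992,
          "blosc_lvl1": 2690236, "blosc_lvl9": 2679988},
    1024: {"zlib_lvl1": 2031184, "zlib_lvl9": 2021736,
           "blosc_lvl1": 2677420, "blosc_lvl9": 2668396},
    5120: {"zlib_lvl1": 2063132, "zlib_lvl9": 2061668,
           "blosc_lvl1": 2676840, "blosc_lvl9": 2670548},
}


def relative_saving(lvl1, lvl9):
    """Percentage by which the level-9 output is smaller than the level-1 output."""
    return 100.0 * (lvl1 - lvl9) / lvl1


for chunk, s in sizes.items():
    zlib_pct = relative_saving(s["zlib_lvl1"], s["zlib_lvl9"])
    blosc_pct = relative_saving(s["blosc_lvl1"], s["blosc_lvl9"])
    print(f"chunk {chunk}: zlib {zlib_pct:.2f}%, blosc {blosc_pct:.2f}%")
```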
Is it normal for the difference between the minimal and maximal compression levels to be so small for such large images? Is there something I could be missing?
It's unfortunately pretty much impossible to make any general statement about the size of compressed data for different zlib and blosc levels. The concept of a level in these compression types is not an indicator of how small the compressed output will be; it's an indicator of how much effort the compressor puts into reducing the output size. As such, the result is completely data-dependent, and the actual percentage reduction in size will vary widely.
In particular, zlib level and blosc clevel should not be thought of in terms of image quality or compression ratio.
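As a small illustration of that data dependence, the sketch below compresses three synthetic inputs with Python's built-in zlib module at several levels. The inputs and the helper name sizes_by_level are made up for this example and are not meant to resemble real microscopy data.

```python
import os
import zlib


def sizes_by_level(data, levels=(0, 1, 6, 9)):
    """Return {level: compressed size in bytes} for the given zlib levels."""
    return {level: len(zlib.compress(data, level)) for level in levels}


# Three hypothetical 256 KiB inputs with very different compressibility.
n = 256 * 1024
samples = {
    "constant": bytes(n),                          # all zero bytes
    "gradient": bytes(i % 256 for i in range(n)),  # repeating 0..255 ramp
    "random": os.urandom(n),                       # essentially incompressible
}

for name, data in samples.items():
    print(name, sizes_by_level(data))
```

How far apart the level 1 and level 9 sizes end up depends on the input: random bytes stay at roughly their original size at every level, while highly regular data shrinks dramatically even at level 1.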
You might try converting your test data without compression, and then, independently of bioformats2raw, experiment with different compression options on the uncompressed chunk files. That would allow you to confirm that bioformats2raw is not directly causing poor compression, and would allow you to experiment with a wider variety of parameters more quickly; the chosen parameters can then be fed back to bioformats2raw for subsequent conversions. https://github.com/Blosc/bloscpack and/or https://github.com/madler/zlib may be places to start, and the following may also be helpful reading:
Since we've confirmed that --compression-properties is working as intended with level/clevel, and bioformats2raw does not itself implement zlib or blosc, I am closing this issue for now. Feel free to add the results of any investigation though, or re-open if you find that properties other than level/clevel can't be specified correctly with --compression-properties.
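The offline experiment suggested above (compressing the uncompressed chunk files outside bioformats2raw) could be sketched as follows. This is a rough outline under stated assumptions: it uses only Python's built-in zlib module, the names chunk_files and sweep_zlib_levels are made up for illustration, and it assumes every non-dot file under the array directory is a chunk. Blosc is left out because it needs a third-party package.

```python
import time
import zlib
from pathlib import Path


def chunk_files(array_dir):
    """Yield chunk files under a Zarr array directory, skipping dot-files such as .zarray."""
    for path in sorted(Path(array_dir).rglob("*")):
        if path.is_file() and not path.name.startswith("."):
            yield path


def sweep_zlib_levels(array_dir, levels=range(0, 10)):
    """Return {level: (total compressed bytes, seconds spent compressing)}."""
    levels = list(levels)
    totals = {level: 0 for level in levels}
    seconds = {level: 0.0 for level in levels}
    for path in chunk_files(array_dir):
        data = path.read_bytes()  # each chunk is read once and reused for every level
        for level in levels:
            start = time.perf_counter()
            totals[level] += len(zlib.compress(data, level))
            seconds[level] += time.perf_counter() - start
    return {level: (totals[level], seconds[level]) for level in levels}
```

Running sweep_zlib_levels on one resolution directory of an uncompressed conversion (a hypothetical path such as uncompressed.zarr/0/0) gives both the total size and the time per level, which makes the effort versus size trade-off visible for the actual data.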
Hi, I've been playing around with bioformats2raw for a bit and I wanted to compare the different compression options available. I have some ome.tiffs that I'm converting to zarr.
However, I've noticed that the "--compression-properties" argument seems to have no effect on the compression itself outside of the values stored in the .zarray file.
For example, if I set --compression=zlib and --compression-properties="level=1", the file size is exactly the same as what I get if I set --compression-properties="level=9". Similarly, using the default compression with --compression-properties="clevel=1" or --compression-properties="clevel=9" results in exactly the same file size. There's also no difference in computation time. In both cases, however, the .zarr file does change accordingly.
I'm using bioformats2raw version 0.9.1
[edit] Adding to clarify, if I use zlib, the resulting file size is different than what I get if I use default (blosc). So the --compression argument does seem to work. The issue is only related to "--compression-properties".
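One way to check which settings were actually recorded is to read the "compressor" entry from each .zarray file, which is where Zarr v2 array metadata stores it. The sketch below uses only the Python standard library; the function name compressor_settings is made up for illustration.

```python
import json
from pathlib import Path


def compressor_settings(zarr_root):
    """Map each array's path (relative to zarr_root) to its "compressor" metadata."""
    root = Path(zarr_root)
    settings = {}
    for meta in sorted(root.rglob(".zarray")):
        with meta.open() as f:
            array_path = meta.parent.relative_to(root).as_posix()
            # The value is None when the array is stored without a compressor.
            settings[array_path] = json.load(f).get("compressor")
    return settings
```

Comparing the result for two conversions shows whether the level or clevel value reached the metadata; it does not by itself show that the chunk bytes differ.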