glencoesoftware / bioformats2raw

Bio-Formats image file format to raw format converter
GNU General Public License v2.0

Reduce total number of files when converting to OME-zarr? #116

Open davidekhub opened 3 years ago

davidekhub commented 3 years ago

I am converting SVS files (~500+MB each) to OME-zarr. My command line looks like this: ${params.cmd} --max_workers=${params.max_workers} --compression=zlib --compression-properties level=9 --resolutions 6 ${slide_file} ${slide_file}.zarr

I end up with buckets filled with OME-Zarr data containing thousands of files, many extremely small (~2 KB) and the largest around 1 MB. These are really too small for object storage (many small operations take a long time), so I'd like my files to average around 64 MB. The documentation doesn't say which flags affect this, so I'm curious: is it --tile_height and --tile_width?
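For reference, a quick way to summarize the chunk files a conversion produced; the output path below is a placeholder for whatever was passed on the command line above.

```python
# Summarize the number and size of files inside a converted .zarr directory.
from pathlib import Path

out = Path("slide.svs.zarr")  # placeholder for the converter's output path
sizes = sorted(p.stat().st_size for p in out.rglob("*") if p.is_file())

print(f"files:   {len(sizes)}")
print(f"total:   {sum(sizes) / 1e6:.1f} MB")
print(f"median:  {sizes[len(sizes) // 2] / 1e3:.1f} KB")
print(f"largest: {sizes[-1] / 1e6:.2f} MB")
```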

chris-allan commented 3 years ago

Firstly, yes: --tile_width and --tile_height, as well as --tile_depth, control the Zarr chunk [1] size. For the two most common Zarr storage [2] implementations (file system and object storage), each chunk is a single file (or object) whose filename (or key) is the chunk's index within the array, joined by the dimension separator [3]. There is a more visual description of this available here:
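Separately, to make the chunk-to-file mapping concrete, here is a minimal sketch using the zarr-python 2.x API (bioformats2raw itself is Java, so this is purely illustrative):

```python
# Each Zarr chunk becomes one file whose name is the chunk's index along each
# dimension, joined by the dimension separator ("." by default in Zarr v2).
import numpy as np
import zarr

store = zarr.DirectoryStore("example.zarr")
z = zarr.zeros((4096, 4096), chunks=(1024, 1024), dtype="uint16",
               store=store, overwrite=True)
z[:] = np.random.randint(0, 65535, size=z.shape, dtype="uint16")

# 4096 / 1024 = 4 chunks per axis, so 16 chunk files named "0.0" through "3.3"
print(sorted(k for k in store.keys() if not k.startswith(".")))
```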

However, multiple chunks are not colocated in the same file. So when compression is employed, as it is in the example you gave, even a 5792x5792x2 (width, height, bytes per pixel; ~64MiB) chunk may compress very well, perhaps because it is full of zeros or completely white, and consequently could easily end up at 1KiB or smaller. Chunk colocation within the same file (also sometimes referred to in the community as sharding) is being discussed [4], but I am not aware of any current Zarr implementation. TileDB [5] addresses some of these concerns with a journaled approach, but that is not without its own downsides, such as reconciliation. The Zarr layout is simple by design, and adding complexity to the chunk format will require significant discussion and strong community backing. You can read more about the design decisions and perspectives of simple (Zarr) vs. complex layouts (TileDB, for example) on this issue if you so desire:
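As a quick aside, here is a small sketch of that effect, assuming uint16 data and zlib level 9 as in the command above (synthetic arrays standing in for real slide content):

```python
# A nominally ~64 MiB chunk (5792 x 5792 x uint16) still produces a tiny file
# when its contents compress well, e.g. a blank or uniformly white region.
import zlib
import numpy as np

shape = (5792, 5792)
blank = np.full(shape, 65535, dtype="uint16")               # all white
noisy = np.random.randint(0, 65535, shape, dtype="uint16")  # worst case

print(len(zlib.compress(blank.tobytes(), 9)))  # orders of magnitude smaller
print(len(zlib.compress(noisy.tobytes(), 9)))  # close to the raw ~64 MiB
```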

There is also a fairly detailed discussion around the precomputed, sharded, chunk-based format that Neuroglancer uses available here:

- https://github.com/google/neuroglancer/blob/master/src/neuroglancer/datasource/precomputed/sharded.md

A sharded format would, however, not necessarily relieve the high volume of small write operations that you are noticing when, I assume, writing directly to S3, as the unit of work for bioformats2raw is a chunk. Latency per write is going to be very similar, and the same number of writes still needs to take place regardless of whether they are happening to one sharded object or many unsharded chunks. Obviously you could approach this by buffering colocated chunks locally first and transferring the shard only when all of its chunks are processed. This is just one of a plethora of optimizations one might consider; however, each comes with substantial implementation and maintenance burden, as well as the potential for deep coupling of bioformats2raw to a particular storage subsystem design.
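A rough sketch of the "buffer locally, then transfer" idea, assuming boto3 and placeholder bucket/prefix names; bioformats2raw itself does not do this, it is just post-processing around it:

```python
# Hypothetical post-processing step: let bioformats2raw write chunks to local
# disk first, then upload the finished directory to S3 in one pass. This does
# not reduce the object count, but it decouples conversion from S3 latency.
from pathlib import Path
import boto3

s3 = boto3.client("s3")
local_root = Path("slide.svs.zarr")                    # local converter output
bucket, prefix = "my-bucket", "slides/slide.svs.zarr"  # placeholder names

for path in local_root.rglob("*"):
    if path.is_file():
        key = f"{prefix}/{path.relative_to(local_root).as_posix()}"
        s3.upload_file(str(path), bucket, key)
```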

Furthermore, I would strongly caution against going beyond 1024 chunking in the Y and X dimensions in the pursuit of better write performance and a smaller number of larger chunks. This may improve write performance but will substantially impact read performance and first byte latency for streaming viewers. Projects such as the aforementioned Neuroglancer or webKnossos go as far as having tiny 3D chunk sizes (32^3) to combat this. The source data in your example (the .svs file) will also be chunked (tiled in TIFF parlance) and compressed. Selecting output chunk sizes that are not aligned with those source tiles can result in substantial read slowdowns, as the source data has to be rechunked and repeatedly decompressed in order to conform to the desired output chunk size.
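As a back-of-the-envelope illustration of the alignment point, assuming 256x256 source tiles (a common but not universal SVS tile size):

```python
# How many source TIFF tiles must be decoded to assemble one output Zarr chunk,
# and how that grows when the chunk does not start on a tile boundary.
import math

def tiles_per_chunk(chunk, tile=256):
    on_boundary = math.ceil(chunk / tile) ** 2       # chunk starts on a tile edge
    worst_case = (math.ceil(chunk / tile) + 1) ** 2  # chunk starts mid-tile
    return on_boundary, worst_case

for chunk in (1024, 5792):
    best, worst = tiles_per_chunk(chunk)
    print(f"{chunk}x{chunk} chunk: {best} tiles if aligned, up to {worst} if not")
```

Tiles that straddle chunk boundaries are decoded again for each neighbouring chunk, which is where the repeated decompression cost comes from.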

In short, the behavior you are seeing is expected and I don't think a 64MiB object size is either practical or reasonably achievable at present.

Hope this helps.

  1. https://zarr.readthedocs.io/en/stable/spec/v2.html#chunks
  2. https://zarr.readthedocs.io/en/stable/spec/v2.html#storage
  3. https://zarr.readthedocs.io/en/stable/spec/v2.html#arrays
  4. https://forum.image.sc/t/sharding-support-in-ome-zarr/55409
  5. https://docs.tiledb.com/main/solutions/tiledb-embedded/internal-mechanics/architecture
davidekhub commented 3 years ago

OK. Thanks for sharing such a detailed perspective. It is very different from my understanding:

> Furthermore, I would strongly caution against going beyond 1024 chunking in the Y and X dimensions in the pursuit of better write performance and a smaller number of larger chunks. This may improve write performance but will substantially impact read performance and first byte latency for streaming viewers.

The other practical problems I'm seeing are:

1. I get a 10X file size increase going from an SVS to an OME-Zarr using zlib level 9 (and blosc seems worse?).
2. Working with 10K+ objects means even deleting a single dataset from S3 takes minutes instead of seconds.

Is the size expansion roughly in line with what you see? Are other users complaining about having to deal with 10K+ objects for a single image dataset?
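On the deletion point, a minimal sketch assuming boto3 and placeholder bucket/prefix names; S3's DeleteObjects call accepts up to 1000 keys per request, so a 10K-object dataset needs around ten calls rather than 10K single deletes:

```python
# Batch-delete every object under a Zarr prefix, up to 1000 keys per request.
import boto3

s3 = boto3.client("s3")
bucket, prefix = "my-bucket", "slides/slide.svs.zarr/"  # placeholder names

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
    if keys:
        s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})
```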

NHPatterson commented 2 years ago

@davidekhub If the SVS you are converting is 24-bit RGB, it is likely stored with lossy compression (JPEG, JPEG 2000), and that is the reason for the difference in file size. zlib and blosc are lossless compression algorithms, so they will never achieve the same compression ratios (although the pixel values will be exactly the same between the SVS and the Zarr data). There may be a way to encode with something like JPEG using bioformats2raw, but bear in mind that compression errors accumulate each time data is re-encoded lossily. The large number of objects is ideal for some scenarios, like web visualization and very fast conversion, but has downsides that need to be weighed.
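A rough sense of the gap between lossy and lossless encoding, using a synthetic RGB tile and assuming Pillow and NumPy are available (real slide data will differ, but the ordering is typical):

```python
# Compare lossy JPEG vs lossless zlib on the same synthetic RGB tile.
import io
import zlib
import numpy as np
from PIL import Image

rng = np.random.default_rng(0)
yy, xx = np.mgrid[0:1024, 0:1024]
# Smooth gradient plus mild noise, repeated across three channels.
tile = ((xx + yy) % 240 + rng.integers(0, 16, (1024, 1024)))[..., None]
tile = np.repeat(tile, 3, axis=2).astype("uint8")

jpeg_buf = io.BytesIO()
Image.fromarray(tile).save(jpeg_buf, format="JPEG", quality=75)

print("raw bytes:   ", tile.nbytes)
print("JPEG (lossy):", len(jpeg_buf.getvalue()))
print("zlib level 9:", len(zlib.compress(tile.tobytes(), 9)))
```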

davidekhub commented 2 years ago

Hmm. I'm going to have to forward this on to the experts on our side. If we're seeing 10X expansion with max compression because the originals are lossy (which I need to verify), that makes me reconsider the value of OME-Zarr (it's already an operational burden compared to working with the original SVS, and not any faster for visualization in our case).
