hubmapconsortium-graveyard / img2zarr

Python utility for converting images to zarr chunked arrays
MIT License
0 stars 0 forks source link

Some thoughts to get started #1

Open NHPatterson opened 4 years ago

NHPatterson commented 4 years ago

I've been stewing a little about how to best convert images to zarr. It seems the primary focus will be whole slide images (WSIs) and pyramidal formats for efficient serving of large planes of high resolution data with an obvious focus on microscopy but we should be open to other spatial data like astronomy or GIS as low priority. A second focus will be imaging mass spectrometry (IMS) data where the broadband spectral nature adds significantly more considerations in how the data in stored and served, i.e. serving image data vs serving spectral data. We may in fact want to package something like ims2zarr eventually. My reading of the temperature in the IMS community tells me people don't want to use imzML too much longer but it's ubiquity at the vendor level, and analysis software level propagates it.

My 1000 ft view for design is a python class for writing the zarr stores back-ended with different image readers that recognize different image formats or simply pass in a numpy array (with caveats listed below).

Important image metadata

Readers

Writing pyramids

My current thinking on this is to use the above readers to initialize a zarr store with the image's highest resolution, or base layer, and then use dask to process the lower res. layers using da.coarsen. The downsampling should optionally be Gaussian or Laplacian smoothed prior to downsampling to render images that look less pixelated in the lower res. pyramidal layers. dask_image has such filtering already. Each layer should have it's own metadata with the things listed above. For reading and initializing the store with the base layer, I would like to emphasize limited memory footprint so tiled images where we can randomly access smaller parts of the data set are a plus. An important question may be high resolution 3D data like cryo-EM or any reconstructed serial section microscopy but that seems more long term at the moment and there are some specific formats for that already that we may think of converting to zarr.

Ok, where to start...

I see two things that are priority, getting the pyramid writing highly optimized and getting readers for many different types of data.

Maybe we can discuss a little more here and bring in more major topics and then divvy up some smaller issues and actionable items as separate issues or add milestones to the repo, etc.

manzt commented 4 years ago

Thanks for this! I wanted to include this issue in the thread as well: https://github.com/hubmapconsortium/vitessce-image-viewer/issues/91

manzt commented 4 years ago

dimensions and their order. This is something I've encountered frequently in any python image analysis...what are the dimensions and how are they ordered? Is it CXY, YXC, TXYC, etc.

I think we should standardize this for zarr. In an napari, the last two dimensions of the array are y and x on the screen with the other dimension becoming sliders if an array > 2 dims is supplied. In vitessce, it is most efficient to have the data as row-major arrays, with the last two dimensions corresponding to y and x. This allows us to fetch contiguous 512 x 512 chunks from the store and has big performance benefits. Essentially, given any "T" and "C", we can retrieve corresponding tiles directly.

I've also seen conflicting things here with dimensions. I parsed an OME-TIFF using tifffile and the dimension order was ["X", "Y", "Z", "C", "T"], but when I used tf.asarray() the resulting array is in the opposite order. Is this typical?

manzt commented 4 years ago

My current thinking on this is to use the above readers to initialize a zarr store with the image's highest resolution, or base layer, and then use dask to process the lower res. layers using da.coarsen.

This makes a lot of sense. In some experimenting I've done, I've re-initialized the dask array from the recently saved downsampled version, so in the next iteration of the downsampling dask doens't need to recompute the previous sampling, just downsampling from the current layer.

The downsampling should optionally be Gaussian or Laplacian smoothed prior to downsampling to render images that look less pixelated in the lower res. pyramidal layers. dask_image has such filtering already.

How is this different from simply calling np.mean ? I'm not familiar with the different types of smoothing.

An important question may be high resolution 3D data like cryo-EM or any reconstructed serial section microscopy but that seems more long term at the moment and there are some specific formats for that already that we may think of converting to zarr.

In this case I would definitely suggest the final dimensions of the arrays be standardised to z, y, x.

manzt commented 4 years ago

I see two things that are priority, getting the pyramid writing highly optimized and getting readers for many different types of data.

I've played a bit recently with creating tiled images from OME-tiff. On early experimentation, the biggest bottleneck seems to be converting the data to the base zarr image at full resolution. There is some discussion here about a tool called xcube which they are developing a cli for computing data data tiles from zarr. Might be interesting to look into further, but no real documentation for these features from what I can tell.

Right now it is painfully slow to usedask_image.imread.imread to convert a an image to the base layer, but reading the image into memory (if it will fit) with tf.asarray(numworkers=8) is quite fast. Perhaps I need to configure something with dask to make this better, but I really don't know.

NHPatterson commented 4 years ago

Yes, I looked a lot into the logic of TiffFile for reading and essentially it creates an empty numpy array and then reads each tile into that array with the various workers. My thought and what I'd tried a little already was to stream those tiles directly to the zarr store. The TiffFile.asarray() documents have this

        Parameters
        ----------
        out : numpy.ndarray, str, or file-like object
            Buffer where image data will be saved.
            If None (default), a new array will be created.
            If numpy.ndarray, a writable array of compatible dtype and shape.
            If 'memmap', directly memory-map the image data in the TIFF file
            if possible; else create a memory-mapped array in a temporary file.
            If str or open file, the file name or file object used to
            create a memory-map to an array stored in a binary file on disk.

And if you look in the TiffFile.py it also has a create_output function that was important for initializing the target the tiles are streamed to, it may simply be a matter of altering this function to accept a zarr store and then altering the tile_decode function to shape the array in the correct way.

Sorry I just have clues, I wish I could show you a working version of what I wrote but I didn't get terribly far and have a bunch of other balls in the air right now but may find time early next week to revisit and make some decent running code.

manzt commented 4 years ago

No worries! Thanks for the pointer. I'll have a look a look into this as well.