Open NHPatterson opened 4 years ago
Thanks for this! I wanted to include this issue in the thread as well: https://github.com/hubmapconsortium/vitessce-image-viewer/issues/91
dimensions and their order. This is something I've encountered frequently in any python image analysis...what are the dimensions and how are they ordered? Is it CXY, YXC, TXYC, etc.
I think we should standardize this for zarr. In an napari, the last two dimensions of the array are y
and x
on the screen with the other dimension becoming sliders if an array > 2 dims is supplied. In vitessce, it is most efficient to have the data as row-major arrays, with the last two dimensions corresponding to y and x. This allows us to fetch contiguous 512 x 512 chunks from the store and has big performance benefits. Essentially, given any "T" and "C", we can retrieve corresponding tiles directly.
I've also seen conflicting things here with dimensions. I parsed an OME-TIFF using tifffile
and the dimension order was ["X", "Y", "Z", "C", "T"], but when I used tf.asarray()
the resulting array is in the opposite order. Is this typical?
My current thinking on this is to use the above readers to initialize a zarr store with the image's highest resolution, or base layer, and then use dask to process the lower res. layers using da.coarsen.
This makes a lot of sense. In some experimenting I've done, I've re-initialized the dask array from the recently saved downsampled version, so in the next iteration of the downsampling dask doens't need to recompute the previous sampling, just downsampling from the current layer.
The downsampling should optionally be Gaussian or Laplacian smoothed prior to downsampling to render images that look less pixelated in the lower res. pyramidal layers. dask_image has such filtering already.
How is this different from simply calling np.mean
? I'm not familiar with the different types of smoothing.
An important question may be high resolution 3D data like cryo-EM or any reconstructed serial section microscopy but that seems more long term at the moment and there are some specific formats for that already that we may think of converting to zarr.
In this case I would definitely suggest the final dimensions of the arrays be standardised to z
, y
, x
.
I see two things that are priority, getting the pyramid writing highly optimized and getting readers for many different types of data.
I've played a bit recently with creating tiled images from OME-tiff. On early experimentation, the biggest bottleneck seems to be converting the data to the base
zarr image at full resolution. There is some discussion here about a tool called xcube which they are developing a cli for computing data data tiles from zarr. Might be interesting to look into further, but no real documentation for these features from what I can tell.
Right now it is painfully slow to usedask_image.imread.imread
to convert a an image to the base layer, but reading the image into memory (if it will fit) with tf.asarray(numworkers=8)
is quite fast. Perhaps I need to configure something with dask to make this better, but I really don't know.
Yes, I looked a lot into the logic of TiffFile for reading and essentially it creates an empty numpy array and then reads each tile into that array with the various workers. My thought and what I'd tried a little already was to stream those tiles directly to the zarr
store. The TiffFile.asarray()
documents have this
Parameters
----------
out : numpy.ndarray, str, or file-like object
Buffer where image data will be saved.
If None (default), a new array will be created.
If numpy.ndarray, a writable array of compatible dtype and shape.
If 'memmap', directly memory-map the image data in the TIFF file
if possible; else create a memory-mapped array in a temporary file.
If str or open file, the file name or file object used to
create a memory-map to an array stored in a binary file on disk.
And if you look in the TiffFile.py
it also has a create_output
function that was important for initializing the target the tiles are streamed to, it may simply be a matter of altering this function to accept a zarr
store and then altering the tile_decode
function to shape the array in the correct way.
Sorry I just have clues, I wish I could show you a working version of what I wrote but I didn't get terribly far and have a bunch of other balls in the air right now but may find time early next week to revisit and make some decent running code.
No worries! Thanks for the pointer. I'll have a look a look into this as well.
I've been stewing a little about how to best convert images to zarr. It seems the primary focus will be whole slide images (WSIs) and pyramidal formats for efficient serving of large planes of high resolution data with an obvious focus on microscopy but we should be open to other spatial data like astronomy or GIS as low priority. A second focus will be imaging mass spectrometry (IMS) data where the broadband spectral nature adds significantly more considerations in how the data in stored and served, i.e. serving image data vs serving spectral data. We may in fact want to package something like
ims2zarr
eventually. My reading of the temperature in the IMS community tells me people don't want to use imzML too much longer but it's ubiquity at the vendor level, and analysis software level propagates it.My 1000 ft view for design is a python class for writing the
zarr
stores back-ended with different image readers that recognize different image formats or simply pass in anumpy
array (with caveats listed below).Important image metadata
zarr
store or does the format's original ordering persist?Readers
tifffile
: handles most tiffs, can extract tiff metadata, and pyramid levels in already pyramidalized images. Gives access to tiles from tiled tiffs with some "hacks" I've made. This gives us advantages to avoid loading entire image planes into memory, some of which may exceed memory on even high performance machines.czifile
: reads data from zeiss microscopes. Can access the native high resolution tiles, again, it has to be hacked a little.openslide
and it's python bindings: reads many RGB format whole slide images, this has some overlap withtifffile
, but I'm not sure the extent. I'd default totifffile
as it's not a dead project likeopenslide
appears to be.bioformats-python
andjavabridge
: the combination of these python packages gives access to everythingbioformats
can read, which is substantial.ITK
(andSimpleITK
) : gives access to medical image data like MRI, CT, etc. These images have a lot of different considerations.numpy
array. Some people may wish to process their data in python so having this option is important. Here is may be more difficult to define import metadata.Writing pyramids
My current thinking on this is to use the above readers to initialize a
zarr
store with the image's highest resolution, or base layer, and then usedask
to process the lower res. layers usingda.coarsen
. The downsampling should optionally be Gaussian or Laplacian smoothed prior to downsampling to render images that look less pixelated in the lower res. pyramidal layers.dask_image
has such filtering already. Each layer should have it's own metadata with the things listed above. For reading and initializing the store with the base layer, I would like to emphasize limited memory footprint so tiled images where we can randomly access smaller parts of the data set are a plus. An important question may be high resolution 3D data like cryo-EM or any reconstructed serial section microscopy but that seems more long term at the moment and there are some specific formats for that already that we may think of converting tozarr
.Ok, where to start...
I see two things that are priority, getting the pyramid writing highly optimized and getting readers for many different types of data.
Maybe we can discuss a little more here and bring in more major topics and then divvy up some smaller issues and actionable items as separate issues or add milestones to the repo, etc.