Open jakirkham opened 7 years ago
This should be a good use case for datashader+holoviews+xarray+dask. I'd personally like to try it out for a large stack of microscope images scanned from a volume of tissue, particularly for EM images where each individual image is large as well. So far, our raster-image testing has only used geo-located data, which tends to have only a dozen or so co-registered layers, not a deep z-stack. There have been lots of changes over the past few months that should make such usage vastly more practical:
The latest Bokeh dev release (installable with `conda install -c bokeh/label/dev bokeh`) provides binary array transport for the notebook, making it much, much faster to transfer image data from Python into the browser. So there is good reason to think that this should now be a practical way to work with large image stacks. Unfortunately, I haven't personally tested this case, and I suspect there will still be some practical problems when trying to get it set up. @philippjfr, maybe you could sketch the outlines of some HoloViews-based, Datashader-backed code that you think could work for such a case? The basic idea should be trivial; it's making sure that the data doesn't all get read into memory or copied that I'd be worried about.
Yes, it does sound like a good use case, and I agree with the approach you've outlined. Depending on the size of each individual image, I'm not even sure datashader is necessary. I'll briefly sketch out two approaches. Here is one where you define a very simple function to load the data as a NumPy array, which will be executed lazily as you drag the slider:
import holoviews as hv
from holoviews.operation.datashader import regrid
hv.extension('bokeh')
def load_image(z):
    arr = ...  # Load data for a particular z-value from disk as a NumPy array
    left, bottom, right, top = ...  # spatial extent of the image
    return hv.Image(arr, bounds=(left, bottom, right, top), vdims='Intensity')
# Define DynamicMap with z-dimension to slide through
image_stack = hv.DynamicMap(load_image, kdims=['z'])
# Apply regridding in case data is large and set a global Intensity range
regridded = regrid(image_stack).redim.range(Intensity=(0, 1000))
display_obj = regridded.opts(plot={'Image': dict(width=400, height=400, colorbar=True)})
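For the `load_image` stub above, here is a minimal sketch of what a lazy per-frame loader might look like when the whole stack sits in one large raw binary file, using `np.memmap` so that only the requested z-slice is actually read from disk. The file name, shape, and dtype are hypothetical stand-ins for your real acquisition data:

```python
import numpy as np

# Hypothetical stack geometry; adjust to your data.
n_z, height, width = 50, 64, 64

# Create a small on-disk stack to stand in for a real acquisition file.
stack_path = "stack.dat"
np.arange(n_z * height * width, dtype=np.float32).tofile(stack_path)

# Memory-map the file: no frame data is read until it is indexed.
stack = np.memmap(stack_path, dtype=np.float32, mode="r",
                  shape=(n_z, height, width))

def load_frame(z):
    """Return the z-th frame as an in-memory NumPy array."""
    return np.asarray(stack[z])

frame = load_frame(3)
print(frame.shape)  # (64, 64)
```

A function like `load_frame` could slot straight into the `DynamicMap` above, wrapping each frame in an `hv.Image` before returning it.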
Another approach is to use xarray to wrap and label your dask arrays (note that this depends on a datashader PR I just opened: https://github.com/bokeh/datashader/pull/479):
import dask.array as da
import numpy as np
import xarray as xr
import holoviews as hv
from holoviews.operation.datashader import regrid
hv.extension('bokeh')
# Load DataArray in some way
sizes = (1000, 1000, 100)
darr = da.random.random(sizes, chunks=100)
# Wrap in xarray DataArray and label coordinates
dims = ['z', 'y', 'x']
coords = {d: np.arange(s) for d, s in zip(dims, sizes)}
xrstack = xr.DataArray(darr, dims=dims, coords=coords)
# Wrap in HoloViews Dataset
ds = hv.Dataset(xrstack)
# Convert to stack of images with x/y-coordinates along axes
image_stack = ds.to(hv.Image, ['x', 'y'], dynamic=True)
# Apply regridding if each image is large
regridded = regrid(image_stack)
# Set a global Intensity range
regridded = regridded.redim.range(Intensity=(0, 1000))
# Set plot options
display_obj = regridded.opts(plot={'Image': dict(colorbar=True, width=400, height=400)})
Once you have the `display_obj` you can display it directly in the notebook, or use `hv.renderer('bokeh').server_doc(display_obj)` to turn the whole thing into a Bokeh server app.
Sounds good, thanks! I've merged that PR, which looked uncontroversial to me. Is there a dataset that we could use to make this a public example? I'd like to try it out, myself, particularly for the case where each image is very large and we're only looking at a small patch of it at a time.
Thanks very much for the feedback. Sorry I've been slow replying on this front. I'm in the very early stages of playing with holoviews and datashader. That said, this should work nicely based on my early prototyping, which borrows heavily from the comments here. :)
Most of the image data I'm playing with tends to have smaller frames at the moment (roughly 1000x1000), though lots of them (on the order of 10,000), but some larger-frame data may well crop up in the future. Some good public data of this sort can be found at Neurofinder, though it is all calcium imaging data.
As you mentioned EM data, I would suggest taking a look at the data hosted in our local DVID instance. This is all public data AFAIK. Maybe @stuarteberg can point you to some of interest. There is a viewer in the web interface to look at various portions, and some open source software to retrieve data from DVID, though you can also use the web API to grab pieces and stitch them together. @mrocklin did this with Dask on some older, now unavailable(?), data in this gist.
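The grab-and-stitch idea can be sketched with plain NumPy: fetch a grid of fixed-size tiles and assemble them into one mosaic with `np.block`. Here `fetch_tile` is a hypothetical stand-in for an actual DVID web-API request; it just synthesizes a labeled block so the example is self-contained:

```python
import numpy as np

def fetch_tile(row, col, tile=32):
    """Hypothetical stand-in for a web-API call returning one
    (tile x tile) block; real code would issue an HTTP request."""
    return np.full((tile, tile), row * 10 + col, dtype=np.uint8)

# Fetch a 2x3 grid of tiles and stitch them into one mosaic.
rows, cols, tile = 2, 3, 32
mosaic = np.block([[fetch_tile(r, c, tile) for c in range(cols)]
                   for r in range(rows)])

print(mosaic.shape)  # (64, 96)
```

With dask, the same pattern becomes lazy: wrap each fetch in `dask.delayed`, turn it into a chunk with `da.from_delayed`, and concatenate, so tiles are only downloaded when their region is viewed.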
Hope that helps. Please let me know if you have any other questions.
Before I embark on this journey myself, I just wanted to ask whether anyone has done further work in this direction?
Yes! To quote from issue https://github.com/ioam/geoviews/issues/144 :
You can see an example of navigating an EM stack at pangeo.pydata.org (launch a server and select scikit-image-example.ipynb from the examples). If you try that example on your own machine, you should use the very latest HoloViews alpha release 1.10.0a2 (from
conda install -c ioam/label/dev holoviews
), which has massive speed improvements for this case.
I'm attempting to follow the examples by @philippjfr, except with `hv.RGB`, and I'm running into issues. I have multiple jpeg (3-channel) images in a directory that I've loaded into an xarray with dimensions z, x, y, channel, where z indexes the images. I want to be able to display the images with a slider to navigate through the z dimension. I get an error when running:
xrstack = ... # load images
ds = hv.Dataset(xrstack)
image_stack = ds.to(hv.RGB, ['x', 'y'], dynamic=True)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-15-80b5bac31eea> in <module>
----> 1 image_stack = ds.to(hv.RGB, ['x', 'y'], dynamic=True)
/data/CondaEnvs/ya006948/envs/sbda/lib/python3.7/site-packages/holoviews/core/data/__init__.py in __call__(self, new_type, kdims, vdims, groupby, sort, **kwargs)
141 (max_d is not None and len(dims) > max_d)):
142 raise ValueError("%s %s must be between length %s and %s." %
--> 143 (type_name, dim_type, min_d, max_d))
144
145 if groupby is None:
ValueError: RGB vdims must be between length 3 and 4.
From looking at the docs for `hv.RGB`, it seems like inputs of the form Z (NxMx(3 or 4)) should be accepted. If this isn't supported when declared through `ds.to(...)`, how can I reshape my dataset into the form `hv.RGB` expects?
I'm trying to figure out to what degree this is possible with datashader. Having not really used the library yet, this question may be pretty naive. Thanks in advance for any advice.
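The "RGB vdims must be between length 3 and 4" error suggests the channel axis is being treated as a key dimension rather than as three value dimensions. A plain-NumPy sketch of the reshaping (the stack shape here is a made-up stand-in for the decoded jpegs):

```python
import numpy as np

# Hypothetical stack: 4 frames of 8x8 RGB images already decoded to arrays.
stack = np.random.randint(0, 256, size=(4, 8, 8, 3), dtype=np.uint8)

# hv.RGB wants R, G, B as *value* dimensions, so split the trailing
# channel axis into separately named arrays instead of leaving
# 'channel' as a coordinate dimension in the Dataset.
channels = {name: stack[..., i] for i, name in enumerate("RGB")}

print(channels["R"].shape)  # (4, 8, 8) -- one value array per channel
```

These per-channel arrays can then be attached to the xarray Dataset as separate data variables and declared as vdims when calling `ds.to` (e.g. `ds.to(hv.RGB, ['x', 'y'], ['R', 'G', 'B'], dynamic=True)`); whether that exact call works may depend on the HoloViews version, so treat it as a sketch rather than a guaranteed recipe.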
I am interested in taking a stack of arrays (larger than memory) and navigating through them in a Jupyter notebook with a slider. It would be nice to see the intensity range in a colorbar as well. The arrays themselves are usually `float32`, though sometimes might be a different dtype (e.g. `int16`). The values in the arrays are simply intensities. I would want to normalize the view based on the maximum and minimum values in the stack. I can pretty easily represent this stack of images using a Dask Array that either pulls from a computation or, if necessary, pulls from a large file on disk. I could also provide something that acts like a NumPy array, reading directly from disk. To what extent is something like this possible? Are there any useful examples that might be relevant for this use case? What sort of issues might one encounter trying to handle it?