pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

Adapting the pangeo approach to microscopy #144

Closed agvaughan closed 6 years ago

agvaughan commented 6 years ago

Hi all,

I'm a neuroscientist working on a new form of microscopy (in-situ sequencing) that has some particular challenges in visualization. I've been working to build an interactive visualization using holoviews/bokeh, and am now digging into the back end more. I've been admiring the pangeo project from afar, and am looking for advice on adapting some of your back-end infrastructure.

Our approach creates a 5-d array (i.e., data[time, channel, z, y, x]) that we typically visualize as a 4-color image (i.e., a colorized 2d view from data[time, :, z, :, :]). We expect datasets with a maximum size of ~20-100 GB, with shapes up to about [30, 4, 50, 5k, 5k].
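
For concreteness, a minimal sketch of this layout in xarray (shapes scaled way down, names illustrative - this isn't our actual pipeline):

import numpy as np
import xarray as xr

# Stand-in for real microscopy data: (time, channel, z, y, x), scaled down.
data = np.random.rand(3, 4, 5, 64, 64)
volume = xr.DataArray(data, dims=['time', 'channel', 'z', 'y', 'x'], name='intensity')

# A 4-color view is a slice across all channels at fixed time and z:
view = volume.isel(time=0, z=2)  # shape: (channel, y, x)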

I've been using holoviews for visualization of in-memory datasets with good results. For example, I can show 4 channels interactively in separate panes using hv.Image(...).regrid() with sufficient detail and interactivity to be biologically useful. This is fast enough to be usable on my laptop, including various add-ons such as hv.DynamicMap streams.

In the near future I'd like to make two big changes: embrace larger-than-memory datasets, and host data off of the user's machine. Currently data is passed around as numpy > xarray.DataArray > holoviews > bokeh. The approach you folks have here (particularly dask/zarr/gcsfs) seems like a great foundation, but I'm not sure whether our constraints are the same. For instance, I'm not sure where the bottleneck would likely be for my live visualization: networking in gcsfs, passing data around between processes, or simple bandwidth to the user?
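
For reference, the dask/zarr/gcsfs pattern I have in mind is roughly the following (a sketch only; the bucket path is hypothetical):

import gcsfs
import zarr
import dask.array as da

# Hypothetical bucket path; any zarr store on GCS works the same way.
fs = gcsfs.GCSFileSystem(token='anon')
store = gcsfs.GCSMap('my-bucket/microscopy.zarr', gcs=fs)

z = zarr.open_array(store, mode='r')
lazy = da.from_array(z, chunks=z.chunks)  # nothing is read until a slice is computed

# Pulls only the chunks behind this tile over the network:
tile = lazy[0, :, 25, 2000:2500, 2000:2500].compute()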

I'm almost certain I don't know where the real crux is going to be in this project, so would love any and all advice!

Thanks! Alex

PS, thanks to Matt for pointing me here. Matt, if there was anything relevant in my previous email that's worth mentioning, feel free.

mrocklin commented 6 years ago

What kinds of computations do you want to do on this data? Are you just looking at various slices? Are you doing aggregations? Would you want to re-arrange your data by space, by time, etc.?

There is an example currently in the examples folder named something like scikit-image-....ipynb that faces a similar problem. It creates a 3d block of data from several 2d images and then rearranges them to be stored in a blockwise fashion. I wonder if seeing how it gets loaded and manipulated might provide some insight. I'm also curious to see people do something image-y with that data (or any data); I put it up hoping to attract someone exactly like you to the collaboration :)
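
The core pattern in that notebook is roughly this (a sketch, not the notebook's exact code; the filenames are hypothetical):

import dask.array as da
from dask import delayed
from skimage.io import imread

# Each file is one 2d slice of the same shape and dtype.
filenames = ['slice_%04d.png' % i for i in range(50)]
lazy_slices = [da.from_delayed(delayed(imread)(fn), shape=(5000, 5000), dtype='uint8')
               for fn in filenames]
stack = da.stack(lazy_slices, axis=0)  # one chunk per source image

# Rechunk into small 3d bricks so z-slicing and x/y tiling are both cheap.
bricks = stack.rechunk((10, 500, 500))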

So, some concrete questions:

  1. How is your data stored now?
  2. What kinds of computations do you want to run on your data?
  3. What is your time-budget (interactive, don't mind waiting a few minutes, ...)?
  4. Did you want to keep things local or put it on a cluster?
agvaughan commented 6 years ago

Ha, I know that dataset - I did my Ph.D. in Drosophila neuro at Janelia (though not on the FlyEM side). Flashbacks abound...

I'd be happy to test my holoviews/bokeh visualization on the FlyEM data, but I think I'd need a couple of things that don't seem to be available in the pangeo stack (though I could be totally missing something here). First, holoviews, which is listed in pangeo/environments/environment.yml but doesn't show up when I run help('modules') within the notebook.

Second, interactive bokeh plotting - currently I get this

from bokeh.io import output_notebook
output_notebook()
> JavaScript output is disabled in JupyterLab

which is presumably fixable by jupyter labextension install jupyterlab_bokeh.

I haven't tried running this (or installing holoviews) via the jupyter console - let me know if that's kosher.

...

Otherwise, here's the gist of the approach I'm using for now. To summarize: data storage is very flexible, and I'm mostly focused on (cloud) visualization for now, as other things can be run offline.

How is your data stored now?

For my test visualizations I load data from a tiff file and hold it in memory. I'm moving towards HDF5, but this part of the stack is still very flexible. My current wrapper could switch from hdf5 > zarr without too much trouble, I think.
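
The hdf5 > zarr switch I have in mind would be something like this sketch (file and dataset names are hypothetical):

import h5py
import dask.array as da

# Chunk so that single z-planes and small x/y tiles load quickly.
with h5py.File('experiment.h5', 'r') as f:
    source = da.from_array(f['/data'], chunks=(1, 1, 10, 512, 512))
    source.to_zarr('experiment.zarr', component='data')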

What kinds of computations do you want to run on your data?

To start with I'd mostly like to do pretty simple visualizations - i.e. a 2D slice of the data across a few channels, possibly including color compositing.
There is more intensive computation to be done, but that is either extremely fast (e.g. base-calling a single pixel is an arg-max across channels) or can be done offline (filtering, some compressed sensing, neuronal reconstruction, etc.). I'm fine with either pre-computing or losing interactivity for any serious computation.
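
For instance, the base call reduces to an arg-max over the channel dimension, which xarray expresses directly (a sketch, reusing the 5-d volume from my first post):

# volume is a (time, channel, z, y, x) DataArray; at each pixel the called
# base is the index of the brightest channel.
base_calls = volume.argmax(dim='channel')  # shape: (time, z, y, x)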

What is your time-budget (interactive, don't mind waiting a few minutes, ...)

I am most interested in interactive visualization. More complex computation can be offline, but the outputs of that computation will be either n-d rasters or point-clouds that will be visualized on top of the "raw-data" views that I'm working on now. I'm relying on the holoviews/bokeh stack to make these overlays, and to make plots interactive (e.g. pan, zoom, select ROIs).
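
Holoviews composes such overlays with the * operator; a minimal sketch with made-up point data:

import numpy as np
import holoviews as hv
hv.extension('bokeh')

img = hv.Image(np.random.rand(256, 256))
# Hypothetical point-cloud from offline processing, in the same coordinates.
points = hv.Points(np.random.rand(100, 2) - 0.5)
overlay = img * points  # pan/zoom/box-select act on both layers together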

Did you want to keep things local or put it on a cluster?

I'd like to put visualization on a cluster, though I could also set up a single beefy server if it made more sense. One possible outcome of this work is to develop a stack that would serve as a sort of lightweight online ImageJ/Fiji - mostly for visualization, with some image processing / biological interpretation functions available.

mrocklin commented 6 years ago

Interestingly, Holoviews just came up in #145

agvaughan commented 6 years ago

Ah, some good solutions there. I can confirm that using the legacy notebook (/user/.../tree) allows bokeh to work fine.

After conda install -c ioam holoviews bokeh the holoviews examples seem to work in at least the legacy notebook, so I'll push forward with that in the next day or two.

agvaughan commented 6 years ago

Actually, that seems to do it for the basics! Code below will make a regridded 2d view of the EM data, with a slider for Z. Works after conda install -c ioam holoviews bokeh in the legacy notebook.

Tiling in X/Y isn't incredibly fast with default chunking (~2s update for a new tile), but is much faster when chunked as y = [200, 200, 200] (not seamless, but <0.5s; see the chunked variant sketched after the code). The interactivity of changing z depends a lot on zoom level etc.

import xarray as xr
import holoviews as hv
from holoviews.operation.datashader import regrid
hv.extension('bokeh', 'matplotlib')

# x is assumed to be an in-memory 3d numpy array of the EM data, ordered (z, y, x).
zz = range(x.shape[0])
yy = range(x.shape[1])
xx = range(x.shape[2])
data_arr = xr.DataArray(x, coords=[zz, yy, xx], dims=['z', 'y', 'x'])

# Make a holoviews dataset
dataset = hv.Dataset(data_arr, vdims=['ElectronDensity'], datatype=['xarray'])

# Make a holoviews image; the leftover z dimension becomes a slider
img = dataset.to(hv.Image, ['x', 'y'])

# Regrid the image via holoviews.operation.datashader.regrid
regrid(img)

Holoviews output here
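
The chunked variant mentioned above backs the DataArray with a dask array instead of raw numpy, along these lines (a sketch; the chunk shape is my reading of the y = [200, 200, 200] chunking, not tuned):

import dask.array as da

# x, xr, zz, yy, xx as above; chunk y/x into 200-pixel tiles so panning
# computes small tiles instead of whole planes.
x_chunked = da.from_array(x, chunks=(1, 200, 200))
data_arr = xr.DataArray(x_chunked, coords=[zz, yy, xx], dims=['z', 'y', 'x'])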

mrocklin commented 6 years ago

It's pretty cool to see that work in action :) It also gives good motivation for opportunistic caching.
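
For anyone curious, dask's opportunistic cache is a two-liner to try (the 2 GB budget here is arbitrary; it requires the cachey library):

from dask.cache import Cache

cache = Cache(2e9)  # keep up to ~2 GB of frequently reused intermediate chunks
cache.register()    # applies to all subsequent dask computations in the session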

mrocklin commented 6 years ago

@agvaughan again I'm glad to see your engagement here. Now that you've had a moment to play with pangeo.pydata.org what are some of the things that you would like to see to make you and people like you more productive on a system like this?

jakirkham commented 6 years ago

FWIW, Dask arrays now work as a direct input to Holoviews.

Also, if you are interested, I can point you to some basic image processing libraries on top of Dask, which may help as you play with your data.
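
As a quick sketch of the former (a made-up array, not real data):

import dask.array as da
import holoviews as hv
from holoviews.operation.datashader import regrid
hv.extension('bokeh')

lazy = da.random.random((5000, 5000), chunks=(500, 500))
regrid(hv.Image(lazy))  # only the chunks behind the visible tile are computed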

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 6 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.