scisprints / 2018_05_sklearn_skimage_dask

BSD 3-Clause "New" or "Revised" License

ML Dimensionality Reduction techniques for large arrays #7

Open jakirkham opened 6 years ago

jakirkham commented 6 years ago

One of the things that came up during the ImageXD conference, and which is also of great interest in our lab, is how to perform dimensionality reduction with Dask, particularly on large arrays. These may be stacks of images or other data. As these are usually large-array problems, the interest is in using Dask Arrays to work on data that would be impractical to handle otherwise. Some techniques of interest include matrix factorization methods like Dictionary Learning, NMF, etc.

mrocklin commented 6 years ago

Are there particular algorithms you are optimistic about that would be both useful on some of the example datasets and also feasible to implement and test within a week?

valentina-s commented 6 years ago

I will be happy to work on NMF (though remotely). There are two families of algorithms usually used: alternating least squares and multiplicative update methods. I will need to think (and check some references) about which is easier to make distributed. In general they are all iterative and require some care to keep the task graph from growing too large.

A famous test case for NMF (and dictionary learning) is to apply it on the faces dataset:

http://scikit-learn.org/stable/auto_examples/decomposition/plot_faces_decomposition.html#sphx-glr-auto-examples-decomposition-plot-faces-decomposition-py
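For reference, the multiplicative update method mentioned above can be sketched in a few lines of NumPy. This is a minimal, hypothetical implementation of the classic Lee–Seung updates (not code from any notebook in this thread); with `dask.array` the same expressions would build a lazy graph, though in practice each iteration would need to be persisted to keep the graph from growing too large:

```python
import numpy as np

def nmf_mu(X, rank, n_iter=500, eps=1e-10, seed=0):
    """Lee & Seung multiplicative updates for X ≈ W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], rank)) + eps
    H = rng.random((rank, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update H with W fixed
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
    return W, H

# Fit a synthetic non-negative rank-5 matrix.
rng = np.random.default_rng(1)
X = rng.random((40, 5)) @ rng.random((5, 30))
W, H = nmf_mu(X, rank=5)
```

The updates multiply each factor entrywise by a ratio of non-negative matrices, so non-negativity is preserved automatically; the `eps` term only guards against division by zero.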

mrocklin commented 6 years ago

Here is a small example doing alternating least squares on multi-dimensional arrays: https://gist.github.com/mrocklin/6fc759ab829a44c4f1969a6d6fc9dd28

This is a naive implementation though, and not necessarily of use here; I thought I'd post it just as an example.
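In the same spirit (and not a reproduction of the gist above), alternating least squares for a plain two-factor matrix case can be sketched as follows: fix one factor, solve a least-squares problem for the other, and alternate. All names here are illustrative:

```python
import numpy as np

def als_factorize(X, rank, n_iter=100, seed=0):
    """Naive ALS: alternately fix one factor and least-squares-solve for the other."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[0], rank))
    H = rng.standard_normal((rank, X.shape[1]))
    for _ in range(n_iter):
        H = np.linalg.lstsq(W, X, rcond=None)[0]        # solve W @ H ≈ X for H
        W = np.linalg.lstsq(H.T, X.T, rcond=None)[0].T  # solve H.T @ W.T ≈ X.T for W
    return W, H

# Recover a synthetic rank-5 matrix.
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5)) @ rng.standard_normal((5, 20))
W, H = als_factorize(X, rank=5)
```

Each half-step is a convex least-squares problem, which is what makes ALS attractive to distribute: the solves decompose row-wise (or column-wise) across chunks.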

jakirkham commented 6 years ago

Thanks for chiming in @valentina-s. Was meaning to raise this to your attention. :)

Think we should be able to work something out so we can chat/share code snippets. @NelleV are you aware of any good resources for this that we could use during the sprint?

Indeed, the faces dataset would be a good one to use. It is also common to apply matrix factorization techniques to calcium imaging data, which we have a fair bit of. I haven't looked through the other datasets people have provided, but some of them may be good candidates for this as well.

Honestly even working out a very rough implementation during the conference would be very useful. These sorts of techniques are pretty important for us at work in a wide range of applications. So it's pretty easy to justify spending time on improving them afterwards.

valentina-s commented 6 years ago

@jakirkham maybe the datasets on Neurofinder could be a good choice since those are already public?

I started to compile some links:

https://amplab.cs.berkeley.edu/scientific-matrix-factorizations-in-spark-at-scale/
https://rserizel.github.io/minibatchNMF/
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.MiniBatchDictionaryLearning.html
https://github.com/RaRe-Technologies/gensim/issues/132
https://blog.paperspace.com/dimension-reduction-with-independent-components-analysis/
https://github.com/thunder-project/thunder-factorization/tree/master/factorization/algorithms

jakirkham commented 6 years ago

Thanks for compiling these, @valentina-s. Looks like we have our reading cut out for us.

Something else that @GaelVaroquaux was sharing earlier is MODL, which would be good to take a look at. It would work best running on a single powerful node.

Agree that the Neurofinder data would work well for this. I'm talking to @TomAugspurger about making this data easily available from Pangeo. Typically this data is restructured as a 2D matrix where one dimension is the raveled spatial coordinates and the other is time. The goal is to find a few meaningful representative images that can be used to reconstruct the original data.
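That restructuring step is just a reshape that ravels the spatial axes together. A small sketch with a hypothetical movie (the shapes and chunk sizes are made up for illustration):

```python
import numpy as np

# Hypothetical calcium-imaging movie: 100 frames of 64x64 pixels (time, y, x).
movie = np.random.default_rng(0).random((100, 64, 64))

# Ravel the two spatial axes so rows are time points and columns are pixels.
mat = movie.reshape(movie.shape[0], -1)   # shape (100, 4096)

# The same reshape works lazily on a Dask array, e.g.:
#   import dask.array as da
#   dmovie = da.from_array(movie, chunks=(10, 64, 64))
#   dmat = dmovie.reshape(dmovie.shape[0], -1)
```

A factorization of `mat` then yields spatial components (rows of the `H` factor, un-raveled back to 64x64 images) and their time courses.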

valentina-s commented 6 years ago

I think having an option to test with a big dataset on a cluster will be great.

I started the brute force conversion of the multiplicative method for NMF here:

https://github.com/valentina-s/daskNMF/blob/master/ExploreNMFmu.ipynb

and I am planning to convert the coordinate descent solver as well, and to test on my laptop with the faces dataset (not really the best testing setup, but something to start with).

I think there are two scenarios for images:

I considered the first case, but I think after reading some of the references I will have a better idea of what makes sense.

jakirkham commented 6 years ago

FWIW, I have built a copy of modl for conda in my channel. Only macOS and Linux so far (working on getting access to a Windows machine as well). The versioning is a little odd; sorry about that, though conda does seem fine installing it anyway. I also ran the test suite as part of the build, so it should work. Note that this uses nearly everything from conda-forge, with the exception of the compiler runtime libraries (e.g. libgcc on Linux), which came from defaults.

ref: https://anaconda.org/jakirkham/modl