Open jakirkham opened 6 years ago
Are there particular algorithms that you are optimistic about, ones that would be both useful on some of the example datasets and feasible to implement and test within a week?
I will be happy to work on NMF (though remotely). There are two types of algorithms usually used: alternating least squares and multiplicative methods. I will need to think (and check some references) regarding which is easier to make distributed. In general they are all iterative and require some thought to avoid the graph becoming too large.
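For reference, here is a minimal single-machine sketch of the multiplicative method, assuming the standard Lee-Seung updates for the Frobenius-norm objective. The function name `nmf_multiplicative` and its parameters are made up for illustration, and it uses plain NumPy rather than Dask, so it says nothing yet about how to keep a distributed task graph small.

```python
import numpy as np


def nmf_multiplicative(X, n_components, n_iter=200, eps=1e-10, seed=0):
    """Factor a nonnegative matrix X (n x m) as W @ H with W, H >= 0.

    Uses multiplicative updates for the squared Frobenius error.
    Illustrative sketch only; no convergence check or regularization.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Random nonnegative starting point.
    W = rng.random((n, n_components))
    H = rng.random((n_components, m))
    for _ in range(n_iter):
        # Each update rescales the factor elementwise, so nonnegativity
        # is preserved; eps guards against division by zero.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
    return W, H


def relative_error(X, W, H):
    """Frobenius reconstruction error relative to the norm of X."""
    return np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

Every iteration depends on the previous one, which is where the concern about graph size comes from: if these updates were written lazily on Dask arrays, the graph would grow with each iteration unless intermediate results are computed or persisted along the way.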
A famous test case for NMF (and dictionary learning) is to apply it on the faces dataset:
Here is a small example doing alternating least squares on multi-dimensional arrays: https://gist.github.com/mrocklin/6fc759ab829a44c4f1969a6d6fc9dd28
This is a naive implementation, though, and not necessarily of use here; I thought I'd post it just as an example.
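For the plain 2D NMF case, an alternating least squares scheme could be sketched roughly as below. This is not the code from the gist: it is a simple "projected" variant (solve an unconstrained least-squares problem, then clip negatives to zero), and the name `nmf_projected_als` is made up for illustration. The clipping step is a heuristic, so the error is not guaranteed to decrease at every iteration.

```python
import numpy as np


def nmf_projected_als(X, n_components, n_iter=50, seed=0):
    """Approximate X (n x m) as W @ H with W, H >= 0 via projected ALS.

    Illustrative sketch only: each half-step solves an unconstrained
    least-squares problem and then clips negative entries to zero.
    """
    if n_iter < 1:
        raise ValueError("n_iter must be at least 1")
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], n_components))
    for _ in range(n_iter):
        # Fix W, solve min_H ||X - W H||_F, project onto H >= 0.
        H = np.clip(np.linalg.lstsq(W, X, rcond=None)[0], 0, None)
        # Fix H, solve min_W ||X - W H||_F through the transposed
        # system H.T @ W.T = X.T, then project onto W >= 0.
        W = np.clip(np.linalg.lstsq(H.T, X.T, rcond=None)[0].T, 0, None)
    return W, H
```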
Thanks for chiming in @valentina-s. Was meaning to raise this to your attention. :)
Think we should be able to work something out so we can chat/share code snippets. @NelleV are you aware of any good resources for this that we could use during the sprint?
Indeed the faces dataset would be a good one to use. Also it is common to apply matrix factorization techniques to calcium imaging data, which we have a fair bit of. Haven't looked through other datasets that people have provided, but maybe some of them would be good candidates for using this on as well.
Honestly even working out a very rough implementation during the conference would be very useful. These sorts of techniques are pretty important for us at work in a wide range of applications. So it's pretty easy to justify spending time on improving them afterwards.
@jakirkham maybe the datasets on Neurofinder could be a good choice since those are already public?
I started to compile some links:
https://amplab.cs.berkeley.edu/scientific-matrix-factorizations-in-spark-at-scale/
https://rserizel.github.io/minibatchNMF/
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.MiniBatchDictionaryLearning.html
https://github.com/RaRe-Technologies/gensim/issues/132
https://blog.paperspace.com/dimension-reduction-with-independent-components-analysis/
https://github.com/thunder-project/thunder-factorization/tree/master/factorization/algorithms
Thanks for compiling these, @valentina-s. Looks like we have our reading cut out for us.
Something else that @GaelVaroquaux was sharing earlier is MODL, which would be good to take a look at. This would work best to run on a single powerful node.
Agree that Neurofinder data would work well for this. Talking to @TomAugspurger about making this data easily available from Pangeo. Typically what people do with this data in particular is restructure it in 2D where one dimension is raveled spatial coordinates and the other is time. The goal is to find some meaningful representative images that can be used to reconstruct the original data.
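The restructuring described above could be sketched like this, assuming the movie is stored as a `(time, y, x)` array. The helper names `movie_to_matrix` and `components_to_images` are made up for illustration, and NumPy is used here for simplicity; Dask arrays also support `reshape`, though how well it performs there depends on the chunking.

```python
import numpy as np


def movie_to_matrix(movie):
    """Reshape a (time, y, x) stack into a (time, y * x) matrix.

    Each row is one frame with its spatial coordinates raveled.
    """
    t, y, x = movie.shape
    return movie.reshape(t, y * x)


def components_to_images(H, frame_shape):
    """Turn a (n_components, y * x) factor back into images."""
    return H.reshape((H.shape[0],) + tuple(frame_shape))


# Tiny synthetic stack: 4 frames of 3 x 2 pixels.
movie = np.arange(4 * 3 * 2, dtype=float).reshape(4, 3, 2)
X = movie_to_matrix(movie)
```

With a factorization `X ≈ W @ H`, the rows of `H` would then be the representative images (after `components_to_images`) and the columns of `W` their weights over time.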
I think having an option to test with a big dataset on a cluster will be great.
I started the brute force conversion of the multiplicative method for NMF here:
https://github.com/valentina-s/daskNMF/blob/master/ExploreNMFmu.ipynb
and I am also planning to convert the coordinate descent solver and test on my laptop with the faces dataset (which is not really the best testing setup, but it is something to start with).
I think there are two scenarios for images:
I considered the first case, but I think after reading some of the references I will have a better idea of what makes sense.
FWIW have built a copy of modl for conda in my channel. Only macOS and Linux (working on getting access to a Windows machine as well). The versioning is a little weird; sorry about that. Though conda does seem ok installing that anyways. Also ran the test suite as part of the build. So it should work. Should add this uses nearly everything from conda-forge with the exception of compiler runtime libraries (e.g. libgcc on Linux), which came from defaults.
One of the things that came up during the ImageXD conference, which is also of great interest in our lab, is how to perform dimensionality reduction in Dask, particularly on large arrays. These may be stacks of images or other things. As these are usually large array problems, the interest is in using Dask Arrays to work on data that would be impractical to work on otherwise. Some techniques of interest include matrix factorization techniques like Dictionary Learning, NMF, etc.