mmp2 / megaman

megaman: Manifold Learning for Millions of Points
http://mmp2.github.io/megaman/
BSD 2-Clause "Simplified" License

example workflow #54

Closed hoyleb closed 8 years ago

hoyleb commented 8 years ago

Dear Mega-team, great job with the code. Might you be able to help with some conceptual difficulties that I'm having?

I'd like to take a data set of size (rows=250M, features=5) and embed it into 2 or 3 dimensions with SpectralEmbedding. I'm finding very long computation times. Does it make more sense to call fit_transform() on a subset of the data, and then apply the resulting mapping to all the data in the sample? If so, I can't figure out how to do this from the documentation.

I'm following the bare-bones example I find here (recreating the megaman image), and am quite new to many of the concepts that megaman has to offer.

I see a fit_transform() method, but nothing like scikit-learn's transform() or predict().

Thanks,

Ben

jmcq89 commented 8 years ago

Hi Ben,

Which part of the computation is slow for you? It may be that you selected a very large radius, so that the resulting neighborhood graph is nearly dense. That would slow down both the neighbor calculation and the eigendecomposition. Which neighborhood method are you using? I would suggest using 'cyflann' if you're not already, and selecting 'amg' for the eigendecomposition for best results.
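One rough way to check whether a given radius makes the graph nearly dense is to count neighbors on a small random subsample before running anything on the full data. The sketch below uses plain NumPy with brute-force distances rather than megaman or cyflann; the function name `mean_neighbors_within_radius` and the synthetic 5-feature data are assumptions made only for illustration.

```python
import numpy as np


def mean_neighbors_within_radius(X, radius, n_queries=200, seed=0):
    """Average number of points within `radius` of a randomly chosen point.

    Brute force on purpose: intended for a subsample of a few thousand rows,
    not for the full data set.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n_queries = min(n_queries, len(X))
    queries = X[rng.choice(len(X), size=n_queries, replace=False)]
    # Squared Euclidean distances from each query point to every point.
    diffs = queries[:, None, :] - X[None, :, :]
    sq_dists = np.einsum("ijk,ijk->ij", diffs, diffs)
    # Subtract 1 so that a query point does not count itself.
    counts = (sq_dists <= radius ** 2).sum(axis=1) - 1
    return float(counts.mean())


if __name__ == "__main__":
    # Synthetic stand-in for a subsample of the real data (assumption).
    rng = np.random.default_rng(42)
    sample = rng.standard_normal((5000, 5))
    for r in (0.25, 0.5, 1.0, 2.0):
        k = mean_neighbors_within_radius(sample, r)
        frac = k / (len(sample) - 1)
        print(f"radius={r}: about {k:.1f} neighbors per point ({frac:.2%} of the subsample)")
```

For a uniform random subsample it is the fraction, not the raw count, that roughly carries over to the full data: a radius that captures 1% of the subsample would give on the order of 2.5M neighbors per point on 250M rows.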

As for a predict() or transform() method, unfortunately most manifold learning algorithms (spectral embedding included) have no natural out-of-sample extension. That is, it is non-trivial to fit on a training set and then apply the transform to a test set without re-running the algorithm on the entire data set. This is one of the current drawbacks of manifold learning. In the future we plan to offer a Nyström extension, an approximation that extends an eigendecomposition to the full matrix when the decomposition of a submatrix is known. This would allow such an out-of-sample prediction procedure, but it is currently future work.
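To make the idea concrete, here is a minimal sketch of a Nyström-style out-of-sample extension, written in plain NumPy. It is not megaman's API and not necessarily how megaman would implement it: it assumes a dense Gaussian kernel, a symmetrically normalized affinity matrix, and a landmark subset small enough for a dense eigendecomposition. The names `gaussian_kernel`, `fit_landmarks`, and `nystrom_extend` are hypothetical.

```python
import numpy as np


def gaussian_kernel(A, B, radius):
    """Dense Gaussian affinities exp(-||a - b||^2 / radius^2) between rows of A and B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    np.maximum(sq, 0.0, out=sq)  # clip tiny negatives caused by round-off
    return np.exp(-sq / radius ** 2)


def fit_landmarks(X, n_components, radius):
    """Spectral embedding of a landmark subset X via a dense eigendecomposition."""
    K = gaussian_kernel(X, X, radius)
    deg = K.sum(axis=1)
    M = K / np.sqrt(np.outer(deg, deg))  # D^{-1/2} K D^{-1/2}
    evals, evecs = np.linalg.eigh(M)     # eigenvalues in ascending order
    # Drop the trivial top eigenvector, keep the next n_components.
    order = np.argsort(evals)[::-1][1:n_components + 1]
    return {"X": X, "deg": deg, "radius": radius,
            "evals": evals[order], "evecs": evecs[:, order]}


def nystrom_extend(model, X_new):
    """Approximate embedding coordinates for points that were not in the landmark set."""
    K_new = gaussian_kernel(X_new, model["X"], model["radius"])
    deg_new = K_new.sum(axis=1)
    M_new = K_new / np.sqrt(np.outer(deg_new, model["deg"]))
    # Nystrom formula: f_i(x) = (1 / lambda_i) * sum_j m(x, x_j) * u_ji
    return (M_new @ model["evecs"]) / model["evals"]


if __name__ == "__main__":
    # Synthetic curve in 5 features, standing in for real data (assumption).
    rng = np.random.default_rng(0)
    t = rng.uniform(0.0, 3.0 * np.pi, 2000)
    X = np.column_stack([np.cos(t), np.sin(t), 0.3 * t,
                         0.01 * rng.standard_normal(t.size),
                         0.01 * rng.standard_normal(t.size)])
    landmarks, rest = X[:300], X[300:]
    model = fit_landmarks(landmarks, n_components=2, radius=0.5)
    print(nystrom_extend(model, rest).shape)
```

By construction, extending a landmark point returns its fitted coordinates, which makes a simple sanity check. The quality of the approximation for new points depends on how well the landmarks cover the data, so the subset size and the radius would both need tuning.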

Sincerely, James

hoyleb commented 8 years ago

Thanks James. Yes, I was using cyflann and amg. I'll experiment with the radius now; I was using r=1.0. Thanks for the other explanations. Ben