browserdotsys closed this issue 5 years ago
That's getting into the guts of the linear algebra used for the initialisation of the embedding, which is a deeper issue than I can deal with at the moment. As a workaround you can use init='random' to set up random initialisation and avoid this -- that's not ideal, but it should at least get you around this problem.
As another note, if you are working with raw sparse data and it is, say, count or frequency data, then I would recommend you consider using metric='cosine' (or, in future versions of umap, metric='hellinger').
Great, thanks for the advice! I'll give it a shot.
As a quick update on this, I'm an idiot and was running UMAP on a transposed version of what I meant to. Adding a .T fixed everything quite nicely, even with the default initialization. Thanks again!
If you're curious by the way, here's what it produced on the 1000 genomes data. I was really impressed at how well UMAP did here, thanks for making this library – it's amazing!
I am glad to see LOBPCG working... You may also want to play with the tolerance below. The default tol=1e-15 seems very excessive to me in this setup.
```
/home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/sklearn/manifold/spectralembedding.pyc in spectral_embedding(adjacency, n_components, eigen_solver, random_state, eigen_tol, norm_laplacian, drop_first)
    324         X[:, 0] = dd.ravel()
    325         lambdas, diffusion_map = lobpcg(laplacian, X, tol=1e-15,
--> 326                                         largest=False, maxiter=2000)
```
Thanks for the suggestion @lobpcg; this is not my area of expertise, so do you have some suggested default values that should be good?
The optimal tolerance is difficult to predict theoretically, since it is problem dependent. Practically speaking, just try making tol larger than 1e-15 and/or maxiter smaller than 2000 and see what happens. You need to get some practical experience with your own data to find the best balance, keeping accuracy good enough while cutting the compute time, if your code runs for too long with the default values.
I've seen examples where even maxiter=2 would be enough for the overall goal. Have fun!
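To make the suggestion concrete, here is a small standalone sketch of calling scipy's LOBPCG with a looser tolerance and iteration cap than sklearn's hard-coded tol=1e-15, maxiter=2000. The path-graph Laplacian and the exact tol/maxiter values are illustrative stand-ins, not prescriptions:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import lobpcg

# A small path-graph Laplacian as a stand-in for UMAP's graph Laplacian
n = 100
L = diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n)).toarray()
L[0, 0] = L[-1, -1] = 1.0  # boundary rows of the path graph

rng = np.random.default_rng(0)
X0 = rng.standard_normal((n, 3))  # random starting block of 3 vectors

# Looser tolerance, fewer iterations; largest=False asks for the
# smallest eigenvalues, as in sklearn's spectral_embedding
eigvals, eigvecs = lobpcg(L, X0, tol=1e-5, maxiter=200, largest=False)
print(np.sort(eigvals))
```

Loosening tol trades a little eigenvector accuracy for potentially far fewer iterations, which is usually acceptable for an embedding initialisation.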
I'm looking at trying to use umap on whole-genome data from the 1000 genomes project. I'm not doing lots of preprocessing, just filtering out variants with minor allele frequency less than 10% and looking at a single chromosome (chr1); this gives me 509925 variants across 2504 individuals (a 509925x2504 matrix).
Here's what I'm doing:
(If any biologists are reading, I'm probably doing this filtering wrong but I think UMAP should still work in this case?)
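The pipeline described (MAF-filter the variants, transpose so rows are individuals, then run UMAP) can be sketched with a synthetic dosage matrix standing in for the real genotype data; all sizes and thresholds below are illustrative, and this is not the author's actual script:

```python
import numpy as np

rng = np.random.default_rng(0)
n_variants, n_samples = 1000, 50

# Alternate-allele dosage matrix (0/1/2), variants x samples;
# a stand-in for what scikit-allel's GenotypeArray.to_n_alt() returns
G = rng.integers(0, 3, size=(n_variants, n_samples))

# Alternate-allele frequency per variant, then minor allele frequency
alt_freq = G.sum(axis=1) / (2 * n_samples)
maf = np.minimum(alt_freq, 1 - alt_freq)

# Keep variants with MAF >= 10%, then transpose so rows are individuals
# (the missing .T was the bug discussed in this thread)
X = G[maf >= 0.10].T
print(X.shape)

# X is now (individuals, variants), ready for umap.UMAP().fit_transform(X)
```

The transpose matters because UMAP embeds rows: a 509925x2504 matrix would embed half a million variants rather than 2504 individuals.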
Initially I was running into recursion-limit errors, but those went away after increasing the recursion limit (first 3 lines above). However, after a couple hours I get the following traceback:
Any idea what's going wrong here? If you want to reproduce, the VCF can be found here:
http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
The only other dependency aside from UMAP is scikit-allel.

Versions of various things: