lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

LinAlgError: the leading minor of order 11 of 'b' is not positive definite. The factorization of 'b' could not be completed and no eigenvalues or eigenvectors were computed. #168

Closed · browserdotsys closed this issue 5 years ago

browserdotsys commented 5 years ago

I'm trying to use umap on whole-genome data from the 1000 Genomes Project. I'm not doing much preprocessing, just filtering out variants with a minor allele frequency below 10% and looking at a single chromosome (chr1); this gives me 509925 variants across 2504 individuals (a 509925x2504 matrix).

Here's what I'm doing:

import resource, sys

# raise Python's recursion limit and the process stack size so UMAP's
# tree construction doesn't hit the default recursion limit
sys.setrecursionlimit(10**6)
resource.setrlimit(resource.RLIMIT_STACK, (2**29, -1))

import umap
import allel

# load the 1000 Genomes chr1 genotype calls
callset = allel.read_vcf('ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz')
gt = allel.GenotypeArray(callset['calldata/GT'])

# keep only variants whose alternate allele frequency exceeds 10%
allel_freqs = gt.count_alleles()
maf_1 = gt[((allel_freqs / float(gt.shape[1] * 2))[:, 1] > .1)]

reducer = umap.UMAP()
embedding = reducer.fit(maf_1[:, :, 0])

(If any biologists are reading, I'm probably doing this filtering wrong but I think UMAP should still work in this case?)

Initially I was running into recursion-limit errors, but those went away after increasing the recursion limit (the first three lines above). However, after a couple of hours I get the following traceback:

In [42]: embedding = reducer.fit(maf_1[:,:,0])
  /home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/umap_learn-0.3.6-py2.7.egg/umap/spectral.py:229: UserWarning: Embedding a total of 2215 separate connected components using meta-embedding (experimental)
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-42-784bcd135a58> in <module>()
----> 1 embedding = reducer.fit(maf_1[:,:,0])

/home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/umap_learn-0.3.6-py2.7.egg/umap/umap_.pyc in fit(self, X, y)
   1534             self.metric,
   1535             self._metric_kwds,
-> 1536             self.verbose,
   1537         )
   1538

/home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/umap_learn-0.3.6-py2.7.egg/umap/umap_.pyc in simplicial_set_embedding(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, verbose)
    939             random_state,
    940             metric=metric,
--> 941             metric_kwds=metric_kwds,
    942         )
    943         expansion = 10.0 / initialisation.max()

/home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/umap_learn-0.3.6-py2.7.egg/umap/spectral.pyc in spectral_layout(data, graph, dim, random_state, metric, metric_kwds)
    238             random_state,
    239             metric=metric,
--> 240             metric_kwds=metric_kwds,
    241         )
    242

/home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/umap_learn-0.3.6-py2.7.egg/umap/spectral.pyc in multi_component_layout(data, graph, n_components, component_labels, dim, random_state, metric, metric_kwds)
    120             dim,
    121             metric=metric,
--> 122             metric_kwds=metric_kwds,
    123         )
    124     else:

/home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/umap_learn-0.3.6-py2.7.egg/umap/spectral.pyc in component_layout(data, n_components, component_labels, dim, metric, metric_kwds)
     57     component_embedding = SpectralEmbedding(
     58         n_components=dim, affinity="precomputed"
---> 59     ).fit_transform(affinity_matrix)
     60     component_embedding /= component_embedding.max()
     61

/home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/sklearn/manifold/spectral_embedding_.pyc in fit_transform(self, X, y)
    546         X_new : array-like, shape (n_samples, n_components)
    547         """
--> 548         self.fit(X)
    549         return self.embedding_

/home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/sklearn/manifold/spectral_embedding_.pyc in fit(self, X, y)
    525                                              n_components=self.n_components,
    526                                              eigen_solver=self.eigen_solver,
--> 527                                              random_state=random_state)
    528         return self
    529

/home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/sklearn/manifold/spectral_embedding_.pyc in spectral_embedding(adjacency, n_components, eigen_solver, random_state, eigen_tol, norm_laplacian, drop_first)
    324             X[:, 0] = dd.ravel()
    325             lambdas, diffusion_map = lobpcg(laplacian, X, tol=1e-15,
--> 326                                             largest=False, maxiter=2000)
    327             embedding = diffusion_map.T[:n_components]
    328             if norm_laplacian:

/home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/scipy/sparse/linalg/eigen/lobpcg/lobpcg.pyc in lobpcg(A, X, B, M, Y, tol, maxiter, largest, verbosityLevel, retLambdaHistory, retResidualNormsHistory)
    488
    489         # Solve the generalized eigenvalue problem.
--> 490         _lambda, eigBlockVector = eigh(gramA, gramB, check_finite=False)
    491         ii = np.argsort(_lambda)[:sizeX]
    492         if largest:

/home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/scipy/linalg/decomp.pyc in eigh(a, b, lower, eigvals_only, overwrite_a, overwrite_b, turbo, eigvals, type, check_finite)
    491                           " factorization of 'b' could not be completed"
    492                           " and no eigenvalues or eigenvectors were"
--> 493                           " computed." % (info-b1.shape[0]))
    494
    495

LinAlgError: the leading minor of order 11 of 'b' is not positive definite. The factorization of 'b' could not be completed and no eigenvalues or eigenvectors were computed.

Any idea what's going wrong here? If you want to reproduce, the VCF can be found here:

http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

The only other dependency aside from UMAP is scikit-allel.

Versions of various things:

In [46]: print umap.__version__    
0.3.6

In [47]: print allel.__version__   
1.1.10

In [48]: import sklearn

In [49]: print sklearn.__version__
0.20.0
lmcinnes commented 5 years ago

That's getting into the guts of the linear algebra used for the initialisation of the embedding, which is a deeper issue than I can deal with at the moment. As a workaround you can pass init='random' to use random initialisation instead -- that's not ideal, but it should at least get you past this problem.

As another note, if you are working with raw sparse data and it is, say, count or frequency data, then I would recommend you consider using metric='cosine' (or, in future versions of umap, metric='hellinger').
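For example, a minimal sketch combining both suggestions (maf_1 is the filtered array from the original post):

import umap

# random initialisation avoids the spectral (LOBPCG/eigh) step that is failing,
# and cosine tends to behave better than euclidean on sparse count/frequency data
reducer = umap.UMAP(init='random', metric='cosine')
embedding = reducer.fit(maf_1[:, :, 0])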

browserdotsys commented 5 years ago

Great, thanks for the advice! I'll give it a shot.

browserdotsys commented 5 years ago

As a quick update on this: I'm an idiot and was running UMAP on a transposed version of the matrix I meant to use. Adding a .T fixed everything quite nicely, even with the default initialization. Thanks again!
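In other words, the corrected call was roughly:

# embed the 2504 individuals (rows) using the 509925 variants as features
embedding = reducer.fit(maf_1[:, :, 0].T)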

browserdotsys commented 5 years ago

If you're curious by the way, here's what it produced on the 1000 genomes data. I was really impressed at how well UMAP did here, thanks for making this library – it's amazing!

https://imgur.com/a/CVFNRg0

lobpcg commented 5 years ago

I am glad to see LOBPCG working... You may also want to play with the tolerance below. The default tol=1e-15 seems excessively tight to me for this setup.

/home/bowser/.virtualenvs/genetics/local/lib/python2.7/site-packages/sklearn/manifold/spectral_embedding_.pyc in spectral_embedding(adjacency, n_components, eigen_solver, random_state, eigen_tol, norm_laplacian, drop_first)
    324             X[:, 0] = dd.ravel()
    325             lambdas, diffusion_map = lobpcg(laplacian, X, tol=1e-15,
--> 326                                             largest=False, maxiter=2000)

lmcinnes commented 5 years ago

Thanks for the suggestion @lobpcg; this is not my area of expertise, so do you have some suggested default values that should be good?

lobpcg commented 5 years ago

> Thanks for the suggestion @lobpcg; this is not my area of expertise, so do you have some suggested default values that should be good?

The optimal tolerance is difficult to predict theoretically, since it is problem dependent. Practically speaking, just try making tol larger than 1e-15 and/or maxiter smaller than 2000 and see what happens... If the code runs too long with the default values, you'll need to build up some practical experience with your own data to find the best balance between keeping the accuracy good enough and cutting the compute time.
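For example, here is a throwaway sketch (on a synthetic graph, not the genotype data) of what calling LOBPCG directly with a looser tolerance and a smaller iteration cap than the tol=1e-15, maxiter=2000 quoted above could look like:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import lobpcg

rng = np.random.RandomState(0)

# small random symmetric adjacency matrix standing in for the UMAP graph
adjacency = sp.random(500, 500, density=0.01, random_state=rng)
adjacency = adjacency + adjacency.T
lap = laplacian(adjacency, normed=True)

k = 3                          # number of eigenvectors wanted
X = rng.rand(lap.shape[0], k)  # random initial block of k vectors

# looser tolerance and fewer iterations than the defaults quoted above
eigenvalues, eigenvectors = lobpcg(lap, X, tol=1e-8, maxiter=500, largest=False)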

I've seen examples where even maxiter=2 would be enough for the overall goal. Have fun!