tskit-dev / tskit

Population-scale genomics
MIT License
151 stars 71 forks source link

compute covariance matrix and PCA #275

Open petrelharp opened 5 years ago

petrelharp commented 5 years ago

To compute a covariance matrix, or do PCA, we currently have to export the genotype matrix. It would be nice to do this using the statistics framework, and totally do-able. It is also possible to find PCs without computing the covariance matrix first. This would be a nice compact but very useful contribution for someone to take on.

daniel-trejobanos commented 4 years ago

I am curious about this, specially how to compute the PCA without the covariance matrix? do you have a paper explaining this?

petrelharp commented 4 years ago

I have not worked out the details, but the basic idea is that the PCs are the eigenvectors of the genetic covariance matrix, and modern iterative methods (like Krylov methods) exist to find the eigenvectors of a matrix A without ever computing A explicitly, but rather finding the result of multiplying A by some random vectors. This paper: https://www.ncbi.nlm.nih.gov/pubmed/26924531 does something like this. In our situation, A = G^T G, where G is the genotype matrix (possibly normalized), and so we can use our general statistics to quickly compute u^T A v for vectors u and v.

That's the general idea. I have not worked out the details, so it's possible there's something very tricky in there, but I'm happy to help work it through.

hyanwong commented 1 year ago

Also see performance issues in https://github.com/tskit-dev/tskit/issues/1743