petrelharp / local_pca

Methods for examining PCA locally along the genome.
71 stars 13 forks source link

Using lostruct package on pca data coming from other programs #8

Closed clairemerot closed 5 years ago

clairemerot commented 5 years ago

I read with much interest you article on Local PCA and I'm interested in using this method on my dataset. I was wondering whether I could ask you for a piece of advice about how to integrate my kind of data into the loscruct pipeline?

I am working with seaweed flies and using genomic data to detect structural variation and population structure. As we have hundreds of genome at low coverage we tend to use probabilistic methods to infer covariance and pca (as implemented in angsd and pcangsd) and I was wondering how to apply the method that you developped to that sort of data.

For now, I have obtained covariance matrix between the individuals and/or pca matrix along the genome ( by splitting the initial data for a beagle in windows of X snps and applying pcangsd). First of all, have you ever tested that? or consider implementing that in the analyses? Or consider an intermediate step where the user could feed to eigen-windows function a set of covariance matrices by windows?

The simple way is perhaps that I transform each window's pca into a matrix to feed the function pc_dist. Then I'll be able to use the functions that you implemented to calculate distances and MDS between windows. To do so, I'd like to be sure of the format. 1) How I can obtain the first value (sum of square of the covariance matrix)? 2) if I followed well the format, there are k eigen-values (this is ok) and then 3) k eigen-vectors (are they coordinates of the n individuals along pc_1, then pc_2, .. pc_k or rotation vectors?)

Thanks a lot for your attention and help! Claire Mérot

petrelharp commented 5 years ago

Hello! What you're describing sounds like a good idea, but I have not tested it. You should make sure to check that any patterns you get out aren't being driven by variaition in data quality along the genome.

So, let's see. The function cov_pca( ) actually already accepts a covariance matrix. So, supposing your per-window covariance matrices are in a list window_covmats, then you could do something like:

window_eigs <- sapply(window_covmats, function (cm) cov_pca(k=2, covmat=cm))
pc_dist(window_eigs)

Here, window_eigs should have the same structure as the output of eigen_windows( ), and so you can apply pc_dist( ) to it. But, I haven't tested this code! Please double-check and let me know if you have issues.

clairemerot commented 5 years ago

Thanks a lot for you rapid answer. It works perfectly, and the results fully make sense with window belonging to the same inversion being closer to each other. I'll expand the analyses on a wider dataset now! Thank you so much!

petrelharp commented 5 years ago

I look forward to seeing what happens! I'm going to close this now; feel free to reopen if you run into issues.