sriramlab / ProPCA

Fast EM algorithm for a Probabilistic PCA model for Genotype data
MIT License

dumb Q #13

Closed: wbsimey closed this issue 3 years ago

wbsimey commented 3 years ago

I was able to run ProPCA on my SNP data set from a vcf.gz file after converting it to PLINK bed format. I have 96 individuals, 2 populations (k=2), and ~500,000 SNPs. I ran:

propca -g Chr26_2pops_HW -k 2 -l 2 -m 20 -a -cl 0.001 -o Chr26_2pops_ -aem 1 -vn -nfm

It completed quickly (in a few minutes), compared to days using adegenet in R. ProPCA generated eigenvectors, eigenvalues, and projections files.

The dumb question: now what? How do I use these files to generate a PCA plot? Any chance you could add that step to your GitHub page?

Thanks

alecmchiu commented 3 years ago

Hello,

There should be an output file with the suffix projections.txt. It contains the coordinates of the samples projected onto the principal components. In your example, the file should be named Chr26_2pops_projections.txt. Each column of this file holds the coordinates for its respective principal component (PC): the first column contains the coordinates on PC1, the second column the coordinates on PC2, and so on.

To generate a plot, you can load the file into Python or R. Since you mentioned you have used R, here is an example snippet that plots PC1 against PC2:

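# Read the projections: one row per sample, one column per PC
pca <- read.table("Chr26_2pops_projections.txt")
# Scatter plot of the first two principal components
plot(pca[, 1], pca[, 2], xlab = "PC1", ylab = "PC2")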
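
If you would rather use Python, a rough equivalent might look like the sketch below. It assumes the projections file is whitespace-delimited with no header row (the same assumption the read.table call above makes by default) and that matplotlib is installed. The function names load_projections and plot_pcs and the output name pca_plot.png are made up for illustration and are not part of ProPCA.

```python
import os


def load_projections(path):
    """Read a whitespace-delimited projections file into a list of rows.

    Assumes one row per sample and one column per PC, with no header line.
    """
    rows = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                rows.append([float(value) for value in fields])
    return rows


def plot_pcs(rows, x_pc=1, y_pc=2, out_path="pca_plot.png"):
    """Scatter-plot one PC against another (PC numbers are 1-based)."""
    # Imported here so that loading the data does not require matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    xs = [row[x_pc - 1] for row in rows]
    ys = [row[y_pc - 1] for row in rows]
    fig, ax = plt.subplots()
    ax.scatter(xs, ys)
    ax.set_xlabel(f"PC{x_pc}")
    ax.set_ylabel(f"PC{y_pc}")
    fig.savefig(out_path, dpi=150)
    plt.close(fig)


if __name__ == "__main__":
    # File name taken from the -o prefix used in the command above.
    projections_file = "Chr26_2pops_projections.txt"
    if os.path.exists(projections_file):
        plot_pcs(load_projections(projections_file))
```

Run from the directory containing the projections file, this writes the plot to pca_plot.png. Passing different x_pc and y_pc values plots other pairs of components, provided the run produced that many columns.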

Hopefully this helps!

wbsimey commented 3 years ago

It worked, thank you!