comparison of PCA from bigsnpr and PLINK

privefl / bigsnpr

R package for the analysis of massive SNP arrays.

https://privefl.github.io/bigsnpr/


comparison of PCA from bigsnpr and PLINK #468

Closed ndimou closed 10 months ago

ndimou commented 10 months ago

Hello,

My objective is to calculate PCs that I can use to adjust my GWAS. I computed the first 20 PCs in two ways: by following all the steps at https://privefl.github.io/bigsnpr/articles/bedpca.html (using the predict option), and by using the --pca option in PLINK (taking the eigenvec values). The two give completely different estimates. I see people simply use eigenvec as adjustment covariates, and I was wondering which is the way to go?

Thank you Niki

privefl commented 10 months ago

I guess there are two things going on here:

- PLINK uses an imprecise approximation (especially for later PCs); cf. Fig. 5 of https://doi.org/10.1093/bioinformatics/bty185
- bed_autoSVD() (if you're using this function) automatically removes long-range LD regions so that the PCs capture only population structure, which means the two decompositions would not use the same set of variants

In conclusion, you should really use bed_autoSVD() over PLINK :')

ndimou commented 10 months ago

Thanks Florian. I removed variants in LD and forced bed_autoSVD() not to do any further pruning, to make sure PLINK and bigsnpr use the same number of variants. However, the difference I get is quite substantial, e.g. 0.012 in PLINK versus 16.51 in bigsnpr for a given PC and sample. Is there a transformation I should apply to the ".eigenvec" file from PLINK so that it mirrors what you get from the "predict" option in bigsnpr?

Thanks!

privefl commented 10 months ago

Ah, you're talking about that difference. I guess what you get from the ".eigenvec" file corresponds to obj.svd$u. PC scores are actually UD (not just U) in the UDV^T decomposition, and UD is what predict() reports.
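To spell out the relation, here is a short sketch in notation that is assumed for illustration and does not come from the thread: G is the scaled genotype matrix (samples in rows, variants in columns), K is the number of PCs kept, and d_k is the k-th singular value.

```latex
\[
  G \approx U D V^{\top}, \qquad
  D = \operatorname{diag}(d_1, \dots, d_K), \qquad
  \underbrace{G V}_{\text{PC scores}} = U D, \qquad
  (U D)_{ik} = d_k \, U_{ik}.
\]
```

Under this reading, each column of the scores is the matching column of U multiplied by one constant d_k, so dividing column k of predict(obj.svd) by obj.svd$d[k] should give back obj.svd$u[, k]. Whether PLINK's ".eigenvec" values match U exactly (rather than up to sign or a further constant) is not established in this thread.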

ndimou commented 10 months ago

Thanks. I checked: the ".eigenvec" file is "similar" (at least on the same scale) to obj.svd$u. Going back to my original question, which one should I use as covariates in my GWAS? I see that eigenvec values were used in previous GWAS.

privefl commented 10 months ago

You should use autoSVD to make sure there is no LD left in PCs.

ndimou commented 10 months ago

LD is accounted for. Let me put it another way: is it obj.svd$u or PC_init <- predict(obj.svd_init) that you would use as a covariate in the GWAS?

privefl commented 10 months ago

I don't think it makes a difference to use U or UD as covariates (because the scale of covariates does not matter in an unpenalized regression).
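The scale argument can be checked on a toy example. The sketch below is not bigsnpr or PLINK code: it uses made-up data for six samples, a hypothetical singular value `d`, and a small hand-written least-squares solver (the helpers `solve` and `ols` are illustrative names) so that it runs with the Python standard library only. It fits the same unpenalized regression twice, once with a PC on the U scale and once with that PC multiplied by `d`.

```python
import math


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(columns, y):
    """Unpenalized least squares via the normal equations (X'X) beta = X'y."""
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in columns]
    return solve(xtx, xty)


# Made-up data: one SNP coded 0/1/2, one PC on the "U" scale, one phenotype.
snp = [0.0, 1.0, 2.0, 1.0, 0.0, 2.0]
pc_u = [0.012, -0.020, 0.031, -0.005, 0.018, -0.036]
y = [1.1, 2.3, 2.9, 1.8, 1.4, 3.2]

d = 1375.0  # hypothetical singular value: the "UD" version of the PC is d * pc_u
pc_ud = [d * v for v in pc_u]
intercept = [1.0] * len(y)

beta_u = ols([intercept, snp, pc_u], y)    # covariate on the U scale
beta_ud = ols([intercept, snp, pc_ud], y)  # same covariate on the UD scale

print("SNP effect with U :", beta_u[1])
print("SNP effect with UD:", beta_ud[1])
print("PC coefficient with U, and d times the one with UD:", beta_u[2], d * beta_ud[2])
```

The SNP effect agrees between the two fits up to floating-point error, and only the coefficient of the PC changes, by the factor `d`. This holds for unpenalized regression; with a penalty on the coefficients, the scale of the covariates would matter.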

ndimou commented 10 months ago

Great! Thank you for your help.