privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/

PCA calculation: outlier removal #457

Closed: aepacker closed this issue 5 months ago

aepacker commented 8 months ago

I have been following your tutorial on the steps to perform PCA at https://privefl.github.io/bigsnpr/articles/bedpca.html.

In the final step, “Project remaining individuals”, am I correct in thinking that all remaining individuals are projected, i.e. both the related individuals and the individuals who were identified as outliers in the first PCA and then excluded from the final PCA calculation? If so, should I afterwards remove the participants who were previously identified as PCA outliers?

Working through the materials, I removed outliers with S > 0.275, based on the histogram of “outlierness”. Perhaps I am mistaken, but after looking at other PCA methods I had thought that I would need to exclude these outlier samples when projecting individuals onto my final PCs…

Below are a histogram and two plots of the PC scores coloured by the outlierness statistic:

Thanks in advance!

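[Image: PCA_outlier_hist (histogram of the outlierness statistic)]

[Image: PCA_outlier_scores_scatter (PC scores coloured by outlierness)]

[Image: PCA_outlier_scores_scatter_bin_0275 (PC scores coloured by whether outlierness exceeds 0.275)]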

privefl commented 8 months ago

If it's only 4 individuals, it is probably safe to remove them from all your analyses.
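
For readers following along, here is a minimal base-R sketch of dropping the flagged individuals after the projection step. It is not code from the tutorial. It assumes a matrix `PCs` with one row per individual in the bed file, an index vector `ind.norel` of the unrelated individuals used in the first PCA, and a vector `S` of outlierness values for those individuals. All values below are made-up stand-ins, and the 0.275 threshold is the one mentioned in the question.

```r
# Hypothetical stand-ins for the tutorial's objects (toy data, not real results).
set.seed(1)
n <- 10                                   # total individuals in the bed file
PCs <- matrix(rnorm(n * 3), nrow = n)     # PC scores after projection, one row per individual
ind.norel <- c(1, 2, 3, 4, 6, 7, 8, 10)   # unrelated individuals used in the first PCA

# Outlierness is only defined for individuals included in the first PCA.
S <- rep(NA_real_, n)
S[ind.norel] <- c(0.10, 0.12, 0.31, 0.09, 0.15, 0.40, 0.11, 0.13)

threshold <- 0.275                        # cut-off chosen from the histogram in the question

# Row indices (in the full data) of individuals flagged as outliers.
ind.outlier <- ind.norel[S[ind.norel] > threshold]

# Keep everyone else, including related individuals that were only projected.
ind.final <- setdiff(seq_len(n), ind.outlier)
PCs.final <- PCs[ind.final, , drop = FALSE]

print(ind.outlier)
print(dim(PCs.final))
```

Related individuals (rows 5 and 9 in this toy example) stay in `PCs.final`, because only the rows flagged by the outlierness threshold are removed. The same `ind.final` index could then be used to subset any downstream analysis.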