privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/

Outlier detection scaling by eigenvalue / S vector; scale of PCA scores #520

Open gtdoctor opened 4 hours ago

gtdoctor commented 4 hours ago

Hi Florian. Thanks for this excellent tool.

Regarding sample outlier screening that you suggest in the vignettes (and paper):

prob <- bigutilsr::prob_dist(obj.svd$u, ncores = nb_cores())
S <- prob$dist.self / sqrt(prob$dist.nn)

From my reading of the manual, this calculates the mean Euclidean distance between each sample and its K nearest neighbours, based on the matrix of left singular vectors U (i.e. 'PC scores'); S is this distance normalised against the mean of its local neighbours' own distances. In calculating the distances, equal weight is given to each dimension, without weighting by the eigenvalues/vector D. This has the effect of adding noise from the later dimensions, which explain less of the overall variance in the data. Wouldn't it be preferable to weight the values to avoid this? If I have understood this correctly, could you give your perspective?
(On visual inspection of my PC score plots generated following your suggestions, some 'outliers' identified appeared in the main mass of points in PC1 and PC2, and were apparently outlying only in later dimensions.)
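
To make the question concrete, here is a minimal base-R sketch on simulated data. It is only an illustration under stated assumptions: `mean_knn_dist` is a hypothetical helper written for this example, and it is not the definition used by `bigutilsr::prob_dist`. It compares nearest-neighbour distances computed on U with the same distances computed on U with each column multiplied by its singular value.

```r
# Simulated data; base R only, not bigsnpr/bigutilsr internals.
set.seed(1)
n <- 200; p <- 50; k <- 5
X <- matrix(rnorm(n * p), n, p)
X[, 1] <- X[, 1] * 10                       # one direction with much larger variance
X <- scale(X, center = TRUE, scale = FALSE) # centre the columns

sv <- svd(X, nu = k, nv = k)
U <- sv$u                                   # left singular vectors (unit-norm columns)
scores <- sweep(U, 2, sv$d[1:k], "*")       # U %*% diag(d), i.e. weighted by singular values

# Hypothetical helper: mean Euclidean distance to the K nearest neighbours
mean_knn_dist <- function(M, K = 10) {
  D <- as.matrix(dist(M))
  diag(D) <- Inf                            # exclude each sample's distance to itself
  apply(D, 1, function(row) mean(sort(row)[1:K]))
}

d_unweighted <- mean_knn_dist(U)            # every dimension counts equally
d_weighted   <- mean_knn_dist(scores)       # high-variance dimensions count more

# Rank agreement between the two versions of the distance
cor(d_unweighted, d_weighted, method = "spearman")
```

If the two rankings differ noticeably, samples flagged with the unweighted version may be ones that stand out only in the later dimensions.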

A possibly related point is about generating PC score plots. This seems to be done directly by

plot(obj.svd, type = "scores", scores = 1:10, coeff = 0.4)

even though the documentation suggests that predict() is needed. When I use predict() on obj.svd, I get what appears to be the same plot, with the same scale.
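
For reference, the relationship between U and the projected scores can be checked in base R. This is a sketch on simulated data using `svd()`, not bigsnpr's own `predict()` or `plot()` code: projecting the centred data onto the right singular vectors gives the same result as multiplying each column of U by its singular value, so the two differ only by a per-column scale factor.

```r
# Simulated, column-centred data; base R only.
set.seed(2)
X <- scale(matrix(rnorm(100 * 20), 100, 20), center = TRUE, scale = FALSE)
sv <- svd(X, nu = 3, nv = 3)

U <- sv$u                                   # left singular vectors
scores <- X %*% sv$v                        # data projected onto the first 3 PCs
scaled_U <- sweep(U, 2, sv$d[1:3], "*")     # U %*% diag(d)

# The projection and the rescaled U agree up to numerical tolerance
same <- isTRUE(all.equal(scores, scaled_U))

# Each score column is proportional to the matching column of U
col_cor <- diag(cor(U, scores))
```

Because each column is rescaled by a single constant, a scatter plot of any pair of columns has the same shape in both versions; only the axis scales differ.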

Thanks!

privefl commented 3 hours ago