Hi Florian. Thanks for this excellent tool.

Regarding the sample outlier screening that you suggest in the vignettes (and paper):
prob <- bigutilsr::prob_dist(obj.svd$u, ncores = nb_cores())
S <- prob$dist.self / sqrt(prob$dist.nn)
From my reading of the manual, this calculates the mean Euclidean distance between each sample and its K nearest neighbours, based on the left singular vectors matrix U (i.e. the 'PC scores'); S is this distance normalised against (the square root of) the mean of its local neighbours' own distances.
In calculating the distances, this gives equal weight to every dimension, without weighting by the singular values D. This has the effect of adding noise from the lower dimensions, which explain less of the overall variance in the data. Wouldn't it be preferable to weight the values to avoid this? If I have understood this correctly, could you give your perspective?
(On visual inspection of my PC score plots, generated following your suggestions, some of the 'outliers' identified appeared within the main mass of points in PCs 1 and 2, and were apparently outlying only in the lower dimensions.)
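Concretely, something like the following is what I have in mind: computing the same statistic on the scaled scores UD instead of on U alone (just a sketch; I am assuming predict() returns the PC scores, as below):

PC <- predict(obj.svd)  # PC scores UD, i.e. sweep(obj.svd$u, 2, obj.svd$d, '*')
prob2 <- bigutilsr::prob_dist(PC, ncores = nb_cores())
S2 <- prob2$dist.self / sqrt(prob2$dist.nn)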
A possibly related point is about generating PC score plots. This seems to be done directly by
plot(obj.svd, type = "scores", scores = 1:10, coeff = 0.4)
even though the documentation suggests the need to use predict(). When I use predict() on obj.svd, I get what appears to me to be the same plot, on the same scale.
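For reference, this is roughly how I compared the two (a sketch; the manual plot is my own, not part of the package):

PC <- predict(obj.svd)  # PC scores
plot(PC[, 1], PC[, 2], xlab = "PC1", ylab = "PC2")  # manual score plot
plot(obj.svd, type = "scores")  # package score plot, for comparison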
Yes, this is only using U. This was a long time ago, so I don't remember why I used only U, but there must be a reason. I think most of the time, the outliers appear in later PCs, and this is probably why I don't put more weight on the first PCs for this particular analysis.
predict() gives you the PC scores UD
plot(obj.svd, type = "scores") plots the PC scores UD
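A quick way to check this equivalence (a sketch; check.attributes = FALSE just in case the two results carry different attributes):

scores <- predict(obj.svd)                     # UD
byhand <- sweep(obj.svd$u, 2, obj.svd$d, '*')  # U %*% diag(D), column-wise
all.equal(scores, byhand, check.attributes = FALSE)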
Thanks!