Closed cstrlln closed 3 years ago
Hi @cstrlln ,
This is an interesting issue. When there are small clusters (i.e. fewer cells than dimensions), `getLineages` will use the diagonal covariance to avoid this error, but since it's not doing that, I'm guessing you don't have any clusters smaller than 9 cells?
Generally speaking, I would say that more is better, when it comes to PCs. The "full" PCA is probably several thousand dimensions (I'm guessing?), so that will require using the diagonal covariance, but that shouldn't make a huge difference.
Also, since you mention it, have you checked that you don't have any points with identical coordinates?
Thanks for the fast response! Here are my cluster sizes:

```
   1    2    3    4    5    6    7    8    9
2160 2320  820 2356 3652 1202 1688  389   89
```
It's not a huge dataset.
I was referring to the duplicated or very similar points because of this thread: https://github.com/kstreet13/slingshot/issues/35
I have 8 samples with 2 conditions, for the two conditions I do have replicates (5 in one condition and 3 in the other), so do expect them to be pretty similar.
Actually, I just checked and I don't think there are any duplicated coordinates:

```
> table(duplicated(reducedDims(sling3)$MNN))

FALSE
14676
```
This is what I have in terms of dimensions:
```
> ncol(reducedDims(sling3)$MNN)
[1] 50
> ncol(reducedDims(sling3)$PCA)
[1] 50
```
Thanks for the additional information!
I would expect that you have enough cells per cluster that (even with 50 PCs) you shouldn't run into these sorts of problems. I think the root cause of that other issue you mentioned was that some of the clusters were highly linear, so maybe that could be the next thing to check? You could look at pairs plots, like I did in that thread, or calculate the covariance matrix for each cluster and try to invert it. I'd also be happy to take a quick look at it myself, if you think that would help!
Sorry for the delay. Unfortunately, I don't think my supervisor would allow me to send any data (even with names changed and so on). Would you mind sharing the code you used to analyze the data in that thread? I'm not that experienced. Thank you!
No worries!
You can check whether each cluster's covariance matrix is invertible with the following code (assuming the MNN matrix is called `mnn` and the vector of cluster labels is called `clus`):
```r
for(cl in unique(clus)){
  solve(cov(mnn[clus == cl, ]))
}
```
Technically, we're interested in pairs of clusters, but this could help us identify some potential issues.
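Note that this loop stops at the first failure, so it only reveals one problematic cluster. A small base-R variant using `try()` collects the result for every cluster; here `mnn` and `clus` are simulated so the snippet runs on its own (substitute your own objects):

```r
# Simulate toy data: two clusters, one of which is rank-deficient
set.seed(1)
mnn <- matrix(rnorm(200 * 5), ncol = 5)
mnn[1:100, 5] <- mnn[1:100, 1] + mnn[1:100, 2]   # cluster 1: column 5 is a linear combination
clus <- rep(1:2, each = 100)

# TRUE = that cluster's covariance matrix could not be inverted
bad <- sapply(sort(unique(clus)), function(cl) {
  inherits(try(solve(cov(mnn[clus == cl, ])), silent = TRUE), "try-error")
})
names(bad) <- sort(unique(clus))
bad
```

This reports every singular cluster in one pass instead of erroring out on the first one.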
And the pairs plot uses the `pairs` function, so to generate that for cluster 1 (for example), you could use:
```r
cl <- 1
pairs(mnn[clus == cl, 1:10])
```
(I'm only plotting the first 10 dimensions because a 50 x 50 grid of plots probably wouldn't be that useful).
Thanks for the code!
I had to modify it a bit to extract the info from the SCE object, but this is what I get with the first part:

```r
for(cl in unique(sce4$clusters_9)){
  solve(cov(reducedDims(sce4[, sce4$clusters_9 == cl])$MNN))
}
```

```
Error in solve.default(cov(reducedDims(sce4[, sce4$clusters_9 == cl])$MNN)) :
  system is computationally singular: reciprocal condition number = 1.7136e-18
```
I'm guessing this means it is not invertible?
I tried the same with PCA, for example:

```r
for(cl in unique(sce4$clusters_9)){
  solve(cov(reducedDims(sce4[, sce4$clusters_9 == cl])$PCA))
}
```

and it ran without problems.
Regarding the plots, I made them for all 9 clusters and didn't see anything as bad as the one in issue #35 above.
Yeah, that error means that (at least) one cluster has a non-invertible covariance matrix (the value of `cl` when you get the error is the problematic cluster).
And maybe the issue is not in the top 10 dimensions, so we could try visualizing the full correlation matrix for the non-invertible clusters, instead. Something like this should work, for a given cluster:
```r
cl <- 1
x <- reducedDims(sce4)$MNN[sce4$clusters_9 == cl, ]
heatmap(cor(x), symm = TRUE, Rowv = NA, Colv = NA)
```
The `cl` value when I get the error is cluster 7. However, this is just because it is the first cluster to be evaluated:
```
> unique(sce4$clusters_9)
[1] 7 1 4 2 3 5 6 8 9
Levels: 1 2 3 4 5 6 7 8 9
```
I just checked all clusters without a loop, and it looks like they are all non-invertible:
```
> solve(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 1, ]))
Error in solve.default(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 1, :
  system is computationally singular: reciprocal condition number = 2.40021e-19
> solve(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 2, ]))
Error in solve.default(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 2, :
  system is computationally singular: reciprocal condition number = 2.56005e-19
> solve(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 3, ]))
Error in solve.default(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 3, :
  system is computationally singular: reciprocal condition number = 6.1804e-20
> solve(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 4, ]))
Error in solve.default(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 4, :
  system is computationally singular: reciprocal condition number = 1.21569e-18
> solve(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 5, ]))
Error in solve.default(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 5, :
  system is computationally singular: reciprocal condition number = 2.06702e-18
> solve(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 6, ]))
Error in solve.default(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 6, :
  system is computationally singular: reciprocal condition number = 6.83631e-19
> solve(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 7, ]))
Error in solve.default(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 7, :
  system is computationally singular: reciprocal condition number = 1.7136e-18
> solve(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 8, ]))
Error in solve.default(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 8, :
  system is computationally singular: reciprocal condition number = 3.47327e-19
> solve(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 9, ]))
Error in solve.default(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 9, :
  system is computationally singular: reciprocal condition number = 5.85048e-19
```
Attached are the heatmap images for clusters 1-9 (cl1-cl9).
Hmm, this is very strange. My linear algebra is a little rusty, but none of these correlation matrices look like they would be problematic, yet somehow none of them are invertible? I'm honestly not sure what's going on here.
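For intuition, here is a small base-R illustration (on simulated data) of how a covariance matrix can be singular even though no pair of dimensions looks suspiciously correlated in a heatmap: a linear combination of several columns creates singularity while every pairwise correlation stays moderate.

```r
set.seed(1)
n <- 1000
x <- matrix(rnorm(n * 4), ncol = 4)
x <- cbind(x, rowSums(x))   # column 5 = x1 + x2 + x3 + x4

# Largest off-diagonal correlation is only ~0.5 -- nothing looks alarming
max(abs(cor(x)[upper.tri(cor(x))]))

# ...yet the covariance matrix is rank 4 of 5, hence not invertible
qr(cov(x))$rank
```

So a benign-looking correlation heatmap does not rule out an exact (or near-exact) linear dependency spread across many dimensions.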
Nonetheless, I think there are a couple of ways to proceed without solving the issue. The most straightforward is probably to use the diagonal covariance rather than the full covariance when calculating distances between clusters. You'll need to supply it as an argument, since this situation doesn't fit the requirements for `slingshot` to use it by default. Basically, that will mean defining the function before running slingshot and using the `dist.fun` argument:
```r
library(matrixStats)  # for colWeightedMeans

dist_clusters_diag <- function(X, w1, w2){
  if(length(w1) != nrow(X) | length(w2) != nrow(X)){
    stop("Reduced dimensional matrix and weights vector contain different numbers of points.")
  }
  mu1 <- colWeightedMeans(X, w = w1)
  mu2 <- colWeightedMeans(X, w = w2)
  diff <- mu1 - mu2
  if(sum(w1 > 0) == 1){
    s1 <- diag(ncol(X))
  }else{
    s1 <- diag(diag(cov.wt(X, wt = w1)$cov))
  }
  if(sum(w2 > 0) == 1){
    s2 <- diag(ncol(X))
  }else{
    s2 <- diag(diag(cov.wt(X, wt = w2)$cov))
  }
  return(as.numeric(t(diff) %*% solve(s1 + s2) %*% diff))
}

slingshot(sce4, ..., dist.fun = dist_clusters_diag)
```
The other possibility, as you mentioned previously, would be to use fewer dimensions (you said it works with 9 dimensions, but you could try going higher until you run into these issues again).
Hi, sorry for the really long delay. I actually tried different numbers of dimensions and, interestingly, it works up to 1:49 out of 50. It is only when including all dimensions that I get the error; with one fewer it works. I checked the inferred lineages and don't really see much difference from adding more dimensions.
Hmm, that is very strange. I don't know why it wouldn't work with all 50 PCs, but using the first 49 should be almost identical (by that point, each additional PC probably isn't adding much). For visualization, you may find the `embedCurves` function useful, which will allow you to represent the curves in a low-dimensional (e.g. UMAP) space.
Hi, @kstreet13. The interesting thing is that I also ran into this problem, and I find that if I use the first 49 dimensions, it works. But if I use the full 50, it reports:

```
Error in solve.default(s1 + s2) :
  system is computationally singular: reciprocal condition number = 7.11528e-18
```
That is so strange. I still haven't figured out what's causing this. Could either of you share the code you used to get from the counts matrix to the principal components?
I added `mergeCell` to one of my GitHub repos:
https://github.com/shangguandong1996/slingshot_test_data/blob/main/mergeCell.rds
```r
library(scran)
library(batchelor)
library(bluster)
library(slingshot)

mergeCell <- readRDS("~/newProject/SGD_total_Fig/callus_scRNA/CIM1d/result/mergeCell.rds")
dec.sce <- modelGeneVar(mergeCell, block = mergeCell$batch)
mergeCell <- multiBatchNorm(mergeCell, batch = mergeCell$batch)
chosen.hvgs <- getTopHVGs(dec.sce, n = 2000)

set.seed(20160806)
mnn.out <- fastMNN(mergeCell,
                   batch = mergeCell$batch,
                   merge.order = list(
                     list("CIM0d_rep1", "CIM0d_rep2"),
                     list("CIM1d_rep1", "CIM1d_rep2")),
                   subset.row = chosen.hvgs,
                   BSPARAM = BiocSingular::RandomParam())

clusters.mnn.50 <- clusterCells(mnn.out,
                                use.dimred = "corrected",
                                BLUSPARAM = NNGraphParam(k = 50,
                                                         cluster.fun = "louvain",
                                                         type = "jaccard"))
colLabels(mnn.out) <- clusters.mnn.50

sce.sling <- slingshot(mnn.out,
                       reducedDim = 'corrected',
                       cluster = colLabels(mnn.out),
                       omega = TRUE,
                       start.clus = 2)
```

```
Error in solve.default(s1 + s2) :
  system is computationally singular: reciprocal condition number = 1.23506e-17
```
Ok, well, as I said earlier, my linear algebra is pretty rusty, but this definitely looks like some sort of weird issue with the rank of the covariance matrices. I don't know enough about `modelGeneVar`, `multiBatchNorm`, and `fastMNN` to know where this issue is being introduced, but it's pretty easy to verify that the covariance matrix of the "corrected" matrix is not full rank:
```
> dim(cov(reducedDim(mnn.out, 'corrected')))
[1] 50 50
> Matrix::rankMatrix(cov(reducedDim(mnn.out, 'corrected')))
[1] 49
attr(,"method")
[1] "tolNorm2"
attr(,"useGrad")
[1] FALSE
attr(,"tol")
[1] 1.110223e-14
```
The covariance matrices of all the clusters are similarly not full-rank, hence why they can't be inverted (I think; again, very rusty).
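The per-cluster version of that check can also be done in base R with `qr()`, which reports the numerical rank directly. A sketch on simulated data (substitute the actual corrected matrix and cluster labels for `x` and `clus`):

```r
# Simulate three clusters that all share one linearly dependent dimension
set.seed(1)
x <- matrix(rnorm(300 * 4), ncol = 4)
x <- cbind(x, x[, 1] - x[, 2])   # column 5 depends linearly on columns 1 and 2
clus <- rep(1:3, each = 100)

# Print the numerical rank of each cluster's covariance matrix
for (cl in sort(unique(clus))) {
  S <- cov(x[clus == cl, ])
  cat("cluster", cl, ": rank", qr(S)$rank, "of", ncol(S), "\n")
}
```

Any cluster whose rank is below the number of dimensions will have a non-invertible covariance matrix.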
My best guess is that there is some sort of redundancy in calling `multiBatchNorm` followed by `fastMNN`, which calls `multiBatchPCA`. If these steps are correcting for batch effects twice, then maybe that introduces some sort of linear dependency in the output? @LTLA, any thoughts?
Yes, batchelor's `fastMNN` does an orthogonalization step to remove "kissing" effects between batches. This involves finding the average batch vector and removing all variance along that vector. So it would make sense that you effectively lose one dimension there. I avoid this step in an updated C++ version of the algorithm, but I haven't had the time to figure out how to propagate that update to the R package.
... I guess you could just do another PCA on the corrected PCs, take the top n-1, and use that instead? I don't recall how the covariance matrices were used here; I vaguely remember something about edge weights and the MST. If so, it might be easier to switch to one of the more primitive weighting schemes.
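That re-PCA workaround could be sketched as follows, on simulated data; in practice, `corrected` would be `reducedDim(mnn.out, "corrected")` and the repaired matrix would be stored back in the object before running slingshot.

```r
# Simulate a rank-deficient "corrected" matrix, as produced by orthogonalization
set.seed(1)
corrected <- matrix(rnorm(500 * 10), ncol = 10)
v <- rnorm(10)
v <- v / sqrt(sum(v^2))
corrected <- corrected - (corrected %*% v) %*% t(v)   # rank 9 of 10

# Re-run PCA and drop the final (zero-variance) component
pc <- prcomp(corrected)
repaired <- pc$x[, 1:(ncol(corrected) - 1)]

qr(cov(repaired))$rank   # 9 of 9: full rank, so the covariance is invertible again
```

Because PCA rotates the data into uncorrelated directions, the dependency is isolated in the last component, and dropping it restores invertibility without discarding any real variation.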
It seems there is no `dist.fun` argument for the slingshot function now; are there any alternative methods to resolve this issue?
That argument was renamed `dist.method`, and there are now several built-in options, since we started using the `createClusterMST` function from `TrajectoryUtils`. To use the diagonal covariance as above, you can set `dist.method = "scaled.diag"`, but I would recommend the nearest neighbor-based distance method, `"mnn"`.
Hello, I've got a dataset that contains samples from 2 different conditions, with males and females. I used fastMNN to correct for any batch effect. When using slingshot on the full MNN matrix I get this error:

```
Using full covariance matrix
Error in solve.default(s1 + s2) :
  system is computationally singular: reciprocal condition number = 1.52312e-17
```

I saw in other issue threads that this can happen when points have the same coordinates.
My data seems to be divided into 9 clusters, so I decided to give it a try and use only 9 dimensions of MNN, i.e. MNN[, 1:9], and it works pretty well. I get reasonable lineage predictions. I'm just wondering: is this ok? Or should I always use the full dimensions of the matrix? For example, would you use PCA[, 1:9] instead of the full PCA?