Closed cstrlln closed 3 years ago
Hi @cstrlln ,
This is an interesting issue. When there are small clusters (i.e. fewer cells than dimensions), `getLineages` will use the diagonal covariance to avoid this error, but since it's not doing that, I'm guessing you don't have any clusters smaller than 9 cells?
Generally speaking, I would say that more is better, when it comes to PCs. The "full" PCA is probably several thousand dimensions (I'm guessing?), so that will require using the diagonal covariance, but that shouldn't make a huge difference.
Also, since you mention it, have you checked that you don't have any points with identical coordinates?
Thanks for the fast response! Here are my cluster sizes:

```
   1    2    3    4    5    6    7    8    9
2160 2320  820 2356 3652 1202 1688  389   89
```
It's not a huge dataset.
I was referring to the duplicated or very similar points because of this thread: https://github.com/kstreet13/slingshot/issues/35
I have 8 samples with 2 conditions, for the two conditions I do have replicates (5 in one condition and 3 in the other), so do expect them to be pretty similar.
Actually, I just checked and I don't think there are any duplicated coordinates:

```
> table(duplicated(reducedDims(sling3)$MNN))

FALSE
14676
```
This is what I have in terms of dimensions:
```
> ncol(reducedDims(sling3)$MNN)
[1] 50
> ncol(reducedDims(sling3)$PCA)
[1] 50
```
Thanks for the additional information!
I would expect that you have enough cells per cluster that (even with 50 PCs) you shouldn't run into these sorts of problems. I think the root cause of that other issue you mentioned was that some of the clusters were highly linear, so maybe that could be the next thing to check? You could look at pairs plots, like I did in that thread, or calculate the covariance matrix for each cluster and try to invert it. I'd also be happy to take a quick look at it myself, if you think that would help!
Sorry for the delay. Unfortunately, I don't think my supervisor would allow me to send any data (even with names changed and so on). Would you mind sharing the code you used to analyze the data in that thread? I'm not that experienced. Thank you!
No worries!
You can check whether each cluster's covariance matrix is invertible with the following code (assuming the MNN matrix is called `mnn` and the vector of cluster labels is called `clus`):
```r
for(cl in unique(clus)){
  solve(cov(mnn[clus == cl, ]))
}
```
Technically, we're interested in pairs of clusters, but this could help us identify some potential issues.
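Note that this loop stops at the first failure, so it only reveals one problematic cluster. A small base-R variant using `try()` collects the result for every cluster; here `mnn` and `clus` are simulated so the snippet runs on its own (substitute your own objects):

```r
# Simulate toy data: two clusters, one of which is rank-deficient
set.seed(1)
mnn <- matrix(rnorm(200 * 5), ncol = 5)
mnn[1:100, 5] <- mnn[1:100, 1] + mnn[1:100, 2]   # cluster 1: column 5 is a linear combination
clus <- rep(1:2, each = 100)

# TRUE = that cluster's covariance matrix could not be inverted
bad <- sapply(sort(unique(clus)), function(cl) {
  inherits(try(solve(cov(mnn[clus == cl, ])), silent = TRUE), "try-error")
})
names(bad) <- sort(unique(clus))
bad
```

This reports every singular cluster in one pass instead of erroring out on the first one.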
And the pairs plot uses the `pairs` function, so to generate that for cluster 1 (for example), you could use:
```r
cl <- 1
pairs(mnn[clus == cl, 1:10])
```
(I'm only plotting the first 10 dimensions because a 50 x 50 grid of plots probably wouldn't be that useful).
Thanks for the code!
I had to modify it a bit to extract the info from the SCE object, but this is what I get with the first part:

```r
for(cl in unique(sce4$clusters_9)){
  solve(cov(reducedDims(sce4[, sce4$clusters_9 == cl])$MNN))
}
```

```
Error in solve.default(cov(reducedDims(sce4[, sce4$clusters_9 == cl])$MNN)) :
  system is computationally singular: reciprocal condition number = 1.7136e-18
```
I'm guessing this means it is not invertible?
I tried the same with PCA, for example:

```r
for(cl in unique(sce4$clusters_9)){
  solve(cov(reducedDims(sce4[, sce4$clusters_9 == cl])$PCA))
}
```

and it ran without problems.
Regarding the plots, I made them for all 9 clusters and didn't see anything as bad as the one in issue #35 above.
Yeah, that error means that (at least) one cluster has a non-invertible covariance matrix (the value of `cl` when you get the error is the problematic cluster).
And maybe the issue is not in the top 10 dimensions, so we could try visualizing the full correlation matrix for the non-invertible clusters, instead. Something like this should work, for a given cluster:
```r
cl <- 1
x <- reducedDims(sce4)$MNN[sce4$clusters_9 == cl, ]
heatmap(cor(x), symm = TRUE, Rowv = NA, Colv = NA)
```
The `cl` value when I get the error is cluster 7. However, this is just because it is the first cluster to be evaluated:
```
> unique(sce4$clusters_9)
[1] 7 1 4 2 3 5 6 8 9
Levels: 1 2 3 4 5 6 7 8 9
```
I just checked all clusters without a loop, and it looks like they are all non-invertible:
```
> solve(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 1, ]))
Error in solve.default(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 1, :
  system is computationally singular: reciprocal condition number = 2.40021e-19
> solve(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 2, ]))
Error in solve.default(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 2, :
  system is computationally singular: reciprocal condition number = 2.56005e-19
> solve(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 3, ]))
Error in solve.default(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 3, :
  system is computationally singular: reciprocal condition number = 6.1804e-20
> solve(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 4, ]))
Error in solve.default(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 4, :
  system is computationally singular: reciprocal condition number = 1.21569e-18
> solve(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 5, ]))
Error in solve.default(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 5, :
  system is computationally singular: reciprocal condition number = 2.06702e-18
> solve(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 6, ]))
Error in solve.default(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 6, :
  system is computationally singular: reciprocal condition number = 6.83631e-19
> solve(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 7, ]))
Error in solve.default(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 7, :
  system is computationally singular: reciprocal condition number = 1.7136e-18
> solve(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 8, ]))
Error in solve.default(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 8, :
  system is computationally singular: reciprocal condition number = 3.47327e-19
> solve(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 9, ]))
Error in solve.default(cov(reducedDims(sce4)$MNN[sce4$clusters_9 == 9, :
  system is computationally singular: reciprocal condition number = 5.85048e-19
```
Attached are the heatmap images for clusters 1-9 (cl1-cl9).
Hmm, this is very strange. My linear algebra is a little rusty, but none of these correlation matrices look like they would be problematic, yet somehow none of them are invertible? I'm honestly not sure what's going on here.
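For intuition, here is a small base-R illustration (on simulated data) of how a covariance matrix can be singular even though no pair of dimensions looks suspiciously correlated in a heatmap: a linear combination of several columns creates singularity while every pairwise correlation stays moderate.

```r
set.seed(1)
n <- 1000
x <- matrix(rnorm(n * 4), ncol = 4)
x <- cbind(x, rowSums(x))   # column 5 = x1 + x2 + x3 + x4

# Largest off-diagonal correlation is only ~0.5 -- nothing looks alarming
max(abs(cor(x)[upper.tri(cor(x))]))

# ...yet the covariance matrix is rank 4 of 5, hence not invertible
qr(cov(x))$rank
```

So a benign-looking correlation heatmap does not rule out an exact (or near-exact) linear dependency spread across many dimensions.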
Nonetheless, I think there are a couple of ways to proceed without solving the issue. The most straightforward is probably to use the diagonal covariance rather than the full covariance when calculating distances between clusters. You'll need to supply it as an argument, since this situation doesn't fit the requirements for `slingshot` to use it by default. Basically, that will mean defining the function before running slingshot and using the `dist.fun` argument:
```r
library(matrixStats)  # for colWeightedMeans

dist_clusters_diag <- function(X, w1, w2){
  if(length(w1) != nrow(X) | length(w2) != nrow(X)){
    stop("Reduced dimensional matrix and weights vector contain different numbers of points.")
  }
  mu1 <- colWeightedMeans(X, w = w1)
  mu2 <- colWeightedMeans(X, w = w2)
  diff <- mu1 - mu2
  if(sum(w1 > 0) == 1){
    s1 <- diag(ncol(X))
  }else{
    s1 <- diag(diag(cov.wt(X, wt = w1)$cov))
  }
  if(sum(w2 > 0) == 1){
    s2 <- diag(ncol(X))
  }else{
    s2 <- diag(diag(cov.wt(X, wt = w2)$cov))
  }
  return(as.numeric(t(diff) %*% solve(s1 + s2) %*% diff))
}

slingshot(sce4, ..., dist.fun = dist_clusters_diag)
```
The other possibility, as you mentioned previously, would be to use fewer dimensions (you said it works with 9 dimensions, but you could try going higher until you run into these issues again).
Hi, sorry for the really long delay. I actually tried different numbers of dimensions and, interestingly, it works up to 1:49 out of 50. It is only when including all dimensions that I get the error; with one fewer it works. I checked the inferred lineages and don't really see much difference from adding more dimensions.
Hmm, that is very strange. I don't know why it wouldn't work with all 50 PCs, but using the first 49 should be almost identical (by that point, each additional PC probably isn't adding much). For visualization, you may find the `embedCurves` function useful, which will allow you to represent the curves in a low-dimensional (e.g. UMAP) space.
Hi, @kstreet13. The interesting thing is that I also ran into this problem, and I find that if I use the first 49 dimensions, it works. But if I use the full 50, it reports:

```
Error in solve.default(s1 + s2) :
  system is computationally singular: reciprocal condition number = 7.11528e-18
```
That is so strange. I still haven't figured out what's causing this. Could either of you share the code you used to get from the counts matrix to the principal components?
I added `mergeCell` to one of my GitHub repos:
https://github.com/shangguandong1996/slingshot_test_data/blob/main/mergeCell.rds
```r
library(scran)
library(batchelor)
library(bluster)
library(slingshot)

mergeCell <- readRDS("~/newProject/SGD_total_Fig/callus_scRNA/CIM1d/result/mergeCell.rds")
dec.sce <- modelGeneVar(mergeCell, block = mergeCell$batch)
mergeCell <- multiBatchNorm(mergeCell, batch = mergeCell$batch)
chosen.hvgs <- getTopHVGs(dec.sce, n = 2000)

set.seed(20160806)
mnn.out <- fastMNN(mergeCell,
                   batch = mergeCell$batch,
                   merge.order = list(
                     list("CIM0d_rep1", "CIM0d_rep2"),
                     list("CIM1d_rep1", "CIM1d_rep2")),
                   subset.row = chosen.hvgs,
                   BSPARAM = BiocSingular::RandomParam())

clusters.mnn.50 <- clusterCells(mnn.out,
                                use.dimred = "corrected",
                                BLUSPARAM = NNGraphParam(k = 50,
                                                         cluster.fun = "louvain",
                                                         type = "jaccard"))
colLabels(mnn.out) <- clusters.mnn.50

sce.sling <- slingshot(mnn.out,
                       reducedDim = 'corrected',
                       cluster = colLabels(mnn.out),
                       omega = TRUE,
                       start.clus = 2)
```

```
Error in solve.default(s1 + s2) :
  system is computationally singular: reciprocal condition number = 1.23506e-17
```
Ok, well, as I said earlier, my linear algebra is pretty rusty, but this definitely looks like some sort of weird issue with the rank of the covariance matrices. I don't know enough about `modelGeneVar`, `multiBatchNorm`, and `fastMNN` to know where this issue is being introduced, but it's pretty easy to verify that the covariance matrix of the "corrected" matrix is not full rank:
```
> dim(cov(reducedDim(mnn.out, 'corrected')))
[1] 50 50
> Matrix::rankMatrix(cov(reducedDim(mnn.out, 'corrected')))
[1] 49
attr(,"method")
[1] "tolNorm2"
attr(,"useGrad")
[1] FALSE
attr(,"tol")
[1] 1.110223e-14
```
The covariance matrices of all the clusters are similarly not full-rank, hence why they can't be inverted (I think; again, very rusty).
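The per-cluster version of that check can also be done in base R with `qr()`, which reports the numerical rank directly. A sketch on simulated data (substitute the actual corrected matrix and cluster labels for `x` and `clus`):

```r
# Simulate three clusters that all share one linearly dependent dimension
set.seed(1)
x <- matrix(rnorm(300 * 4), ncol = 4)
x <- cbind(x, x[, 1] - x[, 2])   # column 5 depends linearly on columns 1 and 2
clus <- rep(1:3, each = 100)

# Print the numerical rank of each cluster's covariance matrix
for (cl in sort(unique(clus))) {
  S <- cov(x[clus == cl, ])
  cat("cluster", cl, ": rank", qr(S)$rank, "of", ncol(S), "\n")
}
```

Any cluster whose rank is below the number of dimensions will have a non-invertible covariance matrix.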
My best guess is that there is some sort of redundancy in calling `multiBatchNorm` followed by `fastMNN`, which calls `multiBatchPCA`. If these steps are correcting for batch effects twice, then maybe that introduces some sort of linear dependency in the output? @LTLA, any thoughts?
Yes, batchelor's `fastMNN` does an orthogonalization step to remove "kissing" effects between batches. This involves finding the average batch vector and removing all variance along that vector. So it would make sense that you effectively lose one dimension there. I avoid this step in an updated C++ version of the algorithm, but I haven't had the time to figure out how to propagate that update to the R package.
... I guess you could just do another PCA on the corrected PCs, take the top n-1, and use that instead? I don't recall how the covariance matrices were used here; I vaguely remember something about edge weights and the MST. If so, it might be easier to switch to one of the more primitive weighting schemes.
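That re-PCA workaround could be sketched as follows, on simulated data; in practice, `corrected` would be `reducedDim(mnn.out, "corrected")` and the repaired matrix would be stored back in the object before running slingshot.

```r
# Simulate a rank-deficient "corrected" matrix, as produced by orthogonalization
set.seed(1)
corrected <- matrix(rnorm(500 * 10), ncol = 10)
v <- rnorm(10)
v <- v / sqrt(sum(v^2))
corrected <- corrected - (corrected %*% v) %*% t(v)   # rank 9 of 10

# Re-run PCA and drop the final (zero-variance) component
pc <- prcomp(corrected)
repaired <- pc$x[, 1:(ncol(corrected) - 1)]

qr(cov(repaired))$rank   # 9 of 9: full rank, so the covariance is invertible again
```

Because PCA rotates the data into uncorrelated directions, the dependency is isolated in the last component, and dropping it restores invertibility without discarding any real variation.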
It seems there is no `dist.fun` argument for the slingshot function now; are there any alternative methods to resolve this issue?
That argument was renamed `dist.method`, and there are now several built-in options, since we started using the `createClusterMST` function from `TrajectoryUtils`. To use the diagonal covariance as above, you can set `dist.method = "scaled.diag"`, but I would recommend the nearest neighbor-based distance method, `"mnn"`.
Hello, I've got a dataset that contains samples from 2 different conditions, with males and females. I used fastMNN to correct for any batch effect. When using slingshot on the full MNN matrix I get this error:

```
Using full covariance matrix
Error in solve.default(s1 + s2) :
  system is computationally singular: reciprocal condition number = 1.52312e-17
```

I saw in other issue threads that this can happen when points have the same coordinates.
My data seems to be divided into 9 clusters, so I decided to give it a try and use only 9 dimensions of MNN, i.e. MNN[, 1:9], and it works pretty well. I get reasonable lineage predictions. I'm just wondering: is this ok? Or should I always use the full dimensions of the matrix? For example, would you use PCA[, 1:9] instead of the full PCA?