getCurves() Missing values

kstreet13 / slingshot

Functions for identifying and characterizing continuous developmental trajectories in single-cell data.

259 stars 42 forks source link

getCurves() Missing values #230

Closed msclz closed 6 months ago

msclz commented 9 months ago

Hello ! First of all, I'd like to thank you for this really nice package.

I am not a trained bioinformatician, and this is my first time posting on github, so I apologize in advance if I am not being super clear.

I am trying to get some curves out of my lineages object:

lineages class: PseudotimeOrdering dim: 5510 3 metadata(3): lineages mst slingParams pathStats(2): pseudotime weights cellnames(5510): AACACGTGTTAAAGAC-1_1_1 AAGGTTCTCTCAACTT-1_1_1 ... Donor1.TTTGTCAAGAACTCGG.2_15 Donor1.TTTGTCATCCAAACTG.2_15 cellData names(2): reducedDim clusterLabels pathnames(3): Lineage1 Lineage2 Lineage3 pathData names(0)

Also, I am using the version slingshot_2.8.0

However, when I applied the getCurves function, I have this first error:

curves <-getCurves(lineages) Curves for Lineage3 and average1 appear to be going in opposite directions. No longer forcing them to share an initial point. To manually override this, set allow.breaks = FALSE. Error in vapply(seq_len(p), function(jj) { : values must be length 150, but FUN(X[[1]]) result is length 5510 In addition: Warning messages: 1: In avg.jj pct : longer object length is not a multiple of shorter object length 2: In orig.jj (1 - pct) : longer object length is not a multiple of shorter object length

Then, by setting the parameters 'allow.breaks' and 'approx_points' to 'FALSE', I removed that previous error, but I am now facing this message:

curves <-getCurves(lineages, allow.breaks = FALSE, approx_points = FALSE) Error in if (box.vals[1] == box.vals[5]) { : missing value where TRUE/FALSE needed

From that point, I am kind of lost, Would you have any idea on how to fix this?

Thanks again,

kstreet13 commented 9 months ago

Hi @msclz,

Thanks for the report! I don't think I've seen this issue before, but hopefully we can get to the bottom of it.

What code did you use to create the lineages object? I assume it was getLineages, but seeing the specific line of code might be helpful.

I think setting allow.breaks to FALSE was a good call, but I would generally advise using approx.points, as it significantly speeds up the computation, in most cases.

What are the sizes of your input clusters? A lot of issues seem to come down to a very small input cluster (1 or 2 cells) or NA values in the vector of cluster labels. If you had a vector of cluster labels, can you show the output of table(<clus.vec>)?

Best, Kelly

msclz commented 9 months ago

Hi, So I have used this coding to get my lineages:

dimred <- data.log.GC@reductions$umap@cell.embeddings
clustering <- data.log.GC$integrated_snn_res.0.2

lineages <- getLineages(data = dimred, 
                        clusterLabels = clustering, 
                        reducedDim="umap",
                        start.clus="9", 
                        end.clus="12")

I generated the clustering object starting from a subset from a bigger Seurat object as you can see above. Here is what I get if I look for any N/As in the clustering vector:

> any(is.na(clustering))
[1] FALSE
> length(clustering)
[1] 5510
> levels(clustering)
 [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"

sames goes if I ask for NAs in the Lineages object, the return is FALSE.

Thanks again for helping me sorting this, Math

msclz commented 9 months ago

Oh, sorry I forgot to add in my previous message, If I set approx_points to a certain value, I get this new error message when running getCurves

> curves <-getCurves(lineages,  allow.breaks = FALSE, approx_points = 300)
Error in vapply(seq_len(p), function(jj) { : values must be length 300,
 but FUN(X[[1]]) result is length 5510
In addition: Warning messages:
1: In avg.jj * pct :
  longer object length is not a multiple of shorter object length
2: In orig.jj * (1 - pct) :
  longer object length is not a multiple of shorter object length

Sorry for the spam

msclz commented 9 months ago

Ok, sorry again for coming back at you, but I am making (some?) progress. Apparently, removing the argument start.clus and end.clusduring the generation of lineages allow for the code to run and for me to plot the curves.

> lineages <- getLineages(data = dimred, 
+                         clusterLabels = clustering, 
+                         reducedDim="umap",
+                         #start.clus="9", 
+                         #end.clus="12",
+                         )
> curves <-getCurves(lineages,  allow.breaks = FALSE, approx_points = F)
> curves
class: PseudotimeOrdering 
dim: 5510 2 
metadata(4): lineages mst slingParams curves
pathStats(2): pseudotime weights
cellnames(5510): AACACGTGTTAAAGAC-1_1_1 AAGGTTCTCTCAACTT-1_1_1 ... Donor1.TTTGTCAAGAACTCGG.2_15 Donor1.TTTGTCATCCAAACTG.2_15
cellData names(2): reducedDim clusterLabels
pathnames(2): Lineage1 Lineage2
pathData names(0):

However, as you can imagine, the lines are kind of all over the place now... Capture d’écran, le 2023-09-18 à 11 31 21

kstreet13 commented 9 months ago

Hi @msclz,

Thanks for the updates!

One thing I notice is that you don't need the argument reducedDim = "umap" when you run getLineages, since you are directly supplying the matrix of coordinates. That argument is for when the input data is a SingleCellExperiment object and you want to use a dimensionality reduction from the reducedDims slot. This might be the issue and if so, I'll need to do something about it.

Otherwise, it still might help to see table(clustering), as factors can sometimes be weird, especially when working with a subset of a larger dataset (factors will "remember" all the original levels, even if you subset down to remove a few).

And I'm not sure if it will help, but another thing you might try would be plotting the MST after you run getLineages with start.clus = "9". You can do this the same way you'd plot the curves, but with type = "l" (for "lineages").

Best, Kelly

msclz commented 9 months ago

Hi @kstreet13

So, I removed the reducedDim argument, and kept my defining stat.clus and end.clus arguments, here is what I get when I plot the lineages.

> lineages <- getLineages(data = dimred, 
+                         clusterLabels = clustering, 
+                         #reducedDim="umap",
+                         start.clus="9", 
+                         end.clus="12",
+                         )
> 
> plot(dimred, col=pal[clustering], cex = 0.5, pch = 16)
> for (i in levels(clustering)) {
+   text(mean(dimred[clustering == i, 1])+0.2, mean(dimred[clustering == i, 2])+1.5, labels = i, font = 2, cex=2)
+ } #add cluster labels
> lines(SlingshotDataSet(lineages), lwd = 3, col = "black")

Capture d’écran, le 2023-09-19 à 09 43 45 I do find my three lineages: 9-1/9-0-2-11/9-0-2-4-3-10-8-7-6-5-12

So I don't think the reduceDim argument is in question here, I do have the same errors for the curves whever or not it is present in the generationof the object lineages.

Regarding the table(clustering), I also don't see anything weird from the output:

> table(clustering)
clustering
   0    1    2    3    4    5    6    7    8    9   10   11   12 
1150  666  665  468  437  418  367  327  255  241  200  158  158 
>

One thing I noticed though: the getCurves fonction is not working only when both start.clus and end.clus arguments are used in the generation of lineages.

If I define only a start.clus="9", I get this result, which is better than previsouly but does not seem too match what plotting the MST gave me just prior with cluster 9 as the starting point...

> lineages <- getLineages(data = dimred, 
+                         clusterLabels = clustering, 
+                         #reducedDim="umap",
+                         start.clus="9", 
+                         #end.clus="12",
+                         )
> curves <-getCurves(lineages,  allow.breaks = FALSE, approx_points = FALSE)
> plot(dimred, col=pal[clustering], cex = 0.5, pch = 16)
> for (i in levels(clustering)) {
+   text(mean(dimred[clustering == i, 1])+0.2, mean(dimred[clustering == i, 2])+1.5, labels = i, font = 2, cex=2)
+ } #add cluster labels
> lines(SlingshotDataSet(curves), lwd = 3, col = "black")
>

Capture d’écran, le 2023-09-19 à 09 52 22

Thanks again for taking the time to help me out, Math

msclz commented 9 months ago

Hi @kstreet13, I am still trying to play around with the code and trying to understand where that initial error is coming from

curves <-getCurves(lineages, allow.breaks = FALSE, approx_points = FALSE)
Error in if (box.vals[1] == box.vals[5]) { :
missing value where TRUE/FALSE needed

My two main questions are 1- why does it seems that specifying a both start.clus and end.clus get me this error, 2- why when I only specify the first start cluster, the root of the curves does not seem to go on that specified cluster...

On that line, my latest (baby) step; apparently when I used the omega = T argument, now weirldy the getCurves can function with both start.clusand end.clus....

> lineages <- getLineages(data = dimred, 
+                         clusterLabels = clustering, 
+                         #reducedDim="umap",
+                         start.clus="9", 
+                         end.clus=c("12","1","11"), omega=T
+                         )
> 
> curves <-getCurves(lineages,  allow.breaks = FALSE, approx_points = F)
> 
> plot(dimred, col=pal[clustering], cex = 0.5, pch = 16)
> for (i in levels(clustering)) {
+   text(mean(dimred[clustering == i, 1])+0.2, mean(dimred[clustering == i, 2])+1.5, labels = i, font = 2, cex=2)
+ } #add cluster labels
> lines(SlingshotDataSet(curves), lwd = 3, col = "black", type="c") + lines(SlingshotDataSet(curves), lwd = 3, col = "dodgerblue", type="l")

Capture d’écran, le 2023-09-22 à 13 27 34

kstreet13 commented 9 months ago

Hi @msclz ,

That is a very weird error. Thankfully, in this case, you really don't need to specify that cluster 12 is an endpoint, since Slingshot will pick it up, regardless.
This one is just because the curves are fit by an iterative process which can sometimes be a bit unpredictable. You could try setting extend = "n" in getCurves() and see if that changes things.

And I'm pretty surprised that setting omega = TRUE didn't break the connection between clusters 4 and 2, as they seem pretty distant on the UMAP plot. Although there are some weird points with a mix of different cluster labels nearby, which could be causing that (Louvain clustering can sometimes create issues like this).

Best, Kelly