kstreet13 / slingshot

Functions for identifying and characterizing continuous developmental trajectories in single-cell data.
265 stars 43 forks source link

Generating a lineage for each factor in a SCE object? #102

Closed kmshort closed 3 years ago

kmshort commented 3 years ago

Hi Kelly,

Brilliant work, Slingshot is great. I am missing something. I have two treatments in a single SCE object, sce, (converted from a seurat obj). If I have sce$treatment, and in that two factors - treat1 & treat2, how can I perform slingshot on sce to generate two lineages, one for treat1 and one for treat2? I wish to compare the two lineages, and will probably use something like tradeSeq. I'll add, that the cluster assignments for the cells in the sce object are the same for the two treatments.

many thanks Kieran

kstreet13 commented 3 years ago

Hi Kieran,

Thanks! And I have a couple ideas, though I'm not sure which (if either) will be most appropriate for you.

If you have an a priori reason for creating separate trajectories, the best way to do it would be to split your data by treatment and run Slingshot on each SCE object, individually. Then both will have a slingPseudotime_1 variable and you'll probably want to rename one of them before merging the objects back together (so it's clear they are from separate runs).

Otherwise, what I would generally recommend is ignoring the treatment categories when you run Slingshot. This will give you a shared trajectory that can be used to compare the two treatment conditions within a common framework. That could mean comparing the two distributions of pseudotime values or doing more fine grain, gene-by-gene testing with tradeSeq.

All of this is basically the executive summary of our recent workflow, so if you want more detail on any of this, that might be a good place to start. And if I've misunderstood something about what you're trying to do or if you have more questions, please let me know!

Best, Kelly

kmshort commented 3 years ago

Hi Kelly,

Thanks. I'm happy to work on the integrated data and pull apart variances along a shared trajectory. We do have a priori reason to treat them a little differently, we know that the treatment does change the course of development of the treated cells a bit (they start growing about the same as normal, but then slow down for a bit (some cells stall and don't make it),but the ones that survive lag behind but continue more or less as normal. Maybe there's a mid-trajectory deviation which we can identify (maybe there isn't!).

Thanks for the link to your recent workflow. I've got the tradeSeq-workstopTrajectories.Rmd document and am working through that. However, it's throwing an error for me though right at the fitGam function. And it throws exactly the same error on my data!

This is your code, and the result:

set.seed(3)

sce <- fitGAM(counts = as.matrix(assays(sce)$counts),
              pseudotime = colData(sce)$slingshot$pseudotime,
              cellWeights = colData(sce)$slingshot$cellWeights.V1,
              conditions = factor(colData(sce)$pheno$treatment_id),
              nknots = 5)

mean(rowData(sce)$tradeSeq$converged)

Error in .local(counts, ...) : unused argument (conditions = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

and my session info:

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252 
[2] LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252
[4] LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] parallel  stats4    stats     graphics 
[5] grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] tradeSeq_1.2.01            
 [2] gridExtra_2.3              
 [3] ggplot2_3.3.2              
 [4] knitr_1.30                 
 [5] fgsea_1.14.0               
 [6] msigdbr_7.2.1              
 [7] pheatmap_1.0.12            
 [8] UpSetR_1.4.0               
 [9] viridis_0.5.1              
[10] viridisLite_0.3.0          
[11] scales_1.1.1               
[12] RColorBrewer_1.1-2         
[13] SingleCellExperiment_1.10.1
[14] SummarizedExperiment_1.18.2
[15] DelayedArray_0.14.1        
[16] matrixStats_0.57.0         
[17] Biobase_2.48.0             
[18] GenomicRanges_1.40.0       
[19] GenomeInfoDb_1.24.2        
[20] IRanges_2.22.2             
[21] S4Vectors_0.26.1           
[22] BiocGenerics_0.34.0        
[23] slingshot_1.6.1            
[24] princurve_2.1.5            
[25] RevoUtils_11.0.2           
[26] devtools_2.3.2             
[27] usethis_1.6.3              
[28] RevoUtilsMath_11.0.0       

loaded via a namespace (and not attached):
  [1] tidyselect_1.1.0             
  [2] AnnotationDbi_1.50.3         
  [3] RSQLite_2.2.1                
  [4] grid_4.0.2                   
  [5] combinat_0.0-8               
  [6] docopt_0.7.1                 
  [7] BiocParallel_1.22.0          
  [8] Rtsne_0.15                   
  [9] RNeXML_2.4.5                 
 [10] munsell_0.5.0                
 [11] codetools_0.2-16             
 [12] withr_2.3.0                  
 [13] colorspace_1.4-1             
 [14] fastICA_1.2-2                
 [15] uuid_0.1-4                   
 [16] zinbwave_1.10.1              
 [17] rstudioapi_0.11              
 [18] NMF_0.23.0                   
 [19] slam_0.1-47                  
 [20] GenomeInfoDbData_1.2.3       
 [21] bit64_4.0.5                  
 [22] farver_2.0.3                 
 [23] rhdf5_2.32.3                 
 [24] rprojroot_1.3-2              
 [25] vctrs_0.3.4                  
 [26] generics_0.0.2               
 [27] xfun_0.18                    
 [28] BiocFileCache_1.12.1         
 [29] R6_2.3.0                     
 [30] doParallel_1.0.14            
 [31] VGAM_1.1-4                   
 [32] locfit_1.5-9.4               
 [33] bioc2020trajectories_0.0.0.91
 [34] bitops_1.0-6                 
 [35] assertthat_0.2.1             
 [36] gtable_0.3.0                 
 [37] phylobase_0.8.10             
 [38] processx_3.4.4               
 [39] rlang_0.4.7                  
 [40] genefilter_1.70.0            
 [41] splines_4.0.2                
 [42] lazyeval_0.2.2               
 [43] yaml_2.2.1                   
 [44] reshape2_1.4.4               
 [45] backports_1.1.10             
 [46] tools_4.0.2                  
 [47] gridBase_0.4-7               
 [48] ellipsis_0.3.1               
 [49] sessioninfo_1.1.1            
 [50] Rcpp_1.0.5                   
 [51] plyr_1.8.6                   
 [52] progress_1.2.2               
 [53] zlibbioc_1.34.0              
 [54] purrr_0.3.4                  
 [55] RCurl_1.98-1.2               
 [56] densityClust_0.3             
 [57] ps_1.3.4                     
 [58] prettyunits_1.1.1            
 [59] pbapply_1.4-3                
 [60] ggrepel_0.8.2                
 [61] cluster_2.1.0                
 [62] fs_1.5.0                     
 [63] magrittr_1.5                 
 [64] data.table_1.13.0            
 [65] RSpectra_0.16-0              
 [66] RANN_2.6.1                   
 [67] pkgload_1.1.0                
 [68] hms_0.5.3                    
 [69] evaluate_0.14                
 [70] xtable_1.8-4                 
 [71] XML_3.99-0.5                 
 [72] sparsesvd_0.2                
 [73] HSMMSingleCell_1.8.0         
 [74] testthat_2.3.2               
 [75] compiler_4.0.2               
 [76] tibble_3.0.3                 
 [77] crayon_1.3.4                 
 [78] htmltools_0.5.0              
 [79] mgcv_1.8-31                  
 [80] tidyr_1.1.2                  
 [81] howmany_0.3-1                
 [82] DBI_1.1.0                    
 [83] dbplyr_2.0.0                 
 [84] MASS_7.3-51.6                
 [85] rappdirs_0.3.1               
 [86] Matrix_1.2-18                
 [87] ade4_1.7-16                  
 [88] cli_2.0.2                    
 [89] igraph_1.2.5                 
 [90] forcats_0.5.0                
 [91] pkgconfig_2.0.3              
 [92] rncl_0.8.4                   
 [93] registry_0.5-1               
 [94] locfdr_1.1-8                 
 [95] xml2_1.3.2                   
 [96] foreach_1.5.1                
 [97] annotate_1.66.0              
 [98] rngtools_1.5                 
 [99] pkgmaker_0.32.2              
[100] XVector_0.28.0               
[101] stringr_1.4.0                
[102] callr_3.4.4                  
[103] digest_0.6.25                
[104] softImpute_1.4               
[105] DDRTree_0.1.5                
[106] rmarkdown_2.4                
[107] fastmatch_1.1-0              
[108] edgeR_3.30.3                 
[109] curl_3.3                     
[110] kernlab_0.9-29               
[111] lifecycle_0.2.0              
[112] monocle_2.16.0               
[113] nlme_3.1-148                 
[114] clusterExperiment_2.8.0      
[115] Rhdf5lib_1.10.1              
[116] desc_1.2.0                   
[117] limma_3.44.3                 
[118] fansi_0.4.1                  
[119] pillar_1.4.6                 
[120] lattice_0.20-41              
[121] survival_3.1-12              
[122] httr_1.4.2                   
[123] pkgbuild_1.1.0               
[124] glue_1.4.2                   
[125] remotes_2.2.0                
[126] qlcMatrix_0.9.7              
[127] FNN_1.1.3                    
[128] iterators_1.0.11             
[129] bit_4.0.4                    
[130] stringi_1.5.3                
[131] HDF5Array_1.16.1             
[132] blob_1.2.1                   
[133] memoise_1.1.0                
[134] dplyr_1.0.2                  
[135] irlba_2.3.3                  
[136] ape_5.4-1 

Up until here though, the Rmd renders the same as your output (and it works for my data as well up until this point). Are you able to help?

best regards, Kieran

kstreet13 commented 3 years ago

Hi Kieran,

I think I know what's going on here. The conditions argument was only added in the most recent version of tradeSeq (v. 1.4.0). If you update your tradeSeq installation, I think you should be all set!

Best, Kelly

Sorry, minor update: you may also need to update Bioconductor, via BiocManager::install(version = "3.12") (I'm not sure which version you currently have).

kmshort commented 3 years ago

Hi Kelly,

Thanks for that update! It made the difference. Going to Bioconductor 3.12 (I was on 3.10) did the trick. I tried before going before 3.10, and it only installed tradeSeq 1.2. That's most odd, I would've thought it wouldn't matter what version of Bioconductor you have, that it'd provide the latest version.

So most of your Rmd now runs, except for a chunk under "#### Visualization of DE genes" :

```{r, eval=FALSE}
conditions <- colData(sce)$pheno$treatment_id
pt1 <- colData(sce)$slingshot$pseudotime

### based on fitted values (plotting takes a while to run)
yhatCell <- predictCells(sce, gene=mockGenes)
yhatCellMock <- yhatCell[,conditions == "Mock"]
# order according to pseudotime
ooMock <- order(pt1[conditions == "Mock"], decreasing=FALSE)
yhatCellMock <- yhatCellMock[,ooMock]
pheatmap(t(scale(t(yhatCellMock))), cluster_cols = FALSE,
          show_rownames = FALSE, show_colnames=FALSE) ```

pheatmap fails with Error in hclust(d, method = method) : NA/NaN/Inf in foreign function call (arg 10)

It's caused because colData(sce)$pheno$treatment_id is absent by this time in the script. I don't know why but it happens after fitGAM. But this causes the "conditions" object to be empty, and the knock on effect is pheatmap can't be run. I noticed that in your rendered HTML the output of this chunk is not included. So maybe you skipped over this chunk intentionally. It can be made to work by changing conditions: conditions <- colData(sce)$tradeSeq$conditions

One other thing, in the section #### Gene set enrichment analysis on genes from the Mock condition and ### Gene set enrichment analysis section at the bottom When 'fgsea' is run, a console message is displayed:

You are trying to run fgseaSimple. It is recommended to use fgseaMultilevel.
To run fgseaMultilevel, you need to remove the nperm argument in the fgsea function call.
There are ties in the preranked stats (0.03% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.

The GO terms are generally similar, but the order and p-values are different. I'm not sure what to make of that. I'm using tradeSeq 1.4 now, and the HTML rendered was 1.3.13; maybe that is the difference. Also, your rendered HTML uses fgsea 1.15, but I have 1.16.

Finally: I find the bioc2020trajectories::imbalance_score interesting, will that be added to tradeSeq? In the workshop tutorial, you say that "Regions with a high score indicate that the local cell distribution according to treatment label is unbalanced compared the overall distribution". Can you explain further? Is it a measure of how far, say, TGFB cells (in your example) are away from a point on the nearest pseudotime "line"? How do we know which treatment label is causing the deviation from the overall pseudotime?

When I run that on my data, there are a lot of yellow patches. What would you suggest I do in this situation?

Sorry for the many questions - I admit that they are getting very off-topic.

best regards, Kieran

kstreet13 commented 3 years ago

Hi Kieran,

Hm, that is strange. I'm pretty sure that chunk is skipped because it takes a while to run and the resulting plot is not particularly informative (I was just able to re-generated it and it's basically just a blue rectangle...). That said, I did not have the same issue with colData(sce)$pheno$treatment_id being renamed to colData(sce)$pheno$conditions, so I'm not sure what's causing that (and I am running the same version of tradeSeq, 1.4). It looks like there is some redundancy within the SCE object, so sce$tradeSeq$conditions, sce$pheno$treatment_id, colData(sce)$pheno$treatment_id, and colData(sce)$tradeSeq$conditions are all returning the same thing, for me.

As for fgsea, I can't claim much knowledge, there. The same warning message shows up in our workflow and for me (running version 1.16). I guess if we were doing this analysis for a publication, we would take the advice and run fgseaMultilevel, but this seemed sufficient for our purposes. As for the different p-values, I'm guessing that's because it uses a permutation test, so everything would have to be exactly the same between your environment and ours when we compiled the document in order to perfectly re-create the same set of permutations (presumably including the package version). I also get different p-values depending on whether I run the code in R Studio or knit the entire document, so I'm not sure how possible it will be to reproduce the exact values from our original.

And regarding the imbalance scores, I think @HectorRDB would be able to give bette answers than me, as he has been leading the development, there. The very basic idea is that these scores attempt to answer the question "how well-balanced are the conditions in a local neighborhood of the data?", which is independent of any trajectory inference/pseudotime (we even run it before running slingshot). As far as I know, there is no indication of a direction (ie. which condition is more/less abundant), just imbalance. At the moment, I think that is sufficiently distinct from the rest of tradeSeq that it probably won't be incorporated, but we're still thinking about this. If you're interested, I also just saw this preprint the other day that seems to be doing something similar: https://www.biorxiv.org/content/10.1101/2020.11.23.393769v1.

Best, Kelly

HectorRDB commented 3 years ago

To pile up on @kstreet13's point. The imbalance score is looking at neighborhood and comparing the local distribution of condition labels to the global distribution.

In details, we are using a kNN-graph. For each cell, we test whether the condition labels of the k-neighbour comes from the global distribution of labels using a multinomial test. This yields a z-score per cell that is then smoother in the reduced dimension space. You can play with both k and smooth to see how changing the definition of the neighborhood impact the score.
I would not focus so much on the exact value and more on the distribution of values, this is meant to be a visual tool.

We are also planning to release a package, hopefully rather soon, with this and more functionalities on bioconductor.

kmshort commented 3 years ago

Thanks @HectorRDB I'll have to chew on that for a while. I'm a biologist, so I'll have to think about what the "global distribution" means.

Hi @kstreet13 ,

I tried the pheatmap chunk, which indeed was skipped, for the sake of seeing what it did..

Also, the execution of fitGAM and inputting it into the sce object strips out the pre-figGAM sce object information (but it keeps sce$slingshot).

Now, moving on to what I actually am trying to do with my own data: I have found a 'bug' with the fitGAM condition option.

I set up my sce object as you do in the workshop vignette, and evaluateK works:

evaluateK(counts = as.matrix(assays(sce)$counts),
                   pseudotime = colData(sce)$slingshot$pseudotime.curve1,
                   cellWeights = colData(sce)$slingshot$cellWeights.curve1,
                   conditions = factor(colData(sce)$Diet),
                   nGenes = 500,
                   k = 3:7)

but this did not work:

sce <- fitGAM(counts = as.matrix(assays(sce)$counts),
              pseudotime = colData(sce)$slingshot$pseudotime.curve1,
              cellWeights = colData(sce)$slingshot$cellWeights.curve1,
              conditions = factor(colData(sce)$Diet),
              nknots = 6)

it errors with:

the condition has length > 1 and only the first element will be used
Error in .local(counts, ...) : conditions must be a factor vector.

I'm certain that sce$Diet is a factor, I input it into the sce object as factor to begin with.

is.factor(sce$Diet)
[1] TRUE

and using it in the conditions argument that I've used, it is still is a factor.

is.factor(colData(sce)$Diet)
[1] TRUE

When I check your workshop vignette sce object, I see: Levels: Mock TGFB with mine: Levels: NPD < LPD

Hmmm.. I saw no reason why it shouldn't work. But I did notice that I had ordered my factors (for a simple reason, that it plots in the order that I want).

So I tried:

sce <- fitGAM(counts = as.matrix(assays(sce)$counts),
              pseudotime = colData(sce)$slingshot$pseudotime.curve1,
              cellWeights = colData(sce)$slingshot$cellWeights.curve1,
              conditions = factor(colData(sce)$Diet, ordered = FALSE),
              nknots = 6)

And it is running. I can't tell if it'll run to completion yet (I'm in for an 8 hour 15 min wait).

Maybe it's worth running a pre-flight check in fitGAM to firstly see if conditions is a factor. Then if a factor, check if ordered. Then if ordered, eiher throw an error saying so, or just unorder them or ignore the order completely?

best regards, Kieran

HectorRDB commented 3 years ago

Hi @kmshort Sorry I should have been more explicit.

Let's say you have two conditions NPD and LPD and cells are 50% for each condition overall on your dataset (i.e. the global distribution). The imbalance score will look at the distribution among the k neighbors of each cell. That is, if among the k neighbor of a given cell, 70% of the cells are NPD, it will compute how likely it is to have this level of imbalance compared to the global balance of 50%. Then, we smooth to make for nicer visuals.

I'll also help you with the other issue since it is more tradeSeq related than slingshot. It is indeed kind of weird. Just to clarify, do you have three levels (NPD, < and LPD) or two levels (NPD and LPD)?

Also, could you paste the traceback of your error? Thanks

kmshort commented 3 years ago

Hi @HectorRDB ,

Let's say you have two conditions NPD and LPD and cells are 50% for each condition overall on your dataset (i.e. the global distribution). The imbalance score will look at the distribution among the k neighbors of each cell. That is, if among the k neighbor of a given cell, 70% of the cells are NPD, it will compute how likely it is to have this level of imbalance compared to the global balance of 50%. Then, we smooth to make for nicer visuals.

Thanks, I get that. So in your case ever cell should have 50/50 around it, but if it doesn't, it'll show it. Or in an alternative case, if the condition is 30/70 to begin with, if around every cell it isn't 30/70, it will show it. It more or less normalises for whatever the relative contribution of cells there is in the opposing datasets, on a local level in the plot.

I'll also help you with the other issue since it is more tradeSeq related than slingshot. It is indeed kind of weird. Just to clarify, do you have three levels (NPD, < and LPD) or two levels (NPD and LPD)?

They are two levels NPD and LPD, but ordered; so are NPD < LPD

set.seed(3)
sce <- fitGAM(counts = as.matrix(assays(sce)$counts),
+               pseudotime = colData(sce)$slingshot$pseudotime.curve1,
+               cellWeights = colData(sce)$slingshot$cellWeights.curve1,
+               conditions = factor(colData(sce)$Diet),
+               nknots = 3)

the condition has length > 1 and only the first element will be usedError in .local(counts, ...) : conditions must be a factor vector.

> traceback()

4: stop("conditions must be a factor vector.")
3: .local(counts, ...)
2: fitGAM(counts = as.matrix(assays(sce)$counts), pseudotime = colData(sce)$slingshot$pseudotime.curve1, 
       cellWeights = colData(sce)$slingshot$cellWeights.curve1, 
       conditions = factor(colData(sce)$Diet), nknots = 3)
1: fitGAM(counts = as.matrix(assays(sce)$counts), pseudotime = colData(sce)$slingshot$pseudotime.curve1, 
       cellWeights = colData(sce)$slingshot$cellWeights.curve1, 
       conditions = factor(colData(sce)$Diet), nknots = 3)

I'm not sure how much this helps

best regards.

HectorRDB commented 3 years ago

I think the ordering might be the issue. Could you tell me the output of class(factor(colData(sce)$Diet)) Thanks

kmshort commented 3 years ago

Hi @HectorRDB , I know it's the issue. I fed in my conditions unordered and it didn't error out (it still has 5 hours to go, so I haven't seen it complete yet). What I used when it failed: class(factor(colData(sce)$Diet)) [1] "ordered" "factor"

What I used when it worked:

class(factor(colData(sce)$Diet, ordered = FALSE)) [1] "factor"

It's just a bit inconsistent with evaluateK because it doesn't seem to care if the factor is ordered or not.

thanks

HectorRDB commented 3 years ago

Ok, I had not see the edited message. If runtime is an issue, I would suggest filtering your dataset first or using parallelization. I will work to add a more informative error message for ordered factors. Pinging @koenvandenberge here as well.

kmshort commented 3 years ago

Thanks @HectorRDB , my apologies there (with the Edit). Have you also noticed that if you have your original SCE file, and run fitGAM on it, with conditions, inserting the result back into he original SCE file, that it wipes out the original SCE data? It didn't do it when I was on the older version of tradeSeq, but since I updated to tradeSeq 1.4, it does that now. It's quite a pain actually, but can't be sure that it's not just me.

HectorRDB commented 3 years ago

No problem. I am unsure I understand exactly the issue, could you please show a reproducible error. Thanks a lt