Closed kmshort closed 3 years ago
Hi Kieran,
Thanks! And I have a couple ideas, though I'm not sure which (if either) will be most appropriate for you.
If you have an a priori reason for creating separate trajectories, the best way to do it would be to split your data by treatment
and run Slingshot on each SCE object, individually. Then both will have a slingPseudotime_1
variable and you'll probably want to rename one of them before merging the objects back together (so it's clear they are from separate runs).
Otherwise, what I would generally recommend is ignoring the treatment
categories when you run Slingshot. This will give you a shared trajectory that can be used to compare the two treatment conditions within a common framework. That could mean comparing the two distributions of pseudotime values or doing more fine grain, gene-by-gene testing with tradeSeq
.
All of this is basically the executive summary of our recent workflow, so if you want more detail on any of this, that might be a good place to start. And if I've misunderstood something about what you're trying to do or if you have more questions, please let me know!
Best, Kelly
Hi Kelly,
Thanks. I'm happy to work on the integrated data and pull apart variances along a shared trajectory. We do have a priori reason to treat them a little differently, we know that the treatment does change the course of development of the treated cells a bit (they start growing about the same as normal, but then slow down for a bit (some cells stall and don't make it),but the ones that survive lag behind but continue more or less as normal. Maybe there's a mid-trajectory deviation which we can identify (maybe there isn't!).
Thanks for the link to your recent workflow. I've got the tradeSeq-workstopTrajectories.Rmd document and am working through that. However, it's throwing an error for me though right at the fitGam function. And it throws exactly the same error on my data!
This is your code, and the result:
set.seed(3)
sce <- fitGAM(counts = as.matrix(assays(sce)$counts),
pseudotime = colData(sce)$slingshot$pseudotime,
cellWeights = colData(sce)$slingshot$cellWeights.V1,
conditions = factor(colData(sce)$pheno$treatment_id),
nknots = 5)
mean(rowData(sce)$tradeSeq$converged)
Error in .local(counts, ...) : unused argument (conditions = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
and my session info:
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_Australia.1252
[2] LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252
attached base packages:
[1] parallel stats4 stats graphics
[5] grDevices utils datasets methods
[9] base
other attached packages:
[1] tradeSeq_1.2.01
[2] gridExtra_2.3
[3] ggplot2_3.3.2
[4] knitr_1.30
[5] fgsea_1.14.0
[6] msigdbr_7.2.1
[7] pheatmap_1.0.12
[8] UpSetR_1.4.0
[9] viridis_0.5.1
[10] viridisLite_0.3.0
[11] scales_1.1.1
[12] RColorBrewer_1.1-2
[13] SingleCellExperiment_1.10.1
[14] SummarizedExperiment_1.18.2
[15] DelayedArray_0.14.1
[16] matrixStats_0.57.0
[17] Biobase_2.48.0
[18] GenomicRanges_1.40.0
[19] GenomeInfoDb_1.24.2
[20] IRanges_2.22.2
[21] S4Vectors_0.26.1
[22] BiocGenerics_0.34.0
[23] slingshot_1.6.1
[24] princurve_2.1.5
[25] RevoUtils_11.0.2
[26] devtools_2.3.2
[27] usethis_1.6.3
[28] RevoUtilsMath_11.0.0
loaded via a namespace (and not attached):
[1] tidyselect_1.1.0
[2] AnnotationDbi_1.50.3
[3] RSQLite_2.2.1
[4] grid_4.0.2
[5] combinat_0.0-8
[6] docopt_0.7.1
[7] BiocParallel_1.22.0
[8] Rtsne_0.15
[9] RNeXML_2.4.5
[10] munsell_0.5.0
[11] codetools_0.2-16
[12] withr_2.3.0
[13] colorspace_1.4-1
[14] fastICA_1.2-2
[15] uuid_0.1-4
[16] zinbwave_1.10.1
[17] rstudioapi_0.11
[18] NMF_0.23.0
[19] slam_0.1-47
[20] GenomeInfoDbData_1.2.3
[21] bit64_4.0.5
[22] farver_2.0.3
[23] rhdf5_2.32.3
[24] rprojroot_1.3-2
[25] vctrs_0.3.4
[26] generics_0.0.2
[27] xfun_0.18
[28] BiocFileCache_1.12.1
[29] R6_2.3.0
[30] doParallel_1.0.14
[31] VGAM_1.1-4
[32] locfit_1.5-9.4
[33] bioc2020trajectories_0.0.0.91
[34] bitops_1.0-6
[35] assertthat_0.2.1
[36] gtable_0.3.0
[37] phylobase_0.8.10
[38] processx_3.4.4
[39] rlang_0.4.7
[40] genefilter_1.70.0
[41] splines_4.0.2
[42] lazyeval_0.2.2
[43] yaml_2.2.1
[44] reshape2_1.4.4
[45] backports_1.1.10
[46] tools_4.0.2
[47] gridBase_0.4-7
[48] ellipsis_0.3.1
[49] sessioninfo_1.1.1
[50] Rcpp_1.0.5
[51] plyr_1.8.6
[52] progress_1.2.2
[53] zlibbioc_1.34.0
[54] purrr_0.3.4
[55] RCurl_1.98-1.2
[56] densityClust_0.3
[57] ps_1.3.4
[58] prettyunits_1.1.1
[59] pbapply_1.4-3
[60] ggrepel_0.8.2
[61] cluster_2.1.0
[62] fs_1.5.0
[63] magrittr_1.5
[64] data.table_1.13.0
[65] RSpectra_0.16-0
[66] RANN_2.6.1
[67] pkgload_1.1.0
[68] hms_0.5.3
[69] evaluate_0.14
[70] xtable_1.8-4
[71] XML_3.99-0.5
[72] sparsesvd_0.2
[73] HSMMSingleCell_1.8.0
[74] testthat_2.3.2
[75] compiler_4.0.2
[76] tibble_3.0.3
[77] crayon_1.3.4
[78] htmltools_0.5.0
[79] mgcv_1.8-31
[80] tidyr_1.1.2
[81] howmany_0.3-1
[82] DBI_1.1.0
[83] dbplyr_2.0.0
[84] MASS_7.3-51.6
[85] rappdirs_0.3.1
[86] Matrix_1.2-18
[87] ade4_1.7-16
[88] cli_2.0.2
[89] igraph_1.2.5
[90] forcats_0.5.0
[91] pkgconfig_2.0.3
[92] rncl_0.8.4
[93] registry_0.5-1
[94] locfdr_1.1-8
[95] xml2_1.3.2
[96] foreach_1.5.1
[97] annotate_1.66.0
[98] rngtools_1.5
[99] pkgmaker_0.32.2
[100] XVector_0.28.0
[101] stringr_1.4.0
[102] callr_3.4.4
[103] digest_0.6.25
[104] softImpute_1.4
[105] DDRTree_0.1.5
[106] rmarkdown_2.4
[107] fastmatch_1.1-0
[108] edgeR_3.30.3
[109] curl_3.3
[110] kernlab_0.9-29
[111] lifecycle_0.2.0
[112] monocle_2.16.0
[113] nlme_3.1-148
[114] clusterExperiment_2.8.0
[115] Rhdf5lib_1.10.1
[116] desc_1.2.0
[117] limma_3.44.3
[118] fansi_0.4.1
[119] pillar_1.4.6
[120] lattice_0.20-41
[121] survival_3.1-12
[122] httr_1.4.2
[123] pkgbuild_1.1.0
[124] glue_1.4.2
[125] remotes_2.2.0
[126] qlcMatrix_0.9.7
[127] FNN_1.1.3
[128] iterators_1.0.11
[129] bit_4.0.4
[130] stringi_1.5.3
[131] HDF5Array_1.16.1
[132] blob_1.2.1
[133] memoise_1.1.0
[134] dplyr_1.0.2
[135] irlba_2.3.3
[136] ape_5.4-1
Up until here though, the Rmd renders the same as your output (and it works for my data as well up until this point). Are you able to help?
best regards, Kieran
Hi Kieran,
I think I know what's going on here. The conditions
argument was only added in the most recent version of tradeSeq
(v. 1.4.0). If you update your tradeSeq
installation, I think you should be all set!
Best, Kelly
Sorry, minor update: you may also need to update Bioconductor, via BiocManager::install(version = "3.12")
(I'm not sure which version you currently have).
Hi Kelly,
Thanks for that update! It made the difference. Going to Bioconductor 3.12 (I was on 3.10) did the trick. I tried before going before 3.10, and it only installed tradeSeq 1.2. That's most odd, I would've thought it wouldn't matter what version of Bioconductor you have, that it'd provide the latest version.
So most of your Rmd now runs, except for a chunk under "#### Visualization of DE genes" :
```{r, eval=FALSE}
conditions <- colData(sce)$pheno$treatment_id
pt1 <- colData(sce)$slingshot$pseudotime
### based on fitted values (plotting takes a while to run)
yhatCell <- predictCells(sce, gene=mockGenes)
yhatCellMock <- yhatCell[,conditions == "Mock"]
# order according to pseudotime
ooMock <- order(pt1[conditions == "Mock"], decreasing=FALSE)
yhatCellMock <- yhatCellMock[,ooMock]
pheatmap(t(scale(t(yhatCellMock))), cluster_cols = FALSE,
show_rownames = FALSE, show_colnames=FALSE) ```
pheatmap
fails with
Error in hclust(d, method = method) : NA/NaN/Inf in foreign function call (arg 10)
It's caused because colData(sce)$pheno$treatment_id is absent by this time in the script. I don't know why but it happens after fitGAM. But this causes the "conditions" object to be empty, and the knock on effect is pheatmap can't be run.
I noticed that in your rendered HTML the output of this chunk is not included. So maybe you skipped over this chunk intentionally. It can be made to work by changing conditions:
conditions <- colData(sce)$tradeSeq$conditions
One other thing, in the section
#### Gene set enrichment analysis on genes from the Mock condition
and
### Gene set enrichment analysis
section at the bottom
When 'fgsea' is run, a console message is displayed:
You are trying to run fgseaSimple. It is recommended to use fgseaMultilevel.
To run fgseaMultilevel, you need to remove the nperm argument in the fgsea function call.
There are ties in the preranked stats (0.03% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
The GO terms are generally similar, but the order and p-values are different. I'm not sure what to make of that. I'm using tradeSeq 1.4 now, and the HTML rendered was 1.3.13; maybe that is the difference. Also, your rendered HTML uses fgsea 1.15, but I have 1.16.
Finally: I find the bioc2020trajectories::imbalance_score interesting, will that be added to tradeSeq? In the workshop tutorial, you say that "Regions with a high score indicate that the local cell distribution according to treatment label is unbalanced compared the overall distribution". Can you explain further? Is it a measure of how far, say, TGFB cells (in your example) are away from a point on the nearest pseudotime "line"? How do we know which treatment label is causing the deviation from the overall pseudotime?
When I run that on my data, there are a lot of yellow patches. What would you suggest I do in this situation?
Sorry for the many questions - I admit that they are getting very off-topic.
best regards, Kieran
Hi Kieran,
Hm, that is strange. I'm pretty sure that chunk is skipped because it takes a while to run and the resulting plot is not particularly informative (I was just able to re-generated it and it's basically just a blue rectangle...). That said, I did not have the same issue with colData(sce)$pheno$treatment_id
being renamed to colData(sce)$pheno$conditions
, so I'm not sure what's causing that (and I am running the same version of tradeSeq
, 1.4). It looks like there is some redundancy within the SCE object, so sce$tradeSeq$conditions
, sce$pheno$treatment_id
, colData(sce)$pheno$treatment_id
, and colData(sce)$tradeSeq$conditions
are all returning the same thing, for me.
As for fgsea
, I can't claim much knowledge, there. The same warning message shows up in our workflow and for me (running version 1.16). I guess if we were doing this analysis for a publication, we would take the advice and run fgseaMultilevel
, but this seemed sufficient for our purposes. As for the different p-values, I'm guessing that's because it uses a permutation test, so everything would have to be exactly the same between your environment and ours when we compiled the document in order to perfectly re-create the same set of permutations (presumably including the package version). I also get different p-values depending on whether I run the code in R Studio or knit the entire document, so I'm not sure how possible it will be to reproduce the exact values from our original.
And regarding the imbalance scores, I think @HectorRDB would be able to give bette answers than me, as he has been leading the development, there. The very basic idea is that these scores attempt to answer the question "how well-balanced are the conditions in a local neighborhood of the data?", which is independent of any trajectory inference/pseudotime (we even run it before running slingshot
). As far as I know, there is no indication of a direction (ie. which condition is more/less abundant), just imbalance. At the moment, I think that is sufficiently distinct from the rest of tradeSeq
that it probably won't be incorporated, but we're still thinking about this. If you're interested, I also just saw this preprint the other day that seems to be doing something similar: https://www.biorxiv.org/content/10.1101/2020.11.23.393769v1.
Best, Kelly
To pile up on @kstreet13's point. The imbalance score is looking at neighborhood and comparing the local distribution of condition labels to the global distribution.
In details, we are using a kNN-graph. For each cell, we test whether the condition labels of the k-neighbour comes from the global distribution of labels using a multinomial test. This yields a z-score per cell that is then smoother in the reduced dimension space. You can play with both k
and smooth
to see how changing the definition of the neighborhood impact the score.
I would not focus so much on the exact value and more on the distribution of values, this is meant to be a visual tool.
We are also planning to release a package, hopefully rather soon, with this and more functionalities on bioconductor.
Thanks @HectorRDB I'll have to chew on that for a while. I'm a biologist, so I'll have to think about what the "global distribution" means.
Hi @kstreet13 ,
I tried the pheatmap chunk, which indeed was skipped, for the sake of seeing what it did..
Also, the execution of fitGAM and inputting it into the sce object strips out the pre-figGAM sce object information (but it keeps sce$slingshot).
Now, moving on to what I actually am trying to do with my own data: I have found a 'bug' with the fitGAM condition
option.
I set up my sce object as you do in the workshop vignette, and evaluateK works:
evaluateK(counts = as.matrix(assays(sce)$counts),
pseudotime = colData(sce)$slingshot$pseudotime.curve1,
cellWeights = colData(sce)$slingshot$cellWeights.curve1,
conditions = factor(colData(sce)$Diet),
nGenes = 500,
k = 3:7)
but this did not work:
sce <- fitGAM(counts = as.matrix(assays(sce)$counts),
pseudotime = colData(sce)$slingshot$pseudotime.curve1,
cellWeights = colData(sce)$slingshot$cellWeights.curve1,
conditions = factor(colData(sce)$Diet),
nknots = 6)
it errors with:
the condition has length > 1 and only the first element will be used
Error in .local(counts, ...) : conditions must be a factor vector.
I'm certain that sce$Diet
is a factor, I input it into the sce object as factor to begin with.
is.factor(sce$Diet)
[1] TRUE
and using it in the conditions
argument that I've used, it is still is a factor.
is.factor(colData(sce)$Diet)
[1] TRUE
When I check your workshop vignette sce object, I see:
Levels: Mock TGFB
with mine:
Levels: NPD < LPD
Hmmm.. I saw no reason why it shouldn't work. But I did notice that I had ordered my factors (for a simple reason, that it plots in the order that I want).
So I tried:
sce <- fitGAM(counts = as.matrix(assays(sce)$counts),
pseudotime = colData(sce)$slingshot$pseudotime.curve1,
cellWeights = colData(sce)$slingshot$cellWeights.curve1,
conditions = factor(colData(sce)$Diet, ordered = FALSE),
nknots = 6)
And it is running. I can't tell if it'll run to completion yet (I'm in for an 8 hour 15 min wait).
Maybe it's worth running a pre-flight check in fitGAM to firstly see if conditions
is a factor. Then if a factor, check if ordered. Then if ordered, eiher throw an error saying so, or just unorder them or ignore the order completely?
best regards, Kieran
Hi @kmshort Sorry I should have been more explicit.
Let's say you have two conditions NPD
and LPD
and cells are 50% for each condition overall on your dataset (i.e. the global distribution). The imbalance score will look at the distribution among the k neighbors of each cell. That is, if among the k neighbor of a given cell, 70% of the cells are NPD
, it will compute how likely it is to have this level of imbalance compared to the global balance of 50%. Then, we smooth to make for nicer visuals.
I'll also help you with the other issue since it is more tradeSeq
related than slingshot
. It is indeed kind of weird. Just to clarify, do you have three levels (NPD
, <
and LPD
) or two levels (NPD
and LPD
)?
Also, could you paste the traceback
of your error?
Thanks
Hi @HectorRDB ,
Let's say you have two conditions
NPD
andLPD
and cells are 50% for each condition overall on your dataset (i.e. the global distribution). The imbalance score will look at the distribution among the k neighbors of each cell. That is, if among the k neighbor of a given cell, 70% of the cells areNPD
, it will compute how likely it is to have this level of imbalance compared to the global balance of 50%. Then, we smooth to make for nicer visuals.
Thanks, I get that. So in your case ever cell should have 50/50 around it, but if it doesn't, it'll show it. Or in an alternative case, if the condition is 30/70 to begin with, if around every cell it isn't 30/70, it will show it. It more or less normalises for whatever the relative contribution of cells there is in the opposing datasets, on a local level in the plot.
I'll also help you with the other issue since it is more
tradeSeq
related thanslingshot
. It is indeed kind of weird. Just to clarify, do you have three levels (NPD
,<
andLPD
) or two levels (NPD
andLPD
)?
They are two levels NPD
and LPD
, but ordered; so are NPD < LPD
set.seed(3)
sce <- fitGAM(counts = as.matrix(assays(sce)$counts),
+ pseudotime = colData(sce)$slingshot$pseudotime.curve1,
+ cellWeights = colData(sce)$slingshot$cellWeights.curve1,
+ conditions = factor(colData(sce)$Diet),
+ nknots = 3)
the condition has length > 1 and only the first element will be usedError in .local(counts, ...) : conditions must be a factor vector.
> traceback()
4: stop("conditions must be a factor vector.")
3: .local(counts, ...)
2: fitGAM(counts = as.matrix(assays(sce)$counts), pseudotime = colData(sce)$slingshot$pseudotime.curve1,
cellWeights = colData(sce)$slingshot$cellWeights.curve1,
conditions = factor(colData(sce)$Diet), nknots = 3)
1: fitGAM(counts = as.matrix(assays(sce)$counts), pseudotime = colData(sce)$slingshot$pseudotime.curve1,
cellWeights = colData(sce)$slingshot$cellWeights.curve1,
conditions = factor(colData(sce)$Diet), nknots = 3)
I'm not sure how much this helps
best regards.
I think the ordering might be the issue.
Could you tell me the output of class(factor(colData(sce)$Diet))
Thanks
Hi @HectorRDB ,
I know it's the issue. I fed in my conditions
unordered and it didn't error out (it still has 5 hours to go, so I haven't seen it complete yet).
What I used when it failed:
class(factor(colData(sce)$Diet))
[1] "ordered" "factor"
What I used when it worked:
class(factor(colData(sce)$Diet, ordered = FALSE))
[1] "factor"
It's just a bit inconsistent with evaluateK
because it doesn't seem to care if the factor is ordered or not.
thanks
Ok, I had not see the edited message. If runtime is an issue, I would suggest filtering your dataset first or using parallelization. I will work to add a more informative error message for ordered factors. Pinging @koenvandenberge here as well.
Thanks @HectorRDB , my apologies there (with the Edit). Have you also noticed that if you have your original SCE file, and run fitGAM on it, with conditions, inserting the result back into he original SCE file, that it wipes out the original SCE data? It didn't do it when I was on the older version of tradeSeq, but since I updated to tradeSeq 1.4, it does that now. It's quite a pain actually, but can't be sure that it's not just me.
No problem. I am unsure I understand exactly the issue, could you please show a reproducible error. Thanks a lt
Hi Kelly,
Brilliant work, Slingshot is great. I am missing something. I have two treatments in a single SCE object, sce, (converted from a seurat obj). If I have sce$treatment, and in that two factors - treat1 & treat2, how can I perform slingshot on sce to generate two lineages, one for treat1 and one for treat2? I wish to compare the two lineages, and will probably use something like tradeSeq. I'll add, that the cluster assignments for the cells in the sce object are the same for the two treatments.
many thanks Kieran