statOmics / tradeSeq

TRAjectory-based Differential Expression analysis for SEQuencing data
Other
238 stars 27 forks source link

CellRank interface #164

Open Marius1311 opened 3 years ago

Marius1311 commented 3 years ago

Hi TradeSeq team! I think you guys developed an awesome tool, and we were discussing recently whether and if we could interface to it from CellRank. CellRank defines trajectories through a soft assignment of cells to branches and one global pseudotime. So the idea would be to construct an interface that would allow users to seamlessly transition from CellRank to tradeSeq, once the trajectories have been defined. What do you think of this idea, and what data structures would you require on your end for such an interface? Looking fwd to your thoughts!

HectorRDB commented 3 years ago

Hi @Marius1311. Thanks for the idea, I think it would indeed be useful to users.
The main tradeSeq functions is fitGam, which fit the smoothers for each gene. It accepts several input formats. The most useful would be the following:

Then, fitGam outputs a singleCellExperiment Object with the smoother information stored in the rowData slots.

Adding a direct convertor to the package would require us to list CellRank as a dependency and that would mean having python and so on, which is kind of a nightmare with bioconductor.

However, we can easily write a converter function that would live in your package and produces all the input needed for tradeSeq. We can also (or instead) write a small vignette together on how to use tradeSeq downstream of CellRank.

Anyway, it sounds like a useful idea!! Thanks for reaching out

Marius1311 commented 3 years ago

Hi @HectorRDB, thanks for getting back to me! I'm channeling in @michalk8 for the technical details of constructing the interface. @michalk8, how would you construct the interface?

@HectorRDB, how are cellWeights used internally? Do you just assign every cell to it's argmax lineage, or are the weights taken into account when fitting the GAMs?

Writing a small vignette/tutorial together would be a great idea, I'm all in for that. Let's first discuss the interface, and then we move to that!

HectorRDB commented 3 years ago

Hi @Marius1311 The cellWeights are first normalized to sum to 1 and then cells are randomly assigned to one lineage according to a multinomial distribution with the normalized weights acting as event probabilities (see here for the relevant piece of code).

As an aside, we also tried a version where cells were assigned to all lineages, with the cellWeights acting as observation weights. However, it lead to nearly identical fits and p-values when testing for DE and it made the smoothers less interpretable.

I agree with the plan, let's weight for your college's input.

michalk8 commented 3 years ago

Writing a small vignette/tutorial together would be a great idea, I'm all in for that. Let's first discuss the interface, and then we move to that!

The interface really depends on how we want to use tradeSeq's GAM (i.e. either only for significance testing, or whether we want to use the model in our functions, such as cellrank.pl.cluster_fates). If only the former is wanted, than, simply wrapping these with rpy2 in cellrank.external is not an issue and is fairly straightforward. On the other hand, if we'd like to have a tradeSeq model in cellrank.external.models conforming to our API, it's a bit more tricky, but I already have an idea. Is there a way to extract the model predictions, as well as the confidence intervals. predictCells seems to return only the former (the latter is only necessary for plotting).

Marius1311 commented 3 years ago

@HectorRDB, that's surprises me, because we do just that in CellRank (we fit GAMs to visualize gene expression trends along lineages, passing lineage-probabilities as cell-level weights into the loss function) , see below for some example fits from our preprint. I wonder why that didn't work in your case - I think softly assigning cells to lineages via cell-level weights makes more sense because if you just sample cells, you throw away some of the information you have, don't you?

Here's the relevant part of CellRank's API: https://cellrank.readthedocs.io/en/stable/api/cellrank.ul.models.GAMR.html#cellrank.ul.models.GAMR

Screenshot 2021-08-11 at 14 23 37
koenvandenberge commented 3 years ago

Thank you for these suggestions, @Marius1311. It would indeed be useful and relevant to see if we can try to streamline a workflow from CellRank to tradeSeq and we'll be glad to help.

As @HectorRDB is mentioning we have indeed also considered soft assignment of cells to lineages. We considered two options:

Marius1311 commented 3 years ago

Thanks @koenvandenberge! @michalk8, how exactly do we use the weights in mgcv?

michalk8 commented 3 years ago

Thanks @koenvandenberge! @michalk8, how exactly do we use the weights in mgcv?

By default, we use this formula y ~ s(x, bs='cr', k=5, sp=1.0).

Marius1311 commented 3 years ago

So where do the weights come in @michalk8? Is it the first or the second possibility that @koenvandenberge describes?

michalk8 commented 3 years ago

So where do the weights come in @michalk8? Is it the first or the second possibility that @koenvandenberge describes?

The second, we're passing it as weights in mgcv::gam.

Marius1311 commented 3 years ago

Hi @koenvandenberge, should we have a zoom chat about this sometimes? Seems easier to me.

koenvandenberge commented 3 years ago

Sounds good to me.

This week, I believe that both @HectorRDB and I are on Central European Time (but it's possible @HectorRDB is taking some time off). As of next week I will be in California and so it might be a little bit harder to find a good time to meet.

This week I can do

Marius1311 commented 3 years ago

@michalk8, would Friday morning work for you?

koenvandenberge commented 3 years ago

Following points were discussed in the meeting (feel free to correct and add where needed).

How to test whether stacking works.

michalk8 commented 3 years ago

Hi @koenvandenberge ,

sorry for late reply, I've tried the above mentioned stacking approach on simulated data (as well as weights from CellRank), but it seems that mgcv doesn't use the obs. weights as the #observations (the weights sum to #obs, as mentioned above), assuming I di the things correcly (see the notebook below) - I've looked at the mean variance-covariance matrix and it decreases with increasing number of stack. This is also supported by the CI, which with increasing #stacks become smaller. smoothers.ipynb.txt

Marius1311 commented 3 years ago

@koenvandenberge, did you have a chance to look at this yet?

koenvandenberge commented 3 years ago

Thanks so much for taking a look at this, @michalk8. This is indeed what I was afraid might happen.

We've been talking a bit about the assignments and it would be good if we could play around with an example dataset where you think this would be useful. Could you therefore please share with us a fitted CellRank trajectory and the accompanying data? It would be great if we could use that to explore how the weights are behaving.

Marius1311 commented 3 years ago

Hi @koenvandenberge, thanks for getting back to us. You can get an example of CellRank's weights by running our basic tutorial: https://cellrank.readthedocs.io/en/stable/cellrank_basics.html

The data is included in cellrank.datasets and is downloaded automatically.

michalk8 commented 3 years ago

Could you therefore please share with us a fitted CellRank trajectory and the accompanying data?

I will share with you on Monday the notebook I've used to get the CR data when testing the smoothers (though as Marius mentioned, it should be pretty straight-forward, I also just ran the tutorial + exported to csvs).

michalk8 commented 3 years ago

Sorry for late reply, here's the notebook: cellrank_export.ipynb.txt

koenvandenberge commented 2 years ago

Hi @Marius1311 and @michalk8,

A PhD student in our group, @ViktorVerstraelen, has been looking into this in more depth. We first tried using the CellRank weights as is in a default tradeSeq pipeline. Since some lineages have very low CellRank weights overall, this leads to few cells being assigned to that lineage, and thus a higher uncertainty of the mean expression smoother. This will lead to fewer genes detected in that lineage as compared to, e.g., using weights from slingshot.

He is now also looking into whether we can get soft-assignment working in mgcv. We believe that it should be theoretically possible but the practical implementation is not yet clear to us. Even if we can get this working, I do not expect the point above to change qualitatively.

Best, Koen

Marius1311 commented 2 years ago

Hi @koenvandenberge! Thanks for getting back to me. The point about early cells having very low weights for particular trajectories (e.g. Delta for pancreas) is not really inherent to CellRank, it's inherent to the Velocity kernel, i.e. that's really a feature of RNA velocity. With other CellRank kernels, i.e. with the Pseudotime kernel, I expect this to be less of an issue.

koenvandenberge commented 2 years ago

I see, thanks for bringing that up. If I understand correctly, the Pseudotime kernel will not use the RNA velocity information and relies mainly on the distances in the low-dimensional embedding. In that case, I think tradeSeq can be applied directly downstream of CellRank and all we need is interoperability to make this a smooth pipeline. Would you agree?

Marius1311 commented 2 years ago

Yes! The PseudotimeKernel is inspired by Palantir (Setty et al., Nature Biotech 2019), e.g. it combines a KNN graph with a pseudotime. However, it uses a different, more stable weighting scheme than Palantir and it allows the user to input any pseudotime, e.g. Monocle, Slingshot, DPT, etc. There's no RNA velocity information used here

There's also the CytoTRACE kernel, which does not use RNA velocity information, and there is the external StationaryOT kernel, which also does not use RNA velocity information. The nice thing: all of these kernels give the same output, i.e. you can compute fate probabilities based on any of them, so we have a consistent pipeline while allowing flexibility in the individual components.

Marius1311 commented 2 years ago

Hi @koenvandenberge ,

sorry for late reply, I've tried the above mentioned stacking approach on simulated data (as well as weights from CellRank), but it seems that mgcv doesn't use the obs. weights as the #observations (the weights sum to #obs, as mentioned above), assuming I di the things correcly (see the notebook below) - I've looked at the mean variance-covariance matrix and it decreases with increasing number of stack. This is also supported by the CI, which with increasing #stacks become smaller. smoothers.ipynb.txt

Is this still a problem? Or did we find a solution to this? Should we reach out to someone from mgcv to help us with this?

WeilerP commented 2 years ago

@koenvandenberge, I am encountering the same issue as described in #73 when running a version of the CellRank vignette. The progress of the fitGAM reaches 100% and then throws the error

Error in UseMethod("predict") : 
  no applicable method for 'predict' applied to an object of class "NULL"
Calls: fitGAM ... fitGAM -> fitGAM -> .local -> .fitGAM -> <Anonymous>

The following warnings are issued as well

In addition: Warning messages:
1: In .findKnots(nknots, pseudotime, wSamp) :
  Impossible to place a knot at all endpoints.Increase the number of knots to avoid this issue.
2: In min(which(!is.na(SigmaAll))) :
  no non-missing arguments to min; returning Inf
Execution halted
Code ```R library(BiocParallel) library(Matrix) library(SingleCellExperiment) library(tradeSeq) library(zellkonverter) print("Loading data.") sce <- zellkonverter::readH5AD("tradeseq_input.h5ad") sce cat("\nPrepare data for gene trend analysis with tradeSeq\n") # Count matrix X <- assays(sce)$X # cell weights cell_weights <- reducedDims(sce)$terminal_states_memberships colnames(cell_weights) <- c("Alpha", "Beta", "Epsilon", "Delta") # pseudotime pseudotime <- colData(sce)$dpt_pseudotime pseudotime <- cbind(pseudotime, pseudotime, pseudotime, pseudotime) colnames(pseudotime) <- colnames(cell_weights) cat("\nRun tradeSeq\n") sceGAM <- fitGAM( X, pseudotime=pseudotime, cellWeights=cell_weights, nknots=6, verbose=TRUE, parallel=TRUE, BPPARAM=SnowParam(workers = 25), sce=TRUE, ) ```
Session info ``` R version 4.1.2 (2021-11-01) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Debian GNU/Linux 11 (bullseye) Matrix products: default BLAS: /opt/R/lib/R/lib/libRblas.so LAPACK: /opt/R/lib/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] zellkonverter_1.4.0 tradeSeq_1.8.0 BiocParallel_1.28.2 loaded via a namespace (and not attached): [1] slingshot_2.2.0 Rcpp_1.0.7 [3] locfit_1.5-9.4 dir.expiry_1.2.0 [5] lattice_0.20-45 png_0.1-7 [7] SingleCellExperiment_1.16.0 utf8_1.2.2 [9] R6_2.5.1 GenomeInfoDb_1.30.0 [11] stats4_4.1.2 princurve_2.1.6 [13] ggplot2_3.3.5 pillar_1.6.4 [15] basilisk_1.6.0 zlibbioc_1.40.0 [17] rlang_0.4.12 S4Vectors_0.32.3 [19] Matrix_1.3-4 reticulate_1.24 [21] splines_4.1.2 igraph_1.2.9 [23] RCurl_1.98-1.5 munsell_0.5.0 [25] DelayedArray_0.20.0 compiler_4.1.2 [27] pkgconfig_2.0.3 TrajectoryUtils_1.2.0 [29] BiocGenerics_0.40.0 mgcv_1.8-38 [31] tidyselect_1.1.1 SummarizedExperiment_1.24.0 [33] tibble_3.1.6 gridExtra_2.3 [35] GenomeInfoDbData_1.2.7 edgeR_3.36.0 [37] IRanges_2.28.0 matrixStats_0.61.0 [39] fansi_0.5.0 viridisLite_0.4.0 [41] crayon_1.4.2 dplyr_1.0.7 [43] bitops_1.0-7 basilisk.utils_1.6.0 [45] grid_4.1.2 jsonlite_1.7.2 [47] nlme_3.1-153 gtable_0.3.0 [49] lifecycle_1.0.1 magrittr_2.0.1 [51] scales_1.1.1 pbapply_1.5-0 [53] XVector_0.34.0 viridis_0.6.2 [55] limma_3.50.0 ellipsis_0.3.2 [57] filelock_1.0.2 generics_0.1.1 [59] vctrs_0.3.8 RColorBrewer_1.1-2 [61] tools_4.1.2 Biobase_2.54.0 [63] glue_1.5.0 purrr_0.3.4 [65] MatrixGenerics_1.6.0 parallel_4.1.2 [67] colorspace_2.0-2 GenomicRanges_1.46.1 ```

This worked previously using R 3.5 and tradeSeq==1.4. Compared to your vignette, I am running tradeSeq on the raw counts (I simply ran the CellRank pipeline and stored the pseudotime and terminal state membership in the AnnData object containing the raw counts). Do you have any ideas/suggestions on how to resolve this issue? Is there a (easy) way to convert the pseudotime and cell weights into a slingshot object?

koenvandenberge commented 2 years ago

Hi @WeilerP,

Thanks for letting us know. Do you mind sharing the tradeseq_input.h5ad file so we can take a look?

WeilerP commented 2 years ago

Ah yes, sorry, forgot to add it. I guess this should work: tradeseq_input.h5ad.zip.

WeilerP commented 2 years ago

Thanks for letting us know. Do you mind sharing the tradeseq_input.h5ad file so we can take a look?

@koenvandenberge, any update on this?

koenvandenberge commented 2 years ago

Hi @WeilerP,

I have not been able to fix this yet, but I can already say that the problem seems to be related to the parallelization; if I set parallel=FALSE, it runs without issues for me.

koenvandenberge commented 2 years ago

Hi @WeilerP

My apologies for taking so long in getting back to you. On my end, it seems to be related to the SnowParam specification of the parallelization. When I run the code below, it returns the error you state above.

sceGAM <- fitGAM(
  X[1:40,], 
  pseudotime=pseudotime,
  cellWeights=cell_weights,
  nknots=6,
  verbose=TRUE,
  parallel=TRUE,
  BPPARAM=SnowParam(workers = 2),
  sce=TRUE)

However, specifying parallelization using MulticoreParam works, for me, as below

sceGAM <- fitGAM(
  X[1:40,], 
  pseudotime=pseudotime,
  cellWeights=cell_weights,
  nknots=6,
  verbose=TRUE,
  parallel=TRUE,
  BPPARAM=MulticoreParam(workers=2),
  sce=TRUE)

Does this also work for you?

WeilerP commented 2 years ago

@koenvandenberge, yes, thank you, works like a charm!

Leads me to another question, though: How can you use the plotGeneCount function in this context? It requires the argument curve which is either an SlingshotDataSet, SingleCellExperiment or CellDataset. I'm constantly running into the "Impossible to place a knot at all endpoints.Increase the number of knots to avoid this issue." warning and would like to visualize where the knots are placed. This would also be relevant later on when running the earlyDETest at a specific knot (e.g. close to a branching point) as show here, for example.

koenvandenberge commented 2 years ago

Hi @WeilerP,

That's a good point and there is currently indeed no way to use plotGeneCount without a Slingshot output. We will look into this. As a quick current workaround, you should be able to extract the knot points using metadata(sceGAM)$tradeSeq$knots, where sceGAM is the SCE after running fitGAM. You could then add them to your plot of the trajectory. The knots are the same for all lineages.

flde commented 1 year ago

Hello,

many thanks for the open discussion! I just used the cellrank (velocity+connect kernel and cytotrace pseudotime) results with tradeseq. In the soft-asignment code above from @WeilerP you used the terminal_states_memberships as cellWeights for the fitGAM. Intuitively and from the cellrank GAM API I would have used the absorption_probabilities. Which one would be correct?

Many thanks and happy holidays, Florian

Marius1311 commented 1 year ago

Hi @flde, thanks for pointing this out - @WeilerP, I think we should use the absorption_probabilities and not the terminal_state_memberships, the latter don't really have a trajectory interpretation.