CellRank interface - Githubissues

Marius1311 commented 3 years ago

Hi TradeSeq team! I think you guys developed an awesome tool, and we were discussing recently whether and if we could interface to it from CellRank. CellRank defines trajectories through a soft assignment of cells to branches and one global pseudotime. So the idea would be to construct an interface that would allow users to seamlessly transition from CellRank to tradeSeq, once the trajectories have been defined. What do you think of this idea, and what data structures would you require on your end for such an interface? Looking fwd to your thoughts!

HectorRDB commented 3 years ago

Hi @Marius1311. Thanks for the idea, I think it would indeed be useful to users.
The main tradeSeq functions is fitGam, which fit the smoothers for each gene. It accepts several input formats. The most useful would be the following:

the counts matrix, with the counts argument
a matrix of weights, with one row per cell and one column per lineage, with the cellWeights argument. It looks like this is what CellRank is already providing.
a matrix of pseudotime, with one row per cell and one column per lineage, with the pseudotime argument. This will just have to be a concatenation of the global pseudotime vector into the appropriate matrix size.

Then, fitGam outputs a singleCellExperiment Object with the smoother information stored in the rowData slots.

Adding a direct convertor to the package would require us to list CellRank as a dependency and that would mean having python and so on, which is kind of a nightmare with bioconductor.

However, we can easily write a converter function that would live in your package and produces all the input needed for tradeSeq. We can also (or instead) write a small vignette together on how to use tradeSeq downstream of CellRank.

Anyway, it sounds like a useful idea!! Thanks for reaching out

Marius1311 commented 3 years ago

Hi @HectorRDB, thanks for getting back to me! I'm channeling in @michalk8 for the technical details of constructing the interface. @michalk8, how would you construct the interface?

@HectorRDB, how are cellWeights used internally? Do you just assign every cell to it's argmax lineage, or are the weights taken into account when fitting the GAMs?

Writing a small vignette/tutorial together would be a great idea, I'm all in for that. Let's first discuss the interface, and then we move to that!

HectorRDB commented 3 years ago

Hi @Marius1311 The cellWeights are first normalized to sum to 1 and then cells are randomly assigned to one lineage according to a multinomial distribution with the normalized weights acting as event probabilities (see here for the relevant piece of code).

As an aside, we also tried a version where cells were assigned to all lineages, with the cellWeights acting as observation weights. However, it lead to nearly identical fits and p-values when testing for DE and it made the smoothers less interpretable.

I agree with the plan, let's weight for your college's input.

michalk8 commented 3 years ago

Writing a small vignette/tutorial together would be a great idea, I'm all in for that. Let's first discuss the interface, and then we move to that!

The interface really depends on how we want to use tradeSeq's GAM (i.e. either only for significance testing, or whether we want to use the model in our functions, such as cellrank.pl.cluster_fates). If only the former is wanted, than, simply wrapping these with rpy2 in cellrank.external is not an issue and is fairly straightforward. On the other hand, if we'd like to have a tradeSeq model in cellrank.external.models conforming to our API, it's a bit more tricky, but I already have an idea. Is there a way to extract the model predictions, as well as the confidence intervals. predictCells seems to return only the former (the latter is only necessary for plotting).

Marius1311 commented 3 years ago

@HectorRDB, that's surprises me, because we do just that in CellRank (we fit GAMs to visualize gene expression trends along lineages, passing lineage-probabilities as cell-level weights into the loss function) , see below for some example fits from our preprint. I wonder why that didn't work in your case - I think softly assigning cells to lineages via cell-level weights makes more sense because if you just sample cells, you throw away some of the information you have, don't you?

Here's the relevant part of CellRank's API: https://cellrank.readthedocs.io/en/stable/api/cellrank.ul.models.GAMR.html#cellrank.ul.models.GAMR

koenvandenberge commented 3 years ago

Thank you for these suggestions, @Marius1311. It would indeed be useful and relevant to see if we can try to streamline a workflow from CellRank to tradeSeq and we'll be glad to help.

As @HectorRDB is mentioning we have indeed also considered soft assignment of cells to lineages. We considered two options:

Using mgcv's by statement in mgcv::s, which also leads to good fits and as you say might be prone to keeping more information. However, it seems to become more complex for correct statistical inference as soon as there are >= 2 lineages. If we're working with a trajectory with, say, two lineages and a long common trajectory before they're bifurcating, then the estimated smoother of the two lineages before the bifurcation point will use each cell in that space for about 50%. As we get closer to the bifurcation and for a while after it, these weights start evolving to binary 0/1 weights. So it seems to me there is an association between the weights and the pseudotime, which we would have to model (on top of the gene expression) in order to derive the average smoother for each lineage, and therefore in order to provide inference. Practically, in order to derive the estimated average for lineage 1 at some pseudotime, you'll also need the average weight at that pseudotime. Since fits looked similar to assigning cells using the Multinomial, we decided to forego this additional complexity. Of course, this all depends on the weights. If CellRank or another method has weights that are more involved (as in, may reliably assign cells to lineages even before a bifurcation), the soft assignment could be more worthwhile, I think.
- We also considered using observation-level weights, which seems to be more similar as what you are doing. In our case when working with mgcv, however, this required us to 'stack' the data with the number of stacks equaling the number of lineages. In the two-lineage case, we'd have one set of rows with observation-level weights equaling P(lineage 1), and then another set of rows with observation-level weights equaling P(lineage 2). Since the number of rows in our dataset has then doubled we'd have to really carefully check the calculations being done in mgcv for e.g. the variance-covariance matrix.

Marius1311 commented 3 years ago

Thanks @koenvandenberge! @michalk8, how exactly do we use the weights in mgcv?

michalk8 commented 3 years ago

Thanks @koenvandenberge! @michalk8, how exactly do we use the weights in mgcv?

By default, we use this formula y ~ s(x, bs='cr', k=5, sp=1.0).

Marius1311 commented 3 years ago

So where do the weights come in @michalk8? Is it the first or the second possibility that @koenvandenberge describes?

michalk8 commented 3 years ago

So where do the weights come in @michalk8? Is it the first or the second possibility that @koenvandenberge describes?

The second, we're passing it as weights in mgcv::gam.

Marius1311 commented 3 years ago

Hi @koenvandenberge, should we have a zoom chat about this sometimes? Seems easier to me.

koenvandenberge commented 3 years ago

Sounds good to me.

This week, I believe that both @HectorRDB and I are on Central European Time (but it's possible @HectorRDB is taking some time off). As of next week I will be in California and so it might be a little bit harder to find a good time to meet.

This week I can do

Wednesday between 10AM-12PM and in the afternoon after 2PM.
Thursday I can do 4PM-6PM.
Friday is completely free.

Marius1311 commented 3 years ago

@michalk8, would Friday morning work for you?

koenvandenberge commented 3 years ago

Following points were discussed in the meeting (feel free to correct and add where needed).

CellRank uses RNA velocity information of each cell to probabilistically determine its most likely future (among its k-nearest neighbors). By propagating these velocities, "fate probabilities" are derived for each cell, which may be interpreted as the probability that a cell will develop to each of the respective terminal states of the trajectory.
In order to avoid bias of terminal states being much more abundant (in terms of number of cells), for each terminal state 30 cells are being sampled to represent that state.
It may still happen, however, that all cells before a furcation point have low fate probabilities (magnitude of ~ 5%) for a particular terminal state.
This seems like it could be troublesome for the Multinomial sampling to hard-assign cells to lineages in tradeSeq (or really for either way of fitting the GAM).
Still, the hope would be that the use of RNA velocity information could be a more informative way to assign cells to lineages prior to a furcation point. We should look into whether these probabilities differ significantly across cells before that furcation point, or whether all cells before that furcation point have similar fate probabilities.
If the RNA velocity information is indeed useful for this purpose (i.e., cells have significantly different fate probabilities prior to furcation), we may want to make better use of the soft-assignments provided by CellRank. The stacking approach suggested above could be tested.
One analysis to be tried out is if we can find TFs or genes that drive the furcation. As of yet, we have had little hope to do so as the cell-to-lineage assignment prior to the furcation has essentially been quite arbitrary.

How to test whether stacking works.

One concern is that the number of data points increases if stacking the data for each gene (hence the variance-covariance matrix will be too optimistic). We do not want this to happen and would want that mgcv really uses the sum of the weights as the true number of data points.
We may be able to test this by e.g. simulating from one smoother, and stacking the data repeatedly but with very low weights for the newly stacked data (e.g. w=0.99 for 'original' and w=0.01 for stacked data points). Doing this repeatedly as in increasing the stack, while keeping the sum of the weights equal to the number of original data points and checking whether the estimated variance of the smoother mean drops or not could be useful.
Other concerns may also be possible.

michalk8 commented 3 years ago

Hi @koenvandenberge ,

sorry for late reply, I've tried the above mentioned stacking approach on simulated data (as well as weights from CellRank), but it seems that mgcv doesn't use the obs. weights as the #observations (the weights sum to #obs, as mentioned above), assuming I di the things correcly (see the notebook below) - I've looked at the mean variance-covariance matrix and it decreases with increasing number of stack. This is also supported by the CI, which with increasing #stacks become smaller. smoothers.ipynb.txt

Marius1311 commented 3 years ago

@koenvandenberge, did you have a chance to look at this yet?

koenvandenberge commented 3 years ago

Thanks so much for taking a look at this, @michalk8. This is indeed what I was afraid might happen.

We've been talking a bit about the assignments and it would be good if we could play around with an example dataset where you think this would be useful. Could you therefore please share with us a fitted CellRank trajectory and the accompanying data? It would be great if we could use that to explore how the weights are behaving.

Marius1311 commented 3 years ago

Hi @koenvandenberge, thanks for getting back to us. You can get an example of CellRank's weights by running our basic tutorial: https://cellrank.readthedocs.io/en/stable/cellrank_basics.html

The data is included in cellrank.datasets and is downloaded automatically.

michalk8 commented 3 years ago

Could you therefore please share with us a fitted CellRank trajectory and the accompanying data?

I will share with you on Monday the notebook I've used to get the CR data when testing the smoothers (though as Marius mentioned, it should be pretty straight-forward, I also just ran the tutorial + exported to csvs).

michalk8 commented 3 years ago

Sorry for late reply, here's the notebook: cellrank_export.ipynb.txt

koenvandenberge commented 2 years ago

Hi @Marius1311 and @michalk8,

A PhD student in our group, @ViktorVerstraelen, has been looking into this in more depth. We first tried using the CellRank weights as is in a default tradeSeq pipeline. Since some lineages have very low CellRank weights overall, this leads to few cells being assigned to that lineage, and thus a higher uncertainty of the mean expression smoother. This will lead to fewer genes detected in that lineage as compared to, e.g., using weights from slingshot.

He is now also looking into whether we can get soft-assignment working in mgcv. We believe that it should be theoretically possible but the practical implementation is not yet clear to us. Even if we can get this working, I do not expect the point above to change qualitatively.

Best, Koen

Marius1311 commented 2 years ago

Hi @koenvandenberge! Thanks for getting back to me. The point about early cells having very low weights for particular trajectories (e.g. Delta for pancreas) is not really inherent to CellRank, it's inherent to the Velocity kernel, i.e. that's really a feature of RNA velocity. With other CellRank kernels, i.e. with the Pseudotime kernel, I expect this to be less of an issue.

koenvandenberge commented 2 years ago

I see, thanks for bringing that up. If I understand correctly, the Pseudotime kernel will not use the RNA velocity information and relies mainly on the distances in the low-dimensional embedding. In that case, I think tradeSeq can be applied directly downstream of CellRank and all we need is interoperability to make this a smooth pipeline. Would you agree?

Marius1311 commented 2 years ago

Yes! The PseudotimeKernel is inspired by Palantir (Setty et al., Nature Biotech 2019), e.g. it combines a KNN graph with a pseudotime. However, it uses a different, more stable weighting scheme than Palantir and it allows the user to input any pseudotime, e.g. Monocle, Slingshot, DPT, etc. There's no RNA velocity information used here

There's also the CytoTRACE kernel, which does not use RNA velocity information, and there is the external StationaryOT kernel, which also does not use RNA velocity information. The nice thing: all of these kernels give the same output, i.e. you can compute fate probabilities based on any of them, so we have a consistent pipeline while allowing flexibility in the individual components.

Marius1311 commented 2 years ago

Hi @koenvandenberge ,

sorry for late reply, I've tried the above mentioned stacking approach on simulated data (as well as weights from CellRank), but it seems that mgcv doesn't use the obs. weights as the #observations (the weights sum to #obs, as mentioned above), assuming I di the things correcly (see the notebook below) - I've looked at the mean variance-covariance matrix and it decreases with increasing number of stack. This is also supported by the CI, which with increasing #stacks become smaller. smoothers.ipynb.txt

Is this still a problem? Or did we find a solution to this? Should we reach out to someone from mgcv to help us with this?

WeilerP commented 2 years ago

@koenvandenberge, I am encountering the same issue as described in #73 when running a version of the CellRank vignette. The progress of the fitGAM reaches 100% and then throws the error

Error in UseMethod("predict") : 
  no applicable method for 'predict' applied to an object of class "NULL"
Calls: fitGAM ... fitGAM -> fitGAM -> .local -> .fitGAM -> <Anonymous>

The following warnings are issued as well

In addition: Warning messages:
1: In .findKnots(nknots, pseudotime, wSamp) :
  Impossible to place a knot at all endpoints.Increase the number of knots to avoid this issue.
2: In min(which(!is.na(SigmaAll))) :
  no non-missing arguments to min; returning Inf
Execution halted

Code

```R library(BiocParallel) library(Matrix) library(SingleCellExperiment) library(tradeSeq) library(zellkonverter) print("Loading data.") sce <- zellkonverter::readH5AD("tradeseq_input.h5ad") sce cat("\nPrepare data for gene trend analysis with tradeSeq\n") # Count matrix X <- assays(sce)$X # cell weights cell_weights <- reducedDims(sce)$terminal_states_memberships colnames(cell_weights) <- c("Alpha", "Beta", "Epsilon", "Delta") # pseudotime pseudotime <- colData(sce)$dpt_pseudotime pseudotime <- cbind(pseudotime, pseudotime, pseudotime, pseudotime) colnames(pseudotime) <- colnames(cell_weights) cat("\nRun tradeSeq\n") sceGAM <- fitGAM( X, pseudotime=pseudotime, cellWeights=cell_weights, nknots=6, verbose=TRUE, parallel=TRUE, BPPARAM=SnowParam(workers = 25), sce=TRUE, ) ```

Session info

``` R version 4.1.2 (2021-11-01) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Debian GNU/Linux 11 (bullseye) Matrix products: default BLAS: /opt/R/lib/R/lib/libRblas.so LAPACK: /opt/R/lib/R/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] zellkonverter_1.4.0 tradeSeq_1.8.0 BiocParallel_1.28.2 loaded via a namespace (and not attached): [1] slingshot_2.2.0 Rcpp_1.0.7 [3] locfit_1.5-9.4 dir.expiry_1.2.0 [5] lattice_0.20-45 png_0.1-7 [7] SingleCellExperiment_1.16.0 utf8_1.2.2 [9] R6_2.5.1 GenomeInfoDb_1.30.0 [11] stats4_4.1.2 princurve_2.1.6 [13] ggplot2_3.3.5 pillar_1.6.4 [15] basilisk_1.6.0 zlibbioc_1.40.0 [17] rlang_0.4.12 S4Vectors_0.32.3 [19] Matrix_1.3-4 reticulate_1.24 [21] splines_4.1.2 igraph_1.2.9 [23] RCurl_1.98-1.5 munsell_0.5.0 [25] DelayedArray_0.20.0 compiler_4.1.2 [27] pkgconfig_2.0.3 TrajectoryUtils_1.2.0 [29] BiocGenerics_0.40.0 mgcv_1.8-38 [31] tidyselect_1.1.1 SummarizedExperiment_1.24.0 [33] tibble_3.1.6 gridExtra_2.3 [35] GenomeInfoDbData_1.2.7 edgeR_3.36.0 [37] IRanges_2.28.0 matrixStats_0.61.0 [39] fansi_0.5.0 viridisLite_0.4.0 [41] crayon_1.4.2 dplyr_1.0.7 [43] bitops_1.0-7 basilisk.utils_1.6.0 [45] grid_4.1.2 jsonlite_1.7.2 [47] nlme_3.1-153 gtable_0.3.0 [49] lifecycle_1.0.1 magrittr_2.0.1 [51] scales_1.1.1 pbapply_1.5-0 [53] XVector_0.34.0 viridis_0.6.2 [55] limma_3.50.0 ellipsis_0.3.2 [57] filelock_1.0.2 generics_0.1.1 [59] vctrs_0.3.8 RColorBrewer_1.1-2 [61] tools_4.1.2 Biobase_2.54.0 [63] glue_1.5.0 purrr_0.3.4 [65] MatrixGenerics_1.6.0 parallel_4.1.2 [67] colorspace_2.0-2 GenomicRanges_1.46.1 ```

This worked previously using R 3.5 and tradeSeq==1.4. Compared to your vignette, I am running tradeSeq on the raw counts (I simply ran the CellRank pipeline and stored the pseudotime and terminal state membership in the AnnData object containing the raw counts). Do you have any ideas/suggestions on how to resolve this issue? Is there a (easy) way to convert the pseudotime and cell weights into a slingshot object?

koenvandenberge commented 2 years ago

Hi @WeilerP,

Thanks for letting us know. Do you mind sharing the tradeseq_input.h5ad file so we can take a look?

WeilerP commented 2 years ago

Ah yes, sorry, forgot to add it. I guess this should work: tradeseq_input.h5ad.zip.

WeilerP commented 2 years ago

Thanks for letting us know. Do you mind sharing the tradeseq_input.h5ad file so we can take a look?

@koenvandenberge, any update on this?

koenvandenberge commented 2 years ago

Hi @WeilerP,

I have not been able to fix this yet, but I can already say that the problem seems to be related to the parallelization; if I set parallel=FALSE, it runs without issues for me.

koenvandenberge commented 2 years ago

Hi @WeilerP

My apologies for taking so long in getting back to you. On my end, it seems to be related to the SnowParam specification of the parallelization. When I run the code below, it returns the error you state above.

sceGAM <- fitGAM(
  X[1:40,], 
  pseudotime=pseudotime,
  cellWeights=cell_weights,
  nknots=6,
  verbose=TRUE,
  parallel=TRUE,
  BPPARAM=SnowParam(workers = 2),
  sce=TRUE)

However, specifying parallelization using MulticoreParam works, for me, as below

sceGAM <- fitGAM(
  X[1:40,], 
  pseudotime=pseudotime,
  cellWeights=cell_weights,
  nknots=6,
  verbose=TRUE,
  parallel=TRUE,
  BPPARAM=MulticoreParam(workers=2),
  sce=TRUE)

Does this also work for you?

WeilerP commented 2 years ago

@koenvandenberge, yes, thank you, works like a charm!

Leads me to another question, though: How can you use the plotGeneCount function in this context? It requires the argument curve which is either an SlingshotDataSet, SingleCellExperiment or CellDataset. I'm constantly running into the "Impossible to place a knot at all endpoints.Increase the number of knots to avoid this issue." warning and would like to visualize where the knots are placed. This would also be relevant later on when running the earlyDETest at a specific knot (e.g. close to a branching point) as show here, for example.

koenvandenberge commented 2 years ago

Hi @WeilerP,

That's a good point and there is currently indeed no way to use plotGeneCount without a Slingshot output. We will look into this. As a quick current workaround, you should be able to extract the knot points using metadata(sceGAM)$tradeSeq$knots, where sceGAM is the SCE after running fitGAM. You could then add them to your plot of the trajectory. The knots are the same for all lineages.

flde commented 1 year ago

Hello,

many thanks for the open discussion! I just used the cellrank (velocity+connect kernel and cytotrace pseudotime) results with tradeseq. In the soft-asignment code above from @WeilerP you used the terminal_states_memberships as cellWeights for the fitGAM. Intuitively and from the cellrank GAM API I would have used the absorption_probabilities. Which one would be correct?

Many thanks and happy holidays, Florian

Marius1311 commented 1 year ago

Hi @flde, thanks for pointing this out - @WeilerP, I think we should use the absorption_probabilities and not the terminal_state_memberships, the latter don't really have a trajectory interpretation.

statOmics / tradeSeq

CellRank interface #164