Open Marius1311 opened 3 years ago
Hi @Marius1311. Thanks for the idea, I think it would indeed be useful to users.
The main tradeSeq function is fitGAM, which fits the smoothers for each gene. It accepts several input formats. The most useful would be the following:
- The counts argument.
- The cellWeights argument. It looks like this is what CellRank is already providing.
- The pseudotime argument. This will just have to be a concatenation of the global pseudotime vector into the appropriate matrix size.
Then, fitGAM outputs a SingleCellExperiment object with the smoother information stored in the rowData slots.
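The pseudotime step described above can be sketched as follows. This is only a minimal Python illustration using plain lists (NumPy arrays would be more typical in practice). The function name `build_tradeseq_inputs` and the variable names are hypothetical and not part of CellRank or tradeSeq.

```python
def build_tradeseq_inputs(pseudotime, fate_probs):
    """Shape CellRank-style outputs into the two matrices fitGAM expects.

    pseudotime: one global pseudotime value per cell.
    fate_probs: one row per cell, one column per lineage.
    Returns (pseudotime_matrix, cell_weights), both cells x lineages.
    """
    if len(pseudotime) != len(fate_probs):
        raise ValueError("pseudotime and fate_probs must cover the same cells")
    n_lineages = len(fate_probs[0])
    if any(len(row) != n_lineages for row in fate_probs):
        raise ValueError("every cell needs one weight per lineage")

    # Repeat the single global pseudotime once per lineage column.
    pseudotime_matrix = [[t] * n_lineages for t in pseudotime]
    # The weights are passed through unchanged (copied to avoid aliasing).
    cell_weights = [list(row) for row in fate_probs]
    return pseudotime_matrix, cell_weights


# Hypothetical toy input: 3 cells, 2 lineages.
pseudotime = [0.0, 0.4, 1.0]
fate_probs = [[0.5, 0.5], [0.7, 0.3], [0.05, 0.95]]
pt_matrix, weights = build_tradeseq_inputs(pseudotime, fate_probs)
```

Both results would then need to be exported (for example to CSV) or passed through rpy2 before calling fitGAM on the R side.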
Adding a direct converter to the package would require us to list CellRank as a dependency, and that would mean having Python and so on, which is kind of a nightmare with Bioconductor.
However, we can easily write a converter function that would live in your package and produce all the input needed for tradeSeq. We can also (or instead) write a small vignette together on how to use tradeSeq downstream of CellRank.
Anyway, it sounds like a useful idea!! Thanks for reaching out
Hi @HectorRDB, thanks for getting back to me! I'm looping in @michalk8 for the technical details of constructing the interface. @michalk8, how would you construct the interface?
@HectorRDB, how are cellWeights used internally? Do you just assign every cell to its argmax lineage, or are the weights taken into account when fitting the GAMs?
Writing a small vignette/tutorial together would be a great idea, I'm all in for that. Let's first discuss the interface, and then we move to that!
Hi @Marius1311
The cellWeights are first normalized to sum to 1 and then cells are randomly assigned to one lineage according to a multinomial distribution with the normalized weights acting as event probabilities (see here for the relevant piece of code).
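The assignment scheme just described can be sketched in a few lines. This is a minimal Python illustration of the idea, not tradeSeq's actual R code. The function name `assign_lineages` and the fixed seed are assumptions made for the example.

```python
import random


def assign_lineages(cell_weights, seed=0):
    """Assign each cell to exactly one lineage by sampling.

    Each row of cell_weights is normalized to sum to 1 and then used as
    the event probabilities of a single multinomial draw.
    """
    rng = random.Random(seed)
    assignments = []
    for row in cell_weights:
        total = sum(row)
        if total <= 0:
            raise ValueError("each cell needs at least one positive weight")
        probs = [w / total for w in row]
        # One draw per cell: returns the index of the sampled lineage.
        lineage = rng.choices(range(len(row)), weights=probs, k=1)[0]
        assignments.append(lineage)
    return assignments


# Hypothetical toy weights: 3 cells, 2 lineages (rows need not sum to 1).
cell_weights = [[2.0, 2.0], [1.0, 0.0], [0.0, 3.0]]
assignments = assign_lineages(cell_weights)
```

A cell with equal weights can land in either lineage, while a cell with all of its weight on one lineage is always assigned to it.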
As an aside, we also tried a version where cells were assigned to all lineages, with the cellWeights acting as observation weights. However, it led to nearly identical fits and p-values when testing for DE, and it made the smoothers less interpretable.
I agree with the plan, let's wait for your colleague's input.
The interface really depends on how we want to use tradeSeq's GAM (i.e. either only for significance testing, or whether we want to use the model in our functions, such as cellrank.pl.cluster_fates).
If only the former is wanted, then simply wrapping these with rpy2 in cellrank.external is not an issue and is fairly straightforward.
On the other hand, if we'd like to have a tradeSeq model in cellrank.external.models conforming to our API, it's a bit more tricky, but I already have an idea. Is there a way to extract the model predictions as well as the confidence intervals? predictCells seems to return only the former (the latter is only necessary for plotting).
@HectorRDB, that surprises me, because we do just that in CellRank (we fit GAMs to visualize gene expression trends along lineages, passing lineage probabilities as cell-level weights into the loss function); see below for some example fits from our preprint. I wonder why that didn't work in your case. I think softly assigning cells to lineages via cell-level weights makes more sense, because if you just sample cells, you throw away some of the information you have, don't you?
Here's the relevant part of CellRank's API: https://cellrank.readthedocs.io/en/stable/api/cellrank.ul.models.GAMR.html#cellrank.ul.models.GAMR
Thank you for these suggestions, @Marius1311.
It would indeed be useful and relevant to see if we can try to streamline a workflow from CellRank to tradeSeq, and we'll be glad to help.
As @HectorRDB is mentioning, we have indeed also considered soft assignment of cells to lineages. We considered two options:
1. Using mgcv's by statement in mgcv::s, which also leads to good fits and, as you say, might retain more information. However, it seems to become more complex for correct statistical inference as soon as there are >= 2 lineages. If we're working with a trajectory with, say, two lineages and a long common trajectory before they bifurcate, then the estimated smoother of the two lineages before the bifurcation point will use each cell in that region with a weight of about 50%. As we get closer to the bifurcation, and for a while after it, these weights start evolving to binary 0/1 weights.
So it seems to me there is an association between the weights and the pseudotime, which we would have to model (on top of the gene expression) in order to derive the average smoother for each lineage, and therefore in order to provide inference. Practically, in order to derive the estimated average for lineage 1 at some pseudotime, you'll also need the average weight at that pseudotime. Since fits looked similar to assigning cells using the multinomial, we decided to forego this additional complexity.
Of course, this all depends on the weights. If CellRank or another method has weights that are more involved (as in, may reliably assign cells to lineages even before a bifurcation), the soft assignment could be more worthwhile, I think.
2. Using observation-level weights in mgcv. However, this required us to 'stack' the data, with the number of stacks equaling the number of lineages. In the two-lineage case, we'd have one set of rows with observation-level weights equaling P(lineage 1), and then another set of rows with observation-level weights equaling P(lineage 2). Since the number of rows in our dataset has then doubled, we'd have to really carefully check the calculations being done in mgcv for e.g. the variance-covariance matrix.
Thanks @koenvandenberge! @michalk8, how exactly do we use the weights in mgcv?
By default, we use this formula: y ~ s(x, bs='cr', k=5, sp=1.0).
So where do the weights come in @michalk8? Is it the first or the second possibility that @koenvandenberge describes?
The second: we're passing it as weights in mgcv::gam.
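To make the second option concrete, here is a minimal Python sketch of the stacking step described earlier in the thread. It only builds the stacked table and does not fit any model. The function name `stack_by_lineage` and the row layout are hypothetical, chosen for illustration.

```python
def stack_by_lineage(expression, pseudotime, fate_probs):
    """Repeat every cell once per lineage, weighted by P(lineage).

    Returns one dict per (lineage, cell) pair; the 'weight' column is what
    would be passed as observation-level weights to the GAM fit.
    """
    n_lineages = len(fate_probs[0])
    rows = []
    for lineage in range(n_lineages):
        for y, t, probs in zip(expression, pseudotime, fate_probs):
            rows.append(
                {"y": y, "pseudotime": t, "lineage": lineage, "weight": probs[lineage]}
            )
    return rows


# Hypothetical toy data: 3 cells, 2 lineages, one gene.
expression = [3, 0, 7]
pseudotime = [0.1, 0.5, 0.9]
fate_probs = [[0.5, 0.5], [0.8, 0.2], [0.0, 1.0]]

stacked = stack_by_lineage(expression, pseudotime, fate_probs)
total_weight = sum(row["weight"] for row in stacked)
```

With two lineages the table has twice as many rows as there are cells, while the weights still sum to the number of cells when each cell's probabilities sum to 1. That mismatch is exactly why the variance calculations need checking.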
Hi @koenvandenberge, should we have a Zoom chat about this sometime? Seems easier to me.
Sounds good to me.
This week, I believe that both @HectorRDB and I are on Central European Time (but it's possible @HectorRDB is taking some time off). As of next week I will be in California and so it might be a little bit harder to find a good time to meet.
This week I can do
@michalk8, would Friday morning work for you?
The following points were discussed in the meeting (feel free to correct and add where needed).
- CellRank uses RNA velocity information of each cell to probabilistically determine its most likely future (among its k-nearest neighbors). By propagating these velocities, "fate probabilities" are derived for each cell, which may be interpreted as the probability that a cell will develop to each of the respective terminal states of the trajectory.
- tradeSeq (or really for either way of fitting the GAM).
- CellRank. The stacking approach suggested above could be tested.
How to test whether stacking works:
- mgcv really uses the sum of the weights as the true number of data points.
- w=0.99 for 'original' and w=0.01 for stacked data points). Doing this repeatedly, as in increasing the stack while keeping the sum of the weights equal to the number of original data points, and checking whether the estimated variance of the smoother mean drops or not could be useful.
Hi @koenvandenberge,
sorry for the late reply. I've tried the above-mentioned stacking approach on simulated data (as well as weights from CellRank), but it seems that mgcv doesn't use the obs. weights as the #observations (the weights sum to #obs, as mentioned above), assuming I did things correctly (see the notebook below). I've looked at the mean variance-covariance matrix and it decreases with an increasing number of stacks. This is also supported by the CIs, which become smaller with increasing #stacks.
smoothers.ipynb.txt
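The stacking check discussed in the meeting could be set up along these lines. This minimal Python sketch covers the data construction only, with no model fit. The function name `stack_with_split_weights` and the even split of the remaining 0.01 weight across the extra copies are assumptions, not taken from the notebook.

```python
def stack_with_split_weights(y, x, n_stacks, original_weight=0.99):
    """Duplicate every data point n_stacks times with weights summing to 1.

    The first copy keeps original_weight; the remainder is split evenly
    over the other copies, so the total weight equals len(y).
    """
    if n_stacks < 2:
        raise ValueError("n_stacks must be at least 2")
    copy_weight = (1.0 - original_weight) / (n_stacks - 1)
    rows = []
    for yi, xi in zip(y, x):
        rows.append((yi, xi, original_weight))
        rows.extend((yi, xi, copy_weight) for _ in range(n_stacks - 1))
    return rows


# Hypothetical toy data: 3 observations.
y = [1.0, 2.0, 4.0]
x = [0.0, 0.5, 1.0]

summary = {}
for n_stacks in (2, 3, 5):
    rows = stack_with_split_weights(y, x, n_stacks)
    # Row count grows with the stack, total weight stays at len(y).
    summary[n_stacks] = (len(rows), sum(w for _, _, w in rows))
```

If the fit treated the sum of the weights as the number of observations, the estimated variance of the smoother should stay roughly constant across these stacked datasets. A variance that shrinks as the stack grows suggests the rows are being counted instead.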
@koenvandenberge, did you have a chance to look at this yet?
Thanks so much for taking a look at this, @michalk8. This is indeed what I was afraid might happen.
We've been talking a bit about the assignments and it would be good if we could play around with an example dataset where you think this would be useful. Could you therefore please share with us a fitted CellRank trajectory and the accompanying data?
It would be great if we could use that to explore how the weights are behaving.
Hi @koenvandenberge, thanks for getting back to us. You can get an example of CellRank's weights by running our basic tutorial: https://cellrank.readthedocs.io/en/stable/cellrank_basics.html
The data is included in cellrank.datasets and is downloaded automatically.
On Monday I will share with you the notebook I used to get the CR data when testing the smoothers (though, as Marius mentioned, it should be pretty straightforward; I also just ran the tutorial and exported to CSVs).
Sorry for late reply, here's the notebook: cellrank_export.ipynb.txt
Hi @Marius1311 and @michalk8,
A PhD student in our group, @ViktorVerstraelen, has been looking into this in more depth.
We first tried using the CellRank weights as is in a default tradeSeq pipeline. Since some lineages have very low CellRank weights overall, this leads to few cells being assigned to that lineage, and thus a higher uncertainty of the mean expression smoother.
This will lead to fewer genes detected in that lineage as compared to, e.g., using weights from slingshot.
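The effect of low overall weights can be quantified before running anything: under the multinomial assignment described earlier in the thread, the expected number of cells per lineage is the column sum of the row-normalized weights. The Python sketch below illustrates this. The function name `expected_cells_per_lineage` and the toy weights are made up for the example.

```python
def expected_cells_per_lineage(cell_weights):
    """Expected number of cells assigned to each lineage.

    Each cell contributes its normalized weight to every lineage, so the
    column sums give the expected lineage sizes under random assignment.
    """
    n_lineages = len(cell_weights[0])
    expected = [0.0] * n_lineages
    for row in cell_weights:
        total = sum(row)
        for lineage, weight in enumerate(row):
            expected[lineage] += weight / total
    return expected


# Hypothetical toy weights: 4 cells, the second lineage is rarely favoured.
cell_weights = [[0.9, 0.1], [0.95, 0.05], [0.8, 0.2], [0.85, 0.15]]
expected = expected_cells_per_lineage(cell_weights)
```

Here the second lineage is expected to receive only about half a cell out of four, which is the kind of imbalance that would inflate the uncertainty of its smoother.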
He is now also looking into whether we can get soft assignment working in mgcv. We believe that it should be theoretically possible, but the practical implementation is not yet clear to us.
Even if we can get this working, I do not expect the point above to change qualitatively.
Best, Koen
Hi @koenvandenberge! Thanks for getting back to me. The point about early cells having very low weights for particular trajectories (e.g. Delta for pancreas) is not really inherent to CellRank, it's inherent to the Velocity kernel, i.e. that's really a feature of RNA velocity. With other CellRank kernels, i.e. with the Pseudotime kernel, I expect this to be less of an issue.
I see, thanks for bringing that up.
If I understand correctly, the Pseudotime kernel will not use the RNA velocity information and relies mainly on the distances in the low-dimensional embedding. In that case, I think tradeSeq can be applied directly downstream of CellRank and all we need is interoperability to make this a smooth pipeline. Would you agree?
Yes! The PseudotimeKernel is inspired by Palantir (Setty et al., Nature Biotech 2019), i.e. it combines a KNN graph with a pseudotime. However, it uses a different, more stable weighting scheme than Palantir and it allows the user to input any pseudotime, e.g. Monocle, Slingshot, DPT, etc. There's no RNA velocity information used here.
There's also the CytoTRACE kernel, which does not use RNA velocity information, and there is the external StationaryOT kernel, which also does not use RNA velocity information. The nice thing: all of these kernels give the same output, i.e. you can compute fate probabilities based on any of them, so we have a consistent pipeline while allowing flexibility in the individual components.
Is this still a problem? Or did we find a solution to this? Should we reach out to someone from mgcv to help us with this?
@koenvandenberge, I am encountering the same issue as described in #73 when running a version of the CellRank vignette. The progress of fitGAM reaches 100% and then it throws the error
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "NULL"
Calls: fitGAM ... fitGAM -> fitGAM -> .local -> .fitGAM -> <Anonymous>
The following warnings are issued as well
In addition: Warning messages:
1: In .findKnots(nknots, pseudotime, wSamp) :
Impossible to place a knot at all endpoints.Increase the number of knots to avoid this issue.
2: In min(which(!is.na(SigmaAll))) :
no non-missing arguments to min; returning Inf
Execution halted
This worked previously using R 3.5 and tradeSeq==1.4. Compared to your vignette, I am running tradeSeq on the raw counts (I simply ran the CellRank pipeline and stored the pseudotime and terminal state membership in the AnnData object containing the raw counts). Do you have any ideas/suggestions on how to resolve this issue? Is there an (easy) way to convert the pseudotime and cell weights into a slingshot object?
Hi @WeilerP,
Thanks for letting us know. Do you mind sharing the tradeseq_input.h5ad file so we can take a look?
Ah yes, sorry, forgot to add it. I guess this should work: tradeseq_input.h5ad.zip.
@koenvandenberge, any update on this?
Hi @WeilerP,
I have not been able to fix this yet, but I can already say that the problem seems to be related to the parallelization; if I set parallel=FALSE, it runs without issues for me.
Hi @WeilerP
My apologies for taking so long in getting back to you.
On my end, it seems to be related to the SnowParam specification of the parallelization. When I run the code below, it returns the error you state above.

    library(tradeSeq)
    library(BiocParallel)  # provides SnowParam and MulticoreParam

    # Fails with the 'predict' error when using a SnowParam backend
    sceGAM <- fitGAM(
      X[1:40, ],
      pseudotime = pseudotime,
      cellWeights = cell_weights,
      nknots = 6,
      verbose = TRUE,
      parallel = TRUE,
      BPPARAM = SnowParam(workers = 2),
      sce = TRUE
    )

However, specifying parallelization using MulticoreParam works for me, as below:

    # Same call with a MulticoreParam backend runs without the error
    sceGAM <- fitGAM(
      X[1:40, ],
      pseudotime = pseudotime,
      cellWeights = cell_weights,
      nknots = 6,
      verbose = TRUE,
      parallel = TRUE,
      BPPARAM = MulticoreParam(workers = 2),
      sce = TRUE
    )

Does this also work for you?
@koenvandenberge, yes, thank you, works like a charm!
Leads me to another question, though: how can you use the plotGeneCount function in this context? It requires the argument curve, which is either a SlingshotDataSet, SingleCellExperiment or CellDataset. I'm constantly running into the "Impossible to place a knot at all endpoints.Increase the number of knots to avoid this issue." warning and would like to visualize where the knots are placed. This would also be relevant later on when running the earlyDETest at a specific knot (e.g. close to a branching point), as shown here, for example.
Hi @WeilerP,
That's a good point and there is currently indeed no way to use plotGeneCount without a Slingshot output. We will look into this.
As a quick current workaround, you should be able to extract the knot points using metadata(sceGAM)$tradeSeq$knots, where sceGAM is the SCE after running fitGAM. You could then add them to your plot of the trajectory. The knots are the same for all lineages.
Hello,
many thanks for the open discussion! I just used the CellRank (velocity+connect kernel and CytoTRACE pseudotime) results with tradeSeq. In the soft-assignment code above from @WeilerP, you used the terminal_states_memberships as cellWeights for fitGAM. Intuitively, and from the CellRank GAM API, I would have used the absorption_probabilities. Which one would be correct?
Many thanks and happy holidays, Florian
Hi @flde, thanks for pointing this out - @WeilerP, I think we should use the absorption_probabilities and not the terminal_state_memberships, the latter don't really have a trajectory interpretation.
Hi TradeSeq team! I think you guys developed an awesome tool, and we were discussing recently whether and if we could interface to it from CellRank. CellRank defines trajectories through a soft assignment of cells to branches and one global pseudotime. So the idea would be to construct an interface that would allow users to seamlessly transition from CellRank to tradeSeq, once the trajectories have been defined. What do you think of this idea, and what data structures would you require on your end for such an interface? Looking fwd to your thoughts!