statOmics / tradeSeq

TRAjectory-based Differential Expression analysis for SEQuencing data

Pitfalls of testing between conditions (conditionTest()) #250

Closed Jaimelan closed 4 months ago

Jaimelan commented 9 months ago

Hi, I am Jaime,

I'm currently using your pipeline to analyze trajectory inference data from Slingshot and determine differences between conditions from two lineages within the same dataset.

First of all, I wanted to thank you for the effort in developing this software, as it is proving very insightful for our data.

What I wanted is to discuss potential pitfalls that can affect the condition testing, and to know if there are circumstances in which it is preferable to approach the analysis from a different angle.

One example would be fitting data in which one condition lacks cells in part of the lineage. The Slingshot TI could still yield a trajectory, but in the contrasts, the models for the zone that is depleted for one condition would be misrepresenting the real data. I attach an image to illustrate this situation a bit better.

[Image: tradeseq]

Working with these pipelines, I have found that there is a very interesting and relevant ongoing discussion about the different approaches that can be applied to pseudotemporal data.

Anyways, thank you for your attention,

Best Jaime

koenvandenberge commented 9 months ago

Hi Jaime,

I'm not quite sure I understand the following statement

the models for the zone that is depleted for one condition would be misrepresenting the real data.

Are you referring to the tradeSeq model or to the condiments tests? What do you mean by 'misrepresenting the real data'? In your visualization, assuming the two colors represent two conditions, are you saying that condition 'red' has no cells along the lineage, but only at a start and end point?

Jaimelan commented 9 months ago

Hi, sorry if I was not clear.

I was referring to a hypothetical situation in which you are testing conditions within tradeSeq using the conditionTest() function, and one condition has no cells in part of the pseudotime of a single trajectory (without dividing trajectories with condiments).

The GAM could still be fitted and the conditions tested, couldn't it? But would the test be appropriate, given that the pseudotemporal ordering is somewhat "truncated" for the condition that lacks cells in that middle region?

I do not find these concepts to be easily grasped so forgive me if I did not make myself clearer.

Thanks

koenvandenberge commented 9 months ago

Hi Jaime,

To ensure I understand your question I'll try to recapitulate it here. Let me know if I got it right or not. You are thinking of a situation where we have e.g. 2 conditions, each with their own trajectory. If one would estimate a single trajectory when pooling all cells across both conditions, then one of the conditions has no cells in a particular region of this common trajectory. For example, the condition may not branch at a particular region where the other condition does branch out. Your question, then, is whether one can still perform DE using the conditionTest.

First, this seems to be a case where the two trajectories have different topologies, a situation which we describe in the condiments paper. When two trajectories have different topologies, the downstream analysis becomes much more complicated. If the two conditions' trajectories have different topologies, you could consider mapping the trajectories between conditions. In our toy example above, this would basically eliminate the branch that is present in only one condition, as this lineage cannot be mapped between conditions. A pragmatic way of dealing with this could also be to remove the cells that are part of this unique branched-out lineage, after which the trajectories should be (more) comparable.

If you would still fit a common trajectory, and there's this branch in the trajectory that is unique to one condition, I am unsure how tradeSeq will behave. In theory, we have no data there for one condition to fit the mean parameters, so our parameters would be non-estimable. But the basis functions are active across a large domain of pseudotime, so they may still get estimated. What I would hope happens is that their estimates are very uncertain and therefore the test has a high p-value for that specific lineage. In practice, however, this is a situation I have not encountered yet, so I can't say exactly what will happen.
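To make the "very uncertain estimates" intuition concrete, here is a minimal numerical sketch. It is not tradeSeq code: it assumes ordinary least squares with Gaussian noise and a Gaussian bump basis as a stand-in for the negative binomial GAM and its spline basis, and all names (`gaussian_basis`, `fit_with_se`, the simulated conditions A and B) are hypothetical. Condition B has no cells between pseudotime 0.3 and 0.7, yet its coefficients are still estimated because neighbouring basis functions overlap the gap.

```python
import numpy as np

def gaussian_basis(t, centers, width):
    """Smooth bump functions over pseudotime (a stand-in for a spline basis)."""
    return np.exp(-0.5 * ((t[:, None] - centers[None, :]) / width) ** 2)

def fit_with_se(t, y, grid, centers, width, ridge=1e-6):
    """Lightly penalized least-squares fit and approximate pointwise SEs on `grid`."""
    X = gaussian_basis(t, centers, width)
    # A tiny ridge keeps the system solvable when some basis functions see no data.
    cov_unscaled = np.linalg.inv(X.T @ X + ridge * np.eye(len(centers)))
    beta = cov_unscaled @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / max(len(y) - len(centers), 1)
    G = gaussian_basis(grid, centers, width)
    se = np.sqrt(sigma2 * np.einsum("ij,jk,ik->i", G, cov_unscaled, G))
    return G @ beta, se

def truth(t):
    """Assumed smooth expression pattern shared by both conditions."""
    return np.sin(2 * np.pi * t)

rng = np.random.default_rng(0)
centers = np.linspace(0.0, 1.0, 11)
width = 0.1
grid = np.linspace(0.0, 1.0, 101)

# Condition A covers all of pseudotime; condition B has no cells in (0.3, 0.7).
t_a = rng.uniform(0.0, 1.0, 300)
t_b = t_a[(t_a < 0.3) | (t_a > 0.7)]
y_a = truth(t_a) + rng.normal(0.0, 0.3, t_a.size)
y_b = truth(t_b) + rng.normal(0.0, 0.3, t_b.size)

_, se_a = fit_with_se(t_a, y_a, grid, centers, width)
_, se_b = fit_with_se(t_b, y_b, grid, centers, width)

mid = np.argmin(np.abs(grid - 0.5))       # centre of the depleted region
covered = np.argmin(np.abs(grid - 0.15))  # region where both conditions have cells
print(f"SE at 0.50: A={se_a[mid]:.3f}, B={se_b[mid]:.3f}")
print(f"SE at 0.15: A={se_a[covered]:.3f}, B={se_b[covered]:.3f}")
```

In this toy setting the standard error for condition B is much larger inside the gap than where it has cells, so a Wald-type contrast evaluated there would carry little weight. Whether the real negative binomial GAM behaves the same way is exactly the open question above.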

Happy to follow along to hear your feedback about this.

cc @HectorRDB

HectorRDB commented 9 months ago

Hi @Jaimelan, I am confused as well. In the original image, it looks like you have a common lineage for both conditions, but one region (the middle here) of the lineage only contains cells from one condition. In that case, as mentioned by @koenvandenberge, the coefficients will be highly uncertain in that region and won't drive any eventual differential expression between conditions. Just to check, you can use the plotSmoothers function to visualize a few genes, or even visualize the clusters of DEG as shown in the vignette.
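Alongside looking at smoothers, it can help to check directly where each condition has cells along the lineage. The following is a rough sketch of such a check, not a tradeSeq or condiments function: `coverage_by_condition`, `sparse_bins`, the bin count and the `min_cells` threshold are all hypothetical choices, and the data are simulated.

```python
import numpy as np

def coverage_by_condition(pseudotime, condition, n_bins=10):
    """Count cells per condition in equal-width pseudotime bins."""
    pseudotime = np.asarray(pseudotime, dtype=float)
    condition = np.asarray(condition)
    edges = np.linspace(pseudotime.min(), pseudotime.max(), n_bins + 1)
    # Use interior edges only, so the maximum value falls in the last bin.
    bin_idx = np.digitize(pseudotime, edges[1:-1])
    labels = np.unique(condition)
    counts = np.zeros((len(labels), n_bins), dtype=int)
    for row, label in enumerate(labels):
        counts[row] = np.bincount(bin_idx[condition == label], minlength=n_bins)
    return labels, edges, counts

def sparse_bins(counts, min_cells=5):
    """Indices of bins where at least one condition has fewer than `min_cells` cells."""
    return np.flatnonzero((counts < min_cells).any(axis=0))

# Simulated lineage: "treated" has no cells between pseudotime 0.35 and 0.65.
rng = np.random.default_rng(1)
pt_ctrl = rng.uniform(0.0, 1.0, 400)
pt_trt = np.concatenate([rng.uniform(0.0, 0.35, 200), rng.uniform(0.65, 1.0, 200)])
pseudotime = np.concatenate([pt_ctrl, pt_trt])
condition = np.array(["control"] * 400 + ["treated"] * 400)

labels, edges, counts = coverage_by_condition(pseudotime, condition, n_bins=10)
flagged = sparse_bins(counts)
for label, row in zip(labels, counts):
    print(label, row.tolist())
print("bins with sparse coverage:", flagged.tolist())
```

Bins flagged this way mark the stretch of pseudotime where a condition contrast rests on little or no data for one condition, which is where the smoothers deserve the closest look.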

Jaimelan commented 9 months ago

Hi again,

Actually, you have answered my doubts from a couple of perspectives.

Addressing @koenvandenberge, I see that the most appropriate way of proceeding would be to take into consideration the topology of the dataset and then decide if you still perform the contrast or you first remove that region of cells that is only present in one condition and then take the contrast.

However, sorry for the lack of clarity: @HectorRDB is right in his interpretation of the (most certainly improvable) graphic. So even if the cells are kept, the contrast would not yield differential expression in that region, due to the lack of information from one of the conditions.

This was just a hypothetical that arose in my group the other day while discussing the approaches with condiments and tradeSeq. Thank you both for the answers; it is clearer to me now.

Best, Jaime