pachterlab / sleuth

Differential analysis of RNA-Seq
http://pachterlab.github.io/sleuth
GNU General Public License v3.0
305 stars 95 forks source link

Option to output gene-level abundances from sleuth_to_matrix, even when gene_mode = FALSE #188

Open apcamargo opened 6 years ago

apcamargo commented 6 years ago

I can't get gene-level expression values using sleuth_to_matrix, usig both raw_obs and norm_obs. The function returns a transcript-level matrix.

I'm using version 0.30.0.

warrenmcg commented 6 years ago

Hi @apcamargo, did you use gene_mode = TRUE when you ran sleuth_prep? That's required to get gene-level aggregation of counts and TPMs, and it's required to mimic the previous behavior for sleuth versions <= 0.29.0.

If gene_mode = FALSE (which is the default, even if you set the aggregation_column), this method will only report transcript-level values. If gene_mode = TRUE, you can only get gene-level values with norm_obs. Let me know if this works or if there is indeed a bug.

apcamargo commented 6 years ago

It worked. Thanks Warren!

Is there a reason that the new p-value aggregation prevents sleuth_to_matrix from returning gene-level abundance values? I understand that this method doesn't use gene-level expression for it's statistical analysis, but this behavior of not returning gene-level expression values is not user friendly. When a user is doing gene-level testing, I imagine the he will want gene-level abundances.

On a different note, I think it is easy to output the non-normalized TPM gene-level values in sleuth_to_matrix. It's just a matter of multiplying the normalized values by the size factors. I imagine that sleuth currently only supports the normalized values because the between sample normalization is done before summing up transcripts abundance to obtain gene-level abundance. Is that right?

I'm not very knowledgeable in R, to say the least, but I can try to implement this if it's ok with you and Harold.

lynnyi commented 6 years ago

Hi @apcarmargo for your question. I can address some of the questions about p-value aggregation and our rationale going into this. The new way of gene-level testing is built on the idea that the gene significance should take into account transcript dynamics. Since the aggregated gene p-value is only based on the constitutive transcript p-values, we believe looking at overall gene expression would be misleading (there may be significant yet lowly expressed transcripts, or cancellation of effect sizes between transcripts). Instead, looking at the changes in each constitutive transcript’s expression would better explain the aggregated gene p-value.

Also thanks Warren for fielding all the questions!

Best, Lynn

On Jun 26, 2018, at 9:48 AM, apcamargo notifications@github.com wrote:

warrenmcg commented 6 years ago

Hi @apcamargo, I agree that gene-level TPMs are easily returned, and maybe we could implement that.

In the meantime, I would lean heavily on @lynnyi's comments about the utility of looking at gene-level TPMs, whether normalized or not. You may consider running models gene-level aggregation of TPMs to see if there is separately detected change in gene-level abundances, but not sure if it's worth it. I would read the p-value aggregation paper for details, as well as Dr. Pachter's blog post on the topic.

We'll leave the issue open until we decide about implementing the gene-level aggregation in sleuth_to_matrix

apcamargo commented 6 years ago

Hi @lynnyi

I understand that the p-value aggregation strategy doesn't use gene-level abundances for the statistical analysis, but I think users may be confused that they're doing gene-level analysis and getting transcript-level abundances with sleuth_to_matrix.

One possible solution would be adding a new parameter to sleuth_to_matrix in which users would choose whether they want transcript or gene-level abundances, irrespective of the type of gene-level testing.

I've read your paper and fully understand the statistical taking of this strategy, but I observed that my coworkers have gotten very confused that they're not able to get their abundance values. I think this is a matter of user friendliness.

mschilli87 commented 6 years ago

@apcamargo: So if a gene comes up as significant because it switched the isoform while keeping the overall gene level constant, you'd like to get a 'significant' gene level fold-change of 0? Wouldn't this confuse user even more?

IMHO sometimes it is worth actually dealing with the method at hand and trying to understand why it doesn't do things the way one is used to, also when not performing the analyses. I guess few of your collaborators asks you for delta-CT values after analyzing RNA-seq date even though this might be what they used to be used to. :wink:

That aside, offering an option to return gene level estimates probably wouldn't hurt anyone. I just doubt it would help the interpretation of the data.

That's obviously just my 2 cents. :smile:

apcamargo commented 6 years ago

@mschilli87 I agree with you that gene-level TPM values do not make sense in this kind of analysis, but I'm not certain that making it impossible for the user to get those values is a good idea. Maybe sleuth_to_matrix should return transcript-level values by default and gene-level on user request, just like you said.

warrenmcg commented 5 years ago

There was an additional user on the forum that wants a feature similar to what has been discussed here (link to the discussion).

The framework to implement this is available with the deprecated gene_summary and read_per_base_transform methods, but they would need to be reconfigured to work well for this use case.

I have two suggestions:

Given the time constraints, it's not clear if this will be implemented for the next release, but it will be prioritized for the subsequent release if it's not ready in time for this next release.

sudarshaniyer commented 5 years ago

There was an additional user on the forum that wants a feature similar to what has been discussed here (link to the discussion).

The framework to implement this is available with the deprecated gene_summary and read_per_base_transform methods, but they would need to be reconfigured to work well for this use case.

I have two suggestions:

  • We can implement a version where it is possible to get gene-level counts even from the raw transcript values, but it is best if we still use the "reads_per_base" transform because of the multiple issues with combining counts from transcripts of different lengths. Maybe there is space to allow for customization by the user, but for now, this will be the default behavior.
  • For API, we can make a gene_mode option to sleuth_to_matrix that takes as the default the obj$gene_mode. This would allow the function to output gene-level values as a default even for the raw values when gene_mode == TRUE, but could be set to something different by the user.

Given the time constraints, it's not clear if this will be implemented for the next release, but it will be prioritized for the subsequent release if it's not ready in time for this next release.

Is there any update on if/when this new feature will be released?