mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

Calculate TE's CPM expression frequency #179

Closed songlyzz closed 3 months ago

songlyzz commented 5 months ago

Hi olivertam, I apologize for taking the liberty of disturbing you. I recently had an idea to calculate the expression frequency of TE subfamilies. Can I compare the frequency of every TE subfamily between groups by calculating mean CPM / total CPM? I know that the total CPM is 10^6, but the mean CPM will differ. I think there may be something wrong with this, but I haven't been able to work it out. Is this method meaningful?

THANKS, SINCERELY.

olivertam commented 5 months ago

Hi,

It's not clear what you're referring to as mean CPM and total CPM. Is the total CPM the total counts attributed to all TEs, normalized per million mapped reads? How is mean CPM being calculated? I assume the denominator is "millions of mapped reads", but I'm not sure what the numerator would be.

Thanks.

songlyzz commented 5 months ago

Hi, sorry for being vague. I use the count matrix to normalize for sequencing depth by calculating a CPM matrix for the TE families, and then I divide the samples into a control group and a test group. In each group, I calculate every TE's mean CPM value by summing the values of all samples and dividing by the number of samples. Finally, I calculate the frequency of each TE as its share of the total across all TEs. Like this: result$average <- rowMeans(result); result$freq <- result$average / sum(result$average). I then use the freq value to compare TE expression levels between the groups.

THANKS, SINCERELY.
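
The calculation described in the comment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the counts, the family names used as row labels, the group assignment, and the helper names `cpm_matrix` and `group_freq` are all hypothetical, and the library size is taken as the sum of the listed rows only (a real library size would also include gene counts).

```python
# Hypothetical raw counts: one row per TE family, one column per sample.
counts = {
    "Alu": [500, 520, 480, 510],
    "L1":  [300, 310, 290, 305],
    "L2":  [150, 140, 160, 155],
    "LTR": [50, 30, 70, 30],
}
# Hypothetical group assignment: column indices of the samples in each group.
groups = {"control": [0, 1], "test": [2, 3]}


def cpm_matrix(counts):
    """Scale each sample (column) to counts per million.

    Assumption: library size = sum of the rows given here.
    """
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        te: [row[j] * 1e6 / lib_sizes[j] for j in range(n_samples)]
        for te, row in counts.items()
    }


def group_freq(cpm, cols):
    """Mean CPM per TE within one group, divided by the sum of those means."""
    mean = {te: sum(row[j] for j in cols) / len(cols) for te, row in cpm.items()}
    total = sum(mean.values())
    return {te: m / total for te, m in mean.items()}


cpm = cpm_matrix(counts)
freq = {name: group_freq(cpm, cols) for name, cols in groups.items()}

for name, f in freq.items():
    print(name, {te: round(v, 3) for te, v in f.items()})
```

Under that library-size assumption every CPM column sums to 10^6, so the per-group means also sum to 10^6 and the frequencies sum to 1 within each group. The result is one number per TE per group, with no replicate-level variance left to test against.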

olivertam commented 5 months ago

Hi,

I guess you can try that, though I'm not sure what statistical test you would use to determine if they are different.

Based on what you're describing, it sounds like you're trying to compare TE expression between test and control groups. Is there a reason why you can't perform differential analysis using the raw count matrix, using algorithms such as edgeR and DESeq2? In essence, they can normalize the values for each sample and perform comparison using negative binomial models or likelihood ratio tests.

I might not be understanding the comparison that you're trying to make.

Thanks

songlyzz commented 5 months ago

Hi, I've pasted the picture; I'm not sure if you can open it. I find that all groups highly express Alu, L2, L1 and LTR, and the frequencies even look the same.

THANKS

[Attached image: 1706804462252]

olivertam commented 5 months ago

Hi,

I think I now understand what you are asking about. You are wondering why there is "consistently high" expression of the L1, Alu and LTR families of transcripts in all samples, regardless of treatment/control.

I think one aspect that has yet to be fully resolved is the assumption that "active" TEs (typically defined by their ability to retrotranspose) correlate directly with transcription (as measured by RNA-seq), so that there shouldn't be TE expression when elements are not retrotransposing (not "active").

I think there is increasing evidence that TE transcription occurs more often than "downstream" activity (such as generating retroviral proteins and retrotransposition). There is also an argument that some of these TE transcripts (especially Alu and older L1) might be generated by "read-through" from other transcriptional units (especially given the large number of Alu elements in the genome), though without long-read sequencing, it's impossible to disentangle their contributions.

Thus, especially when assessing expression at the family level (which is what you're showing), expression appears "consistent" across all samples. However, what we have noticed is that at the sub-family level (e.g. L1HS rather than L1), there is more variability in expression due to "experimental conditions". That variability is lost when you include all the other L1 elements (with varying age/transcriptional activity/mappability), so you might be "averaging" out all the signals.

In contrast, if you could perform differential analysis at the sub-family (and perhaps even locus) level, you might see differences in expression of elements that are prone to perturbations, while elements with more "basal" transcription rates (e.g. older TE or readthrough) would not show up as differentially regulated.

Hope this is somewhat helpful.

Thanks.
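
The family-versus-sub-family point above can be made concrete with a toy calculation. The CPM values below are invented purely for illustration (they are not from any dataset), and `fold_change` is a hypothetical helper. The sketch assumes one condition-responsive young sub-family and two unchanged sub-families that dominate the family total.

```python
# Hypothetical mean CPM per L1 sub-family in each group (made-up numbers).
subfamily_cpm = {
    "L1HS":  {"control": 10.0,  "test": 40.0},   # assumed to respond to the condition
    "L1PA2": {"control": 50.0,  "test": 50.0},   # assumed unchanged
    "L1M5":  {"control": 940.0, "test": 940.0},  # assumed abundant "basal" signal
}


def fold_change(cpm):
    """Ratio of test to control CPM."""
    return cpm["test"] / cpm["control"]


# Fold change for each sub-family on its own.
sub_fc = {name: fold_change(cpm) for name, cpm in subfamily_cpm.items()}

# Collapse to the family level by summing sub-family CPM within each group.
family_cpm = {
    group: sum(cpm[group] for cpm in subfamily_cpm.values())
    for group in ("control", "test")
}
family_fc = fold_change(family_cpm)

print("sub-family fold changes:", sub_fc)
print("family-level fold change:", round(family_fc, 3))
```

In this toy case the 4-fold change in L1HS shrinks to roughly a 3% change once it is summed with the abundant, unchanged elements. That is the "averaging out" effect described in the comment. A real analysis would still rely on count-based methods such as DESeq2 or edgeR rather than ratios of means.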

songlyzz commented 5 months ago

Hi olivertam, I understand it. Thanks for your sincere reply, very very helpful! I am very grateful for your answer.

THANKS, SINCERELY.
