mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars, 29 forks

Consistently higher TE expression in control rather than treatment across many datasets #178

Closed: stevencincotta closed 3 months ago

stevencincotta commented 5 months ago

Hello Oliver,

I have a quick question about a potentially strange output I am getting from TEtranscripts. I have run four entirely different datasets through the program, and when I perform the differential expression analysis at the end, my volcano plots are consistently skewed towards the "control" condition. This has occurred across datasets with entirely different "controls", and I do not suspect a biological effect: in almost all of our treatment conditions I expected to see higher TE expression, and I am in fact seeing the opposite. Furthermore, this skew seems to occur only when I look at the TEs, not the genes.

My workflow is to take the DESeq2.R file that the program outputs, run it in RStudio on the generated counts table, and then split the resulting .txt file into genes and TEs as two separate .csv files, which I read into R to generate my plots. I was wondering whether anyone else has come across this issue and whether you might have any suggestions. I am not sure if this has to do with the data normalization steps in the generated .R files. Any insight you might have would be greatly appreciated. Thank you in advance!

Here are some representative images of my volcano plots (blue is control, orange is treatment):

[Images: three volcano plot screenshots, dated 2024-01-30]

olivertam commented 5 months ago

Hi,

Thank you for your interest in the software.

We have seen skews in TE up/down-regulation before, but not in a consistent way that would be independent of biological condition (i.e. we see both up and down in different experiments/comparisons).

I guess my question is whether there are any positive controls (e.g. genes that you expect to go in a particular direction) that confirm that the comparison is correct.

It might also depend on the type of "controls" that you're using. We have seen cases where knockdown with certain "control" shRNAs does appear to affect TE expression, and thus we had to be more careful when selecting control hairpins. Again, this might not be your case, but just a thought.

Without digging into the data, I don't think I can offer more advice, but I'd be happy to discuss this further.

Thanks.

stevencincotta commented 5 months ago

Hi Oliver, thanks so much for your swift response!

I have plenty of internal positive controls that are working (e.g. on the gene side and not the TE side, we see loss of our knockout gene, etc.) to at least confirm the comparison is correct. Do you by any chance have a human dataset that you could point me towards that has been verified to run well on TEtranscripts that I could potentially analyze as a control?

The more I look into this, the more I think a biological factor may be partly responsible for this skew relative to my control. My treatment conditions involve activation of immune cell populations, which may be accompanied by a large upregulation of total transcript content in the cells, and I fear this may be "masking" the more lowly expressed TEs. If you'd prefer to discuss this offline rather than on this forum, please just let me know. I would be curious to get your insight on how I might normalize for this within the tool. Thanks again!
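To make the "masking" concern concrete, here is a minimal sketch with made-up numbers. The abundances, the feature names, and the `reads_at_fixed_depth` helper are all hypothetical and are not TEtranscripts output. It assumes that sequencing samples reads in proportion to abundance at a fixed depth, so that a feature with unchanged absolute abundance can still appear down-regulated when everything else goes up.

```python
import math


def reads_at_fixed_depth(molecules, depth):
    """Scale absolute molecule counts to expected read counts at a fixed depth."""
    total = sum(molecules.values())
    return {name: depth * n / total for name, n in molecules.items()}


# Hypothetical absolute abundances per cell (illustrative numbers only).
control = {"GENE_A": 600.0, "GENE_B": 300.0, "L1HS:L1:LINE": 100.0}
# Assumed scenario: activation doubles gene output, TE output is unchanged.
treatment = {"GENE_A": 1200.0, "GENE_B": 600.0, "L1HS:L1:LINE": 100.0}

depth = 1_000_000  # same sequencing depth for both samples
ctrl_reads = reads_at_fixed_depth(control, depth)
trt_reads = reads_at_fixed_depth(treatment, depth)

# Apparent log2 fold change (treatment vs. control) from read counts alone.
log2fc = {
    name: math.log2(trt_reads[name] / ctrl_reads[name]) for name in control
}

for name, value in log2fc.items():
    print(f"{name}\t{value:+.2f}")
```

In this toy case the TE comes out at roughly -0.93 log2 fold change even though its absolute abundance never changed, while the genes look almost flat. Real data are noisier, and this says nothing about whether the effect is actually present in a given dataset.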

olivertam commented 5 months ago

Hi,

This is one of our datasets where we saw upregulation of TEs upon TDP43 knock-down, and where we also see slight differences in TE expression depending on the control shRNA that we used. We do see a skew here too (where TEs are all up-regulated). If you want another dataset where we don't see many TE changes, let me know.

Regarding your hypothesis about a large upregulation of total cellular transcript, that will certainly cause issues, as it breaks an assumption that many of these algorithms make (namely, that most features are similar between samples). It would be interesting to see if, after normalization, the proportions of total TE reads (compared to gene + TE) are significantly different between your treatment and control. If so, that might explain the huge skew in TE expression that you're seeing.
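A rough sketch of that check, in Python with only the standard library. It assumes a tab-delimited count table whose first column holds feature IDs, and that TE IDs contain colons (e.g. `name:family:class`) while gene IDs do not. That labeling rule, the toy table, and the function name `te_fraction_per_sample` are assumptions for illustration, so please verify them against your own table. Because a per-sample scaling factor multiplies genes and TEs alike, the within-sample TE fraction is the same on raw or scaled counts.

```python
import csv
import io


def te_fraction_per_sample(table_text):
    """Return {sample: TE reads / (gene + TE reads)} for a tab-delimited table."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    samples = next(reader)[1:]  # header row: feature column, then sample names
    te_totals = dict.fromkeys(samples, 0.0)
    all_totals = dict.fromkeys(samples, 0.0)
    for row in reader:
        if not row:
            continue
        # Assumption: only TE IDs contain a colon; check this for your annotation.
        is_te = ":" in row[0]
        for sample, value in zip(samples, row[1:]):
            count = float(value)
            all_totals[sample] += count
            if is_te:
                te_totals[sample] += count
    return {
        s: (te_totals[s] / all_totals[s] if all_totals[s] else 0.0)
        for s in samples
    }


# Toy table with made-up counts, only to show the expected layout.
TOY_TABLE = (
    "gene/TE\tctrl_1\ttreat_1\n"
    "GENE_A\t900\t1900\n"
    "GENE_B\t50\t50\n"
    "L1HS:L1:LINE\t30\t30\n"
    "AluYa5:Alu:SINE\t20\t20\n"
)

fractions = te_fraction_per_sample(TOY_TABLE)
for sample, fraction in fractions.items():
    print(f"{sample}\t{fraction:.3f}")
```

With a real table you would read the file contents into `table_text` and compare the fractions between the control and treatment replicates.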

Regarding how to deal with this, if there are clear differences in the overall behavior of gene and TE transcription, it might be worth splitting your final count table into genes only and TEs only, and then running differential analysis on each separately. If you still see a global decrease in TEs in your treatment, then you might want an orthogonal validation (e.g. qPCR) to check if this bears out.
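A minimal sketch of the split, in Python with only the standard library. It assumes a tab-delimited count table with feature IDs in the first column and TE IDs that contain colons (e.g. `name:family:class`). The colon rule, the toy table, and the helper names `split_count_table` and `to_tsv` are assumptions for illustration, not part of TEtranscripts.

```python
import csv
import io


def to_tsv(rows):
    """Serialize a list of rows back to tab-delimited text."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def split_count_table(table_text):
    """Split one count table into (genes_only, tes_only), each keeping the header."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    header = next(reader)
    gene_rows, te_rows = [header], [header]
    for row in reader:
        if not row:
            continue
        # Assumption: only TE IDs contain a colon; check this for your annotation.
        target = te_rows if ":" in row[0] else gene_rows
        target.append(row)
    return to_tsv(gene_rows), to_tsv(te_rows)


# Toy table with made-up counts, only to show the expected layout.
TOY_TABLE = (
    "gene/TE\tctrl_1\ttreat_1\n"
    "GENE_A\t900\t1900\n"
    "L1HS:L1:LINE\t30\t30\n"
    "GENE_B\t50\t50\n"
    "AluYa5:Alu:SINE\t20\t20\n"
)

gene_table, te_table = split_count_table(TOY_TABLE)
print(gene_table)
print(te_table)
```

Each resulting table could then be written to its own file and analyzed on its own, so that normalization is estimated within genes and within TEs rather than across both.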

Sorry if this isn't very helpful. I'm OK with continuing the discussion on this forum, though happy to go offline if you prefer not to discuss your project here.

Thanks.

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.