mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
229 stars 30 forks source link

Strandedness of TEs and getting antisense counts #171

Closed murphytho closed 9 months ago

murphytho commented 11 months ago

Hi there,

First let me say thank you for an awesome tool! Our lab is very excited to see how we can use it.

My question may be a bit complex, and I'm unsure if this is the best place and people to ask, but I figure it can't hurt.

I have RNA-seq data resulting from a reverse stranded library prep that I am running through TEtranscripts. Out of curiosity, I decided to run it in all three modes: reverse, forward, and unstranded. In the past when I've done this (using other softwares in non-TE experiments with TE reads and multimappers excluded), I've gotten high counts in the reverse orientation, low in the forward, and about the same as the reverse in the unstranded mode. When I did it here, however, I got quite a different story.

I found that running it in forward mode actually gave quite high counts for TEs, in many cases comparable to or even more than the count in reverse mode. Running it as unstranded gave me even higher numbers, but these counts were NOT equivalent to the counts in forward mode + the counts in reverse mode.

My number one theory right now is a biological explanation: bidirectional transcription. The GTF file specifies a strand that each TE gene is "on" (the sense direction), but if there is bidirectional transcription, the reads from an antisense transcript would map to the other strand and not be counted in the reverse mode. In theory, these antisense transcripts would be counted in the forward mode, though, right?

So, my question is two parted: First, why is the count of a particular TE in unstranded mode not simply the count in forward mode + the count in reverse mode? This is true of both the normalized counts from DESeq2 and the non-normalized count table. The unstranded counts are always lower than the sum of counts from the other two modes.

Second, is running the program in forward mode an adequate and accurate way to quantify antisense transcripts? Or is a better way to do this to adjust the GTF file to swap every + for a - and vice versa in the strand column and re-run in reverse mode? And related, to get the most accurate total count (sense+antisense transcripts or forward+reverse strand) is it better to use the unstranded mode or simply add those that I get from forward and reverse?

As I mentioned, I'm not certain I'm asking the right people, but I have a feeling you'll have a better idea than me. Any thoughts would be greatly appreciated!

Thanks

olivertam commented 11 months ago

Hi,

Thank you for your interest in the software. We will try to answer your questions as best as we can.

1) It is not unusual that the unstranded mode is not exactly the count of the sense + antisense counts, as the program would need to always take into account of both the gene and TE annotations if there are any overlaps, versus ignoring whichever annotation is not in the correct orientation (of course, if the gene & TE are in the same orientation, then it would be affected regardless of mode). However, you should typically expect numbers that are pretty close (within 80-90% of just adding them up), so if you're seeing significantly altered numbers (e.g. fold differences), then that is certainly a concern.

2) Running the program in forward mode would technically work for the TE, but it would also try to annotate antisense gene reads, which may or may not be what you want (and would certainly affect genes with known antisense transcripts). It might be safer to keep the genes as "sense", and maybe alter the TE GTF as you described (swapping + for -), and the running it as required. You could also try to combine the altered and unaltered TE GTF, though you would need to change the gene_id to reflect the antisense nature of the annotation (e.g. L1HS_antisense).

Please let us know if that doesn't address your question.

Thanks.

murphytho commented 11 months ago

Hello again,

Thank you so much for both a quick and thorough reply! I had not considered the idea of DNA contamination. I didn't do the lab work, so I'm not sure if DNase was used during RNA isolation. I should keep that in mind for the next time we do this.

I'm sure there is a better way to do this, but I did what may be a quick and dirty check by comparing the total number of reads mapped by STAR to the counts in the TEtranscripts output table. I tested a few samples, and the numbers are similar. In one sample, STAR mapped about 66 million reads (either unique or multi), but adding up all values from the TEtranscripts count table (run in reverse mode) only yields about 40 million reads mapped to annotations in my GTFs. What are your thoughts on that ballpark percentage? Suggestive of contamination?

When I check the count table run in forward mode, I get about 17 million reads mapped to annotated regions (13 million of these come only from TEs-- not from cellular genes)

Looking through the count table run in "forward" mode, I do see counts higher than I would expect for many non-TE genes. However, they are much less than the counts when run in the "forward" mode, which makes sense. What is strange however, is that the TE counts are generally higher when run in the forward mode than in the reverse, suggesting (assuming no contamination) that antisense transcripts are more common than sense transcripts in these TEs. If contamination were the only explanation, I would expect to see this trend hold true in all annotated genes, not just in TEs, right? Plus, DNA contamination, theoretically, would be uniform throughout all TE reads, sense or antisense, and whichever counts are higher should be the more expressed one.

This is becoming more and more complex! Any thoughts on how I did that estimation, why my reads mapping to the "antisense strand" (forward mode) are most TEs, or why I would get quite consistently higher "antisense" reads than "sense" reads in TEs?

Thanks again so much.

olivertam commented 11 months ago

Hi,

Thanks for your response.

Your quick and dirty check is useful. We have been typically getting around 70% of mapped reads annotated to our gene (GENCODE) and TE annotations. Your values of 40/66 million is a little lower than what we have been seeing, but I don't think it's simple contamination (as you pointed out, antisense gene reads are not as numerous).

There is a confounding factor regarding the high TE counts in the forward run: if a TE is inserted antisense to the gene transcript, then a run that looks for sense reads (reverse) would ignore that TE but count it towards the gene. However, in the forward mode, those gene-derived sense reads are now treated belonging to the TE (antisense), and so potentially inflating the TE counts with gene-derived transcripts that happen to contain TE within. This is why we recommend combining sense and antisense TE annotations together (if that is something that you're really interested in) and running it with the gene annotations, so that gene-derived reads won't be erroneously assigned to TE.

Hope this is helpful.

Thanks.

murphytho commented 11 months ago

That's a great point. How does the program handle overlapping annotations? If I do adjust the TE GTF to create an "antisense" GTF (flipping all the signs) and there is a gene overlapping a TE in that direction, how will TEtransciripts assign a read that maps to both?

Also, if we decide to go down this route, would you suggest A) flipping all the signs in the TE GTF, combining the sense and antisense GTFs, and running in reverse mode with the GENCODE GTF or B) flipping signs to create an antisense GTF and TEtranscripts twice, once with the sense TE GTF and the GENCODE GTF and once with the antisense TE GTF and the GENCODE GTF?

Thank you again for your help, insight, and fast responses. I appreciate it immensely!

olivertam commented 11 months ago

Hi,

TEtranscripts & TEcount handles uniquely aligned and ambiguously/multi-aligned reads differently. If a read is uniquely aligned in the genome, it would preferentially be assigned to a gene annotation. If there are no gene annotations, then it would assign it to a TE annotation. If neither exists, the read is unannotated (and thus not quantified).

For reads that are ambiguously/multi-aligned, they are first assigned to any overlapping TE annotation (ensuring the strand of the aligned read and annotation matches the strandedness of the library), with a fractional count for each TE annotation. If there are no TE, then the read is distributed to overlapping gene annotations.

After all reads are distributed, the ambiguously aligned reads that were assigned to TE then go through the EM steps for redistribution, before being added to the rest.

TL; DR: Depending on whether it's uniquely/ambiguously aligned, it will be preferentially assigned one annotation over the other

To address your other question: I would probably recommend running A, since that would ensure that the EM will try to redistribute reads for both sense and antisense TE at the same time, while ensuring that gene reads are assigned as best as it can. It would also technically generate a single count table with everything, so you don't have to combine each run for your differential analysis. However, we haven't really tested that approach with RNA-seq (we have only done this for small RNAs), so we can't say what the "expected" result would be.

Thanks.

murphytho commented 11 months ago

I agree, A seems like the best option. I will give it a try, and make sure that I keep in mind that it isn't strictly meant to be used that way and isn't tested. We will see if anything interesting shakes out, though.

Thank you once again for your quick, detailed, and insightful answers. I truly appreciate it and hope you have a great New Years!

murphytho commented 11 months ago

Hi there, one more quick question.

I am using one of your guys' GTF files for TEs (CRCh38_GENCODE_rmsk_TE.gtf) and just found what I think might be an artifact, but I want to check. I noticed that there is an annotation (in the geneid flag) that says "LTR12" but also one that says "LTR12" in other lines. Both have multiple lines annotated with them. Of course, that underscore makes them counted separately. But are these actually different elements? This is the only example I've found like this-- it just seems like an underscore may have been put in some of the annotations by accident at some point, and the GTF carried that mistake with it.

Let me know your thoughts.

olivertam commented 11 months ago

Hi,

Thank you for your question. This is actually something that exists in the hg38 RepeatMasker track (which we then convert to GRCh38 GENCODE nomenclature), and corresponds to LTR12_v, a variant of LTR12. For some reason, when the RepeatMasker track was generated, the _v got converted to _, and thus we have this weird annotation. It is technically a subfamily variant, so probably should be considered separate. We have seen similar issues with other TE "variants" in other organisms.

While we could "fix" the issue, we decided that we would try to minimize changes to the original annotations so we could easily trace back any weirdness to the source (e.g. with LTR12_). If you are interested in seeing an LTR12_ annotation in the UCSC genome browser, you can navigate to chr1:52587723-52587930 in hg38 and look at the RepeatMasker track.

Hope this addresses your concerns.

Thanks.

murphytho commented 11 months ago

Got it. That makes sense. As long as it’s a unique subfamily/variant, it doesn’t particularly matter what it’s annotated as, just that it’s different. I will leave it as it is too— as you said, it will make it easier to trace back anything we see. Good to be aware of.

Thank you again so much!

github-actions[bot] commented 10 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days