mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

The problem with the strand. #146

Closed: Wenwen012345 closed this issue 10 months ago

Wenwen012345 commented 10 months ago

Dear @olivertam

Hi, this is a very good tool, and this is already the second recent study in which I am applying it. I use EDTA (https://github.com/oushujun/EDTA) to identify TEs in all 8 of my species, but I found that many of the TEs identified with EDTA have no strand assigned (see: https://github.com/oushujun/EDTA/issues/80). However, I know that TEtranscripts should be run with strand information. What do you suggest for this issue?

Attached is the intact.gff3 file from the EDTA run: Rb.wrapped.FINAL.fasta.mod.EDTA.intact.gff3.txt

olivertam commented 10 months ago

Hi,

Thank you for your interest in the software.

It is suboptimal that there is no strand information in some of the EDTA output. I envision two options:

1. You can convert all the ? into +. This will have the least impact if you're working with unstranded RNA-seq libraries, as it will ensure that those elements are quantified. If you are working with stranded libraries, it could affect the precision of the quantification, as you can no longer be certain whether you are counting sense or antisense reads.
2. You can remove the entries with ?. This is the most conservative approach, as you would be able to quantify a stranded library without ambiguity. However, you will lose all those elements from the quantification.
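The two options can be sketched in a few lines of Python. This is only an illustration, not part of TEtranscripts or EDTA: the function name `resolve_ambiguous_strand` and its `mode` argument are made up here, and the sketch assumes a standard 9-column, tab-separated GFF3/GTF file with the strand in column 7.

```python
def resolve_ambiguous_strand(lines, mode="plus", ambiguous=("?",)):
    """Apply one of the two options to annotation lines.

    mode="plus": rewrite an ambiguous strand to "+" (option 1).
    mode="drop": remove entries with an ambiguous strand (option 2).
    Assumes tab-separated, 9-column GFF3/GTF lines (strand = column 7).
    """
    if mode not in ("plus", "drop"):
        raise ValueError("mode must be 'plus' or 'drop'")
    result = []
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        # Pass comments, blank lines, and malformed lines through untouched.
        if line.startswith("#") or len(fields) < 9:
            result.append(line)
            continue
        if fields[6] in ambiguous:
            if mode == "drop":
                continue
            fields[6] = "+"
        result.append("\t".join(fields))
    return result
```

To apply it to a file, read the lines from the GFF3, pass them through the function, and write the returned lines (joined with newlines) to a new file. Passing `ambiguous=("?", ".")` treats a `.` strand the same way.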

Checking through your GFF, I see that LTR/unknown is the predominant TE type with ? for strand (~11k entries), with a smaller number of Copia (90) and Gypsy (138). Thus, I expect altering the strand or removing ? entries to have the most impact on LTR/unknown and minimal impact on the other TEs.
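A tally like this can be reproduced with a short script. The sketch below is an assumption-laden illustration rather than the method actually used for the numbers above: `count_unstranded` is a hypothetical helper, and it assumes the TE label lives in a `Classification=` attribute in column 9 (falling back to the feature type in column 3 when that attribute is absent).

```python
from collections import Counter


def count_unstranded(lines, attr_key="Classification"):
    """Count entries whose strand is neither "+" nor "-", grouped by TE label.

    The label is taken from the `attr_key` attribute in column 9 if present
    (an assumption about the file layout), otherwise from column 3.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        if line.startswith("#") or len(fields) < 9:
            continue
        if fields[6] in ("+", "-"):
            continue
        label = fields[2]
        # GFF3 attributes are semicolon-separated key=value pairs.
        for item in fields[8].split(";"):
            key, _, value = item.strip().partition("=")
            if key == attr_key and value:
                label = value
        counts[label] += 1
    return counts
```

Calling `count_unstranded(lines).most_common()` lists the labels with the most strand-less entries first, which shows where either option would have the largest effect.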

Thanks.

Wenwen012345 commented 10 months ago

Ok, thanks for the reply. @olivertam

Sorry, some concepts are still not particularly clear to me. Is it that Mutator_TIR_transposon, PIF_Harbinger_TIR_transposon, Tc1_Mariner_TIR_transposon, etc. are not considered to be "expressed"? At least in my opinion, these DNA transposons should not be expressed at the RNA level; therefore TEtranscripts would not count expression at the sites where these TEs are located (unless they overlap with expressed transcripts), right? That would be why their strand is ".". But I noticed that the example file you provided, ZmAGPv4_TE.gtf, contains "DNA_transposon" entries, so I'm not sure whether TEtranscripts counts the "expression" of these sites. Also, in my file the "helitron" entries have strand information, even though helitrons are DNA transposons. That the LTRs are expressed is understandable, since they are Class I transposons. But I'm a bit confused about the other transposons and how TEtranscripts handles them.

Thanks!

Sincerely. Wen

olivertam commented 10 months ago

Hi Wen,

You are right that, "in theory", there might be no expression of these DNA transposons at the RNA level (though to be honest, I don't know if that is always true in plant systems).

However, TEtranscripts assumes that if the annotation exists, then there is a possibility of expression, and thus counts reads that overlap these TEs (whether the reads align to multiple genomic locations overlapping the same TE, or align uniquely and overlap just that TE).

TEtranscripts also makes the assumption that it is assessing largely spliced transcripts; thus, if a TE is located in the intron of a gene and is "detected", it will assign the read to the TE rather than to the gene intron.

Therefore, if you are absolutely confident that these DNA transposons should not have any expression in your system, you can take them out of the TE GTF, and that way TEtranscripts will not try to assign reads to them.

As a side note, the . value in the strand column would also have the same issue as the ? for TEtranscripts, in the sense that it can't perform stranded quantification. So if you're not interested in them, you can remove those entries too.
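Removing such entries from the TE GTF could look roughly like the sketch below. It is not a TEtranscripts utility: `filter_te_gtf` is a hypothetical helper, the class labels passed to it depend entirely on how your own GTF is labelled, and it assumes each TE line carries a `class_id "..."` attribute in column 9.

```python
import re

# Assumes TE GTF attributes of the form: ... family_id "X"; class_id "Y";
CLASS_ID = re.compile(r'class_id "([^"]*)"')


def filter_te_gtf(lines, excluded_classes=(), drop_unstranded=True):
    """Return GTF lines with unwanted TE entries removed.

    excluded_classes: class_id values to remove (labels are user-supplied).
    drop_unstranded: also remove entries whose strand is not "+" or "-",
    which covers both "?" and ".".
    """
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        # Pass comments, blank lines, and malformed lines through untouched.
        if line.startswith("#") or len(fields) < 9:
            kept.append(line)
            continue
        if drop_unstranded and fields[6] not in ("+", "-"):
            continue
        match = CLASS_ID.search(fields[8])
        if match and match.group(1) in excluded_classes:
            continue
        kept.append(line)
    return kept
```

For example, `filter_te_gtf(lines, excluded_classes={"DNA"})` would drop every entry whose `class_id` is exactly `DNA` (a placeholder label) along with every entry lacking a usable strand; set `drop_unstranded=False` to keep the latter.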

Please let me know if there are further questions.

Thanks.

Wenwen012345 commented 10 months ago

@olivertam Really appreciate your guidance and answers, very helpful. No more questions. Thanks so much!