mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
217 stars 29 forks source link

--stranded option #48

Closed clouds-drift closed 5 years ago

clouds-drift commented 5 years ago

I think for most of the RNA library, they are not stranded , right? Library such as brdu labeled are stranded, but it is not very often. The default --stranded is set as yes, that makes me lost a lot of reads if my library is not stranded?

hope for some help on this ! Thanks

olivertam commented 5 years ago

Hi,

If you are using the Illumina TruSeq stranded RNA library kits (which is the most common kit used in our institution), then the libraries tend to be reverse stranded (equivalent to --stranded reverse). In our experience with recent RNA-seq libraries submitted by other researchers to GEO/SRA (in the last two years or so), we find that unstranded RNA libraries tend to be in the minority. I think that stranded RNA libraries are becoming the norm, especially with the increasing interest in antisense transcripts that overlap known genes (e.g. Kcnq-ot1).

I agree that the default --stranded yes option might not be the most appropriate for most libraries nowadays. However, we haven't decided if defaulting to "no" (for unstranded libraries) or "reverse" (for most Illumina TruSeq libraries) would be a better option.

Regarding your question, using --stranded yes with an unstranded library would theoretically lead to 50% loss of read annotations. If you are processing an unstranded library, the best option would be --stranded no.

Please let me know if my answer is unclear, or if you have other questions.

clouds-drift commented 5 years ago

Dear Oliver, Thank a lot for your kind explanation. It is my mistake to set the "--stranded yes", since my library is done by Illumina TruSeq which should be "--stranded reverse". That is also the reason for lost of most of reads on assignment. Thank you anyway. I'm also trying use featureCounts to do similar things, since it seems much faster. I would like to combine the coding gene gtf and TE gene gtf together. And then do "no multimap reads" and "allow multimap reads" separately. As following:

  1. featureCounts(files, annot.ext="combine.gtf",GTF.featureType="exon",GTF.attrType="gene_id", isPairedEnd = T,strandSpecific=2)

  2. featureCounts(files, annot.ext="combine.gtf",GTF.featureType="exon",GTF.attrType="gene_id", isPairedEnd = T, allowMultiOverlap=T,countMultiMappingReads=T,fraction=T,strandSpecific=2)

I'm doing this in 3 different ways (including TEcount) to confirm the quantification. How do you think of this? Does it make sense?

olivertam commented 5 years ago

Hi,

It makes sense what you're trying to do, though each of the methods will give you different results (and there are some potential quirks that you may have to deal with).

The first featureCounts command would ignore reads that match a gene and a TE (e.g. a TE in a gene exon), because it would technically match two meta-features (the gene, and the TE). Since you're adding the strand-specificity, the two meta-features would also have be in the same orientation for the read to be ignored, but this is not an uncommon occurrence. It is also unclear if the first command would ignore any multi-mapping read (that is my suspicion). This should hopefully mirror the unique mode of TEtoolkit.

The second featureCounts command should handle the multi-mapped/multi-annotated reads from above, and would also fractionally distribute reads to multiple locations and multiple meta-features. Unlike TEtoolkit, the redistribution would be 1/n, where n is the number of genomic location x meta-feature found, whereas TEtoolkit would then perform an further expectation maximization optimization step to better re-distribute the reads.

Thus, you will get slightly different results between the second featureCounts command and TEtoolkit, and I can't predict how different they could be

Please let me know if you have further questions. Thanks

clouds-drift commented 5 years ago

Dear Oliver, Thanks a lot for your kind explanation! I'm sorry for my late response, since my email didn't get the notification. As you said, I felt something not good if I combine two gtf files. So I do the read count for gene annotation and TE annotation separately.

For the first command, I missed the "countMultiMappingReads=F". So it should be featureCounts(files, annot.ext="gene.gtf",GTF.featureType="exon",GTF.attrType="gene_id", isPairedEnd = T,strandSpecific=2, countMultiMappingReads=F)

For the second command, I know for sure TEtoolkit will assign the reads more reasonably than featureCounts by using the EM methods.

Thank you very much for your help! Xu