mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

TECount counting/assignment of 'split reads' to features #76

Closed: lfra closed 4 years ago

lfra commented 4 years ago

Hello,

I have been using TECount in 'multi' mode on human 50 bp single-end RNA-seq data aligned with STAR, and I was wondering whether you could explain how TECount handles the 'split reads' that STAR alignments typically generate, and split reads in general.

By split reads I mean reads that map to more than one feature (e.g. the first half maps to featureA and the second half to featureB).

In a scenario where such a read maps to a gene exon and a repeat feature, how is it counted: for the repeat, for the gene, or for both? The same question applies if one half maps to a feature (exon or repeat) and the other half maps to no feature.

I know that htseq-count, for example, has different modes to exclude certain types of mapping from being counted (see https://htseq.readthedocs.io/en/release_0.11.1/count.html). Could you explain exactly how TECount handles this?

Thank you very much in advance.

olivertam commented 4 years ago

Hi,

Thanks for your interest in the software.

In your first scenario (where the split read maps to both a gene exon and a repeat), it depends on whether the read is uniquely or ambiguously aligned to the genome. If the read is uniquely aligned, the software preferentially allocates the read to the gene, and to the TE only if there are no gene annotations. For ambiguously mapped reads, the read is preferentially allocated to the repeat, and to the gene only if there are no TE annotations. The initial read "count" for each repeat is 1/n (where n is the number of distinct repeats that the ambiguously mapped read overlaps), followed by EM steps to redistribute reads among the multiple matches.
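To make the priority rule above concrete, here is a minimal Python sketch. It is not TECount's actual code: the function name `allocate` and its arguments are hypothetical, and the equal split in every case other than an ambiguously mapped read overlapping repeats is an assumption made only so the function always returns weights.

```python
def allocate(is_unique, gene_hits, te_hits):
    """Illustrative only: pick the feature class for one read and give
    each chosen feature an initial weight.

    is_unique -- True if the read aligns to a single genomic location
    gene_hits -- gene annotations overlapped by the read's alignment(s)
    te_hits   -- TE annotations overlapped by the read's alignment(s)
    """
    genes = sorted(set(gene_hits))
    tes = sorted(set(te_hits))

    if is_unique:
        # Uniquely aligned: gene first, TE only if no gene annotation.
        chosen_class, chosen = ("gene", genes) if genes else ("TE", tes)
    else:
        # Ambiguously aligned: TE first, gene only if no TE annotation.
        chosen_class, chosen = ("TE", tes) if tes else ("gene", genes)

    if not chosen:
        # No overlapping feature at all: the read is not counted.
        return None, {}

    # 1/n initial weight. The thread states this for repeats hit by
    # ambiguous reads; applying it elsewhere is an assumption.
    weight = 1.0 / len(chosen)
    return chosen_class, {feature: weight for feature in chosen}
```

For example, `allocate(True, ["GeneA"], ["L1HS"])` picks the gene, while `allocate(False, ["GeneA"], ["L1HS", "AluY"])` picks the two repeats with an initial weight of 0.5 each.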

In your second scenario (where the split read maps to a feature on one end, and no feature on the other), the software will assign the read to the feature if uniquely aligned, and distribute with EM if there are multiple alignments (with others matching to repeats). If a read fails to map to any feature, then it is not counted.
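The "distribute with EM" step can be pictured with a toy example. The sketch below is a generic expectation-maximization loop for multi-mapped reads, not TECount's implementation (which may weight by other factors such as feature length); `em_redistribute` and its inputs are hypothetical names chosen for illustration.

```python
def em_redistribute(read_candidates, n_iter=50):
    """Toy EM: each read is a list of candidate features it overlaps.

    Starts from the 1/n split, then repeatedly re-splits every read in
    proportion to the current estimated count of each candidate.
    """
    features = sorted({f for cands in read_candidates for f in cands})

    # Initial estimate: each read contributes 1/n to each of its n candidates.
    counts = dict.fromkeys(features, 0.0)
    for cands in read_candidates:
        for f in cands:
            counts[f] += 1.0 / len(cands)

    for _ in range(n_iter):
        new_counts = dict.fromkeys(features, 0.0)
        for cands in read_candidates:
            if not cands:
                continue  # reads overlapping no feature are not counted
            total = sum(counts[f] for f in cands)
            for f in cands:
                # Fall back to an equal split if all estimates hit zero.
                share = counts[f] / total if total > 0 else 1.0 / len(cands)
                new_counts[f] += share
        counts = new_counts
    return counts


# Two reads hit only L1HS; one ambiguous read hits both L1HS and AluY.
example = em_redistribute([["L1HS"], ["L1HS"], ["L1HS", "AluY"]])
```

In this toy case the ambiguous read drifts toward L1HS over the iterations, because the uniquely assigned reads already support that repeat, while the total count stays at three reads.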

Hope this addresses your question. Please let me know if you have other questions or comments.

Thanks.

lfra commented 4 years ago

Hi,

Thank you very much for your fast reply. I think I got it. Just one clarification on this part you wrote:

"If the read is uniquely aligned, the software would preferentially allocate the read to the gene, and to the TE if there are no gene annotations"

So a read that maps uniquely but is 'split-aligned' to both a gene and a TE feature is still preferentially allocated to the gene?

olivertam commented 4 years ago

Yes, that is correct. This is our conservative approach: we assume that uniquely aligned sequences are more likely gene-derived than TE-derived if they match both. We agree that this would cause problems with chimeric transcripts, but since our software aims to quantify based on known annotations, we cannot account for novel chimeric transcripts. Thanks.

lfra commented 4 years ago

Thanks for the clarification!