mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

half of the reads are not annotated #155

Closed MonkeySylvia closed 6 months ago

MonkeySylvia commented 8 months ago

Hi, I'm using TEtranscript to count gene/TE expression from human samples, everything ran smoothly but there are ~40% unannotated reads reported by TEtranscript. I didnt get any error messages so I cound't figure out why this is the case.

Using STAR for my mapping (Ensemble gtf file):

       STAR \
      --runMode alignReads \
      --genomeDir $INDEX \
      --readFilesIn $f1 $f2 \
      --runThreadN 100 \
      --sjdbGTFfile Homo_sapiens.GRCh38.110_chr.gtf \
      --outSAMtype BAM SortedByCoordinate \
      --outFilterMultimapNmax 100 \
      --alignIntronMax 1000000 \
      --outSAMmultNmax 1 \

Then TE transcript

TEtranscripts \
-t sample1.bam sample2.bam sample3.bam \
-c sample4.bam sample5.bam  sample6.bam \
--sortByPos \
--GTF Homo_sapiens.GRCh38.110_chr.gtf \
--TE GRCh38_Ensembl_rmsk_TE_chr.gtf \
--mode multi \
--stranded reverse &

and here is the output

In library sample1.bam:
Total annotated reads = 35048548
Total non-uniquely mapped reads = 1
Total unannotated reads = 21057084

I did check the NH flag in my bam file, and there are definitely reads with NH:20, NH:5 in there. Thank you!

olivertam commented 8 months ago

Hi,

Unannotated reads are not necessarily unusual. These could be intronic reads (which are ignored by default), basal transcription of the genome, or other reads derived from features that are not annotated by Ensembl (as a gene) or as a TE (some examples could be tRNA, ribosomal RNA, some mitochondrial genes).

However, I am quite surprised to see that Total non-uniquely mapped reads = 1, which is very unusual. However, I see the issue in your STAR command line. You specified --outSAMmultNmax 1, which means that even if you find multiple alignments, you're only reporting one in your SAM/BAM file. As a result, you don't actually see all the possible alignments, and thus lose the benefit of multimapping.

Thanks.

MonkeySylvia commented 8 months ago

Got you!! I'll redo the mapping again. Thanks!

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days