mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

Read-counting with TEcount - low counts #46

Closed lfra closed 4 years ago

lfra commented 4 years ago

Hi,

I am currently using TECount on human paired-end RNA-seq data (aligned with STAR according to your multi-mapping recommendations). The alignment seems alright and I get ~50 Mio uniquely mapped reads and ~7 Mio multi-mapped reads per sample. I inspected the mapping with IGV and everything seems fine.

When looking at the TECount output (count table) for one sample, I realized that there was a bit more than 7 Mio counts when summing up over all genes and annotated repeats...

I do not really understand how I get to these numbers...where did the other 50 Mio reads go?

As TE gtf file I used hg38_rmsk.gtf from your website and as genic reference I used hg38_gencode.v31.annotation.gtf (ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.annotation.gtf.gz) - the same file used by STAR for index generation and mapping.

Would be great if you have an idea.

Thanks in advance,

Lukas

lfra commented 4 years ago

Hi,

maybe to add - I had another look at the log file and somehow it tells at the end that most of the reads are "unannotated"? Maybe this helps... Best, Lukas

TEtranscripts 2.0.3 INFO @ Tue, 27 Aug 2019 10:24:22: ARGUMENTS LIST: name = SRR7689162_TE_counts BAM file = /media/bq-lsdf/lukas/retroelement_analaysis/osteosarcoma_RNA-seq/bam_files_STAR/SRR7689162_Aligned.sortedByCoord.out.bam GTF file = /media/bq-lsdf/lukas/annotation/hg38/hg38_gencode.v31.annotation.gtf TE file = /media/bq-lsdf/lukas/annotation/TEtranscripts_TE_GTF_files/hg38_rmsk_TE.gtf multi-mapper mode = multi stranded = yes number of iteration = 100 Alignments grouped by read ID = False

INFO @ Tue, 27 Aug 2019 10:24:22: Processing GTF files ...

INFO @ Tue, 27 Aug 2019 10:24:22: Building gene index .......

100000 GTF lines processed. 200000 GTF lines processed. 300000 GTF lines processed. 400000 GTF lines processed. 500000 GTF lines processed. 600000 GTF lines processed. 700000 GTF lines processed. 800000 GTF lines processed. 900000 GTF lines processed. 1000000 GTF lines processed. 1100000 GTF lines processed. 1200000 GTF lines processed. 1300000 GTF lines processed. INFO @ Tue, 27 Aug 2019 10:32:33: Done building gene index ......

INFO @ Tue, 27 Aug 2019 10:32:41: Building TE index .......

INFO @ Tue, 27 Aug 2019 10:38:09: Done building TE index ......

INFO @ Tue, 27 Aug 2019 10:38:09: Reading sample file ...

uniq te counts = 5158969.0 .......start iterative optimization .......... multi-reads = 182814 total means = 1708.02143484 after normalization toatl means0 = 1.0 SQUAREM iteraton [1] 1/3 num of multi reads = 182814 total multi counts = 151505.0 total multi counts = 151505.0 total means = 2184.49571507 after normalization toatl means = 1.0 2/3 num of multi reads = 182814 total multi counts = 151505.0 total multi counts = 151505.0 total means = 2238.55012381 after normalization toatl means = 1.0 alpha = 1.0, SQUAREM iteraton [2] 1/3 num of multi reads = 182814 total multi counts = 151505.0 total multi counts = 151505.0 total means = 2253.87392351 after normalization toatl means = 1.0 2/3 num of multi reads = 182814 total multi counts = 151505.0 total multi counts = 151505.0 total means = 2260.36433167 after normalization toatl means = 1.0 alpha = 2.3058862496. Performing a stabilization step. num of multi reads = 182814 total multi counts = 151505.0 total multi counts = 151505.0 total means = 2267.10896893 after normalization toatl means = 1.0 alpha = 2.3058862496, SQUAREM iteraton [3] 1/3 num of multi reads = 182814 total multi counts = 151505.0 total multi counts = 151505.0 total means = 2267.7487191 after normalization toatl means = 1.0 2/3 num of multi reads = 182814 total multi counts = 151505.0 total multi counts = 151505.0 total means = 2267.77040716 after normalization toatl means = 1.0 alpha = 2.95722249667. Performing a stabilization step. num of multi reads = 182814 total multi counts = 151505.0 total multi counts = 151505.0 total means = 2267.23508732 after normalization toatl means = 1.0 alpha = 2.95722249667, SQUAREM iteraton [4] 1/3 num of multi reads = 182814 total multi counts = 151505.0 total multi counts = 151505.0 total means = 2267.46762373 after normalization toatl means = 1.0 2/3 num of multi reads = 182814 total multi counts = 151505.0 total multi counts = 151505.0 total means = 2267.49762041 after normalization toatl means = 1.0 alpha = 3.24563603921. Performing a stabilization step. num of multi reads = 182814 total multi counts = 151505.0 total multi counts = 151505.0 total means = 2267.36873254 after normalization toatl means = 1.0 alpha = 3.24563603921, SQUAREM iteraton [5] 1/3 num of multi reads = 182814 total multi counts = 151505.0 total multi counts = 151505.0 total means = 2267.49593529 after normalization toatl means = 1.0 2/3 num of multi reads = 182814 total multi counts = 151505.0 total multi counts = 151505.0 total means = 2267.52760471 after normalization toatl means = 1.0 alpha = 3.79363400207. Performing a stabilization step. num of multi reads = 182814 total multi counts = 151505.0 total multi counts = 151505.0 total means = 2267.50525418 after normalization toatl means = 1.0 alpha = 3.79363400207, SQUAREM iteraton [6] 1/3 num of multi reads = 182814 total multi counts = 151505.0 total multi counts = 151505.0 total means = 2267.6081023 after normalization toatl means = 1.0 2/3 num of multi reads = 182814 total multi counts = 151505.0 total multi counts = 151505.0 total means = 2267.63753878 after normalization toatl means = 1.0 alpha = 4.0. Performing a stabilization step. num of multi reads = 182814 total multi counts = 151505.0 total multi counts = 151505.0 total means = 2267.6260466 after normalization toatl means = 1.0 alpha = 4.0, SQUAREM iteraton [7] 1/3 num of multi reads = 182814 total multi counts = 151505.0 total multi counts = 151505.0 total means = 2267.7185657 after normalization toatl means = 1.0 2/3 num of multi reads = 182814 total multi counts = 151505.0 total multi counts = 151505.0 total means = 2267.74460078 after normalization toatl means = 1.0 alpha = 6.11628003448. Performing a stabilization step. num of multi reads = 182814 total multi counts = 151505.0 total multi counts = 151505.0 total means = 2267.54954976 after normalization toatl means = 1.0 alpha = 6.11628003448, SQUAREM iteraton [8] 1/3 num of multi reads = 182814 total multi counts = 151505.0 total multi counts = 151505.0 total means = 2267.7979822 after normalization toatl means = 1.0 2/3 num of multi reads = 182814 total multi counts = 151505.0 total multi counts = 151505.0 total means = 2267.85303648 after normalization toatl means = 1.0 rNome = OPT_TOL converge at iteration 8 num of multi reads = 182814 total multi counts = 151505.0 TE counts total 5310474.0 Gene counts total 3729228.8247

In library /media/bq-lsdf/lukas/retroelement_analaysis/osteosarcoma_RNA-seq/bam_files_STAR/SRR7689162_Aligned.sortedByCoord.out.bam: Total annotated reads = 9039702.8247 Total non-uniquely mapped reads = 6777060 Total unannotated reads = 58882032

olivertam commented 4 years ago

Hi Lukas,

The parameter that you would want to check is --stranded

Depending on how your library is constructed, the sequenced read (read 1 in paired-end libraries) is either in the same orientation (sense) as the mRNA transcript (--stranded yes), or in the opposite orientation (antisense) as the mRNA transcript (--stranded reverse). If you are using the Illumina TruSeq stranded mRNA kit, they usually generate libraries where the sequenced read is in the antisense orientation. Thus you would choose --stranded reverse

See also HTSeq-count's description, which we based our parameter on:

-s <yes/no/reverse>, --stranded=<yes/no/reverse> For stranded=no, a read is considered overlapping with a feature regardless of whether it is mapped to the same or the opposite strand as the feature. For stranded=yes and single-end reads, the read has to be mapped to the same strand as the feature. For paired-end reads, the first read has to be on the same strand and the second read on the opposite strand. For stranded=reverse, these rules are reversed.

A simple check to see what --stranded mode is appropriate is to check which one had a higher count for a broadly expressed gene (e.g. Gapdh, beta-actin). If you quantified way more beta actin reads from the --stranded reverse run, then that's probably the correct mode to use.

Please let me know if this doesn't resolve the issue, or if you have further questions. Thanks

lfra commented 4 years ago

Thank you very much. I did a test run and it worked! Best, Lukas