mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

Zero TE counts #25

Closed lfra closed 6 years ago

lfra commented 6 years ago

Hello,

I have been using TEtranscript as well as TEcount directly from TEToolkit v2.02 to measure TE expression from BAM files which I previously aligned using STAR (according to the recommendations you gave for STAR as an aligner). I have tried both unsorted as well as readID/name sorted BAM files as well as position-sorted BAM files including the --sortbyPos option. In all the cases I get a count list which is fine for the annotated genes, however the TE counts are always 0 in all scenarios tested and I wondered what could be the problem? There is no error and it seems to successfully build both the genic and the TE index. Maybe you can have a look a my log file (TEcount for one exemplary readID sorted BAM file) Would be great! Thanks in advance.

TEcount 2.0.2 INFO @ Tue, 29 May 2018 18:25:16: ARGUMENTS LIST: name = TEcount_HeLa_LT_ASF1_tr1 BAM file = /media/bq-lsdf/lukas/results/STAR/data_Delia/2828_Hela_LT_siASF1_EtOH_1/Aligned_out_sortedName.bam GTF file = /media/bq-lsdf/lukas/annotation/Homo_sapiens.GRCh37.75.gtf TE file = /media/bq-lsdf/lukas/annotation/hg19_rmsk_TE.gtf multi-mapper mode = multi stranded = yes number of iteration = 100 Alignments grouped by read ID = True

INFO @ Tue, 29 May 2018 18:25:16: Processing GTF files ...

INFO @ Tue, 29 May 2018 18:25:16: Building gene index .......

100000 GTF lines processed. 200000 GTF lines processed. 300000 GTF lines processed. 400000 GTF lines processed. 500000 GTF lines processed. 600000 GTF lines processed. 700000 GTF lines processed. 800000 GTF lines processed. 900000 GTF lines processed. 1000000 GTF lines processed. 1100000 GTF lines processed. 1200000 GTF lines processed. 1300000 GTF lines processed. INFO @ Tue, 29 May 2018 18:35:54: Done building gene index ......

INFO @ Tue, 29 May 2018 18:35:59: Building TE index .......

INFO @ Tue, 29 May 2018 18:41:32: Done building TE index ......

INFO @ Tue, 29 May 2018 18:41:32: Reading sample file ...

4000000 alignments processed. 5000000 alignments processed. 6000000 alignments processed. 8000000 alignments processed. 9000000 alignments processed. 11000000 alignments processed. 12000000 alignments processed. 13000000 alignments processed. 16000000 alignments processed. 18000000 alignments processed. 19000000 alignments processed. 22000000 alignments processed. 24000000 alignments processed. 25000000 alignments processed. 26000000 alignments processed. 28000000 alignments processed. 29000000 alignments processed. 30000000 alignments processed. 31000000 alignments processed. 32000000 alignments processed. 34000000 alignments processed. 35000000 alignments processed. 36000000 alignments processed. 37000000 alignments processed. 38000000 alignments processed. 39000000 alignments processed. 40000000 alignments processed. 41000000 alignments processed. 42000000 alignments processed. 43000000 alignments processed. 44000000 alignments processed. 45000000 alignments processed. 46000000 alignments processed. 48000000 alignments processed. 49000000 alignments processed. 50000000 alignments processed. 51000000 alignments processed. 52000000 alignments processed. 53000000 alignments processed. 56000000 alignments processed. 57000000 alignments processed. uniq te counts = 0.0 .......start iterative optimization .......... TE counts total 0.0 Gene counts total 1299163.10675

In library /media/bq-lsdf/lukas/results/STAR/data_Delia/2828_Hela_LT_siASF1_EtOH_1/Aligned_out_sortedName.bam: Total annotated reads = 1299163.10675 Total non-uniquely mapped reads = 7293701 Total unannotated reads = 36799027

INFO @ Tue, 29 May 2018 18:53:24: Finished processing sample file Command being timed: "/opt/python2.7/bin/TEcount -b /media/bq-lsdf/lukas/results/STAR/data_Delia/2828_Hela_LT_siASF1_EtOH_1/Aligned_out_sortedName.bam --format BAM --mode multi --GTF /media/bq-lsdf/lukas/annotation/Homo_sapiens.GRCh37.75.gtf --TE /media/bq-lsdf/lukas/annotation/hg19_rmsk_TE.gtf --project TEcount_HeLa_LT_ASF1_tr1"

olivertam commented 6 years ago

Hi,

Would you be able to tell me where you got the gene GTF file? Is it from Ensembl or UCSC? One likely possibility is that chromosome ID that you're mapping to is not based on the UCSC hg19 nomenclature (which is used for the TE GTF, since it's derived from UCSC). For example: chromosome 1 in Ensembl is "1", whereas it is "chr1" in UCSC hg19. As a result, reads that aligned to "1" would not match "chr1". I will try to make a version of the TE GTF file that uses the Ensembl chromosome nomenclature, and will let you know when it's ready for you to test it out. Thanks.

olivertam commented 6 years ago

Hi, I have created a new TE GTF that should be using the Ensembl chromosome ID. You can download it here. Let me know if this works. Thanks.

lfra commented 6 years ago

Thank you very much. It was indeed only the chromosome format issue and then it worked!