mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

TEcount #5

Closed dpryan79 closed 6 years ago

dpryan79 commented 8 years ago

This PR adds TEcount, a modified version of TEtranscripts that only generates count tables and doesn't require multiple samples. This is a huge time saver, since this can then be run in parallel on each sample.

Another difference with TEcount, that you may or may not find useful, is that it presplits the counts into multiple files. So you get one file of gene counts, another of TE families, and so on. These can be used in DESeq2 conveniently with DESeqDatasetFromHTSeqCount().

dpryan79 commented 8 years ago

BTW, this effectively abbrogates #1 from the RNAseq side.

molikd commented 8 years ago

Thanks for your contribution! We have several test tests we'd like to run before we merge your code into the test branch.

molikd commented 8 years ago

TEtranscripts_out_class.txt TEtranscripts_out_family.txt TEtranscripts_out_gene.txt TEtranscripts_out_name.txt

Test Results:

13:45:13 dmolik@maritimus.cshl {~/tetoolkit}$ TEcount -b ~/Downloads/test_data_SE_control.bam --GTF ~/Downloads/dm3_refGene.gtf --TE ~/Downloads/dm3_rmsk_TE.gtf
INFO  @ Tue, 24 May 2016 13:45:34: 
# ARGUMENTS LIST:
# name = TEtranscripts_out
# treatment files = /Users/dmolik/Downloads/test_data_SE_control.bam
# control files = 
# GTF file = /Users/dmolik/Downloads/dm3_refGene.gtf 
# TE file = /Users/dmolik/Downloads/dm3_rmsk_TE.gtf 
# multi-mapper mode = multi 
# stranded = yes 
# normalization = NA (rpm: Reads Per Million mapped; quant: Quantile normalization)
# FDR cutoff = 1.00e+00
# fold-change cutoff =  0.00
# read count cutoff = 0
# number of iteration = 100
# Alignments grouped by read ID = True

INFO  @ Tue, 24 May 2016 13:45:34: Processing GTF files ...

INFO  @ Tue, 24 May 2016 13:45:34: Building gene index .......

100000 GTF lines processed.
INFO  @ Tue, 24 May 2016 13:45:52: Done building gene index ......

INFO  @ Tue, 24 May 2016 13:45:53: 
Building TE index .......

INFO  @ Tue, 24 May 2016 13:46:02: Done building TE index ......

INFO  @ Tue, 24 May 2016 13:46:02: 
Reading sample file ...

1000000 alignments  processed.
uniq te counts = 487199.0 
.......start iterative optimization ..........
multi-reads = 364 total means = 1.33146462749
after normalization toatl means0 = 1.0
SQUAREM iteraton [1]
1/3
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.67217302262
after normalization toatl means = 1.0
2/3
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.82868343589
after normalization toatl means = 1.0
alpha = 1.0, SQUAREM iteraton [2]
1/3
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.90789144035
after normalization toatl means = 1.0
2/3
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.94688592856
after normalization toatl means = 1.0
alpha = 1.45717813976.
 Performing a stabilization step.
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.97948495991
after normalization toatl means = 1.0
alpha = 1.45717813976, SQUAREM iteraton [3]
1/3
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.98253528257
after normalization toatl means = 1.0
2/3
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.98422404596
after normalization toatl means = 1.0
alpha = 1.42019425179.
 Performing a stabilization step.
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.98587656212
after normalization toatl means = 1.0
alpha = 1.42019425179, SQUAREM iteraton [4]
1/3
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.98618675712
after normalization toatl means = 1.0
2/3
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.98640178912
after normalization toatl means = 1.0
alpha = 2.13636845923.
 Performing a stabilization step.
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.98684179272
after normalization toatl means = 1.0
alpha = 2.13636845923, SQUAREM iteraton [5]
1/3
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.98690797474
after normalization toatl means = 1.0
2/3
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.98696791851
after normalization toatl means = 1.0
rNome = OPT_TOL 
converge at iteration 5
num of multi reads = 364
total multi counts = 361.0
TE counts total 487560.0
Gene counts total 212935.571825

In library /Users/dmolik/Downloads/test_data_SE_control.bam:
Total annotated reads = 700495.571825
Total non-uniquely mapped reads = 1970
Total unannotated reads = 741078

INFO  @ Tue, 24 May 2016 13:46:39: Finished processing sample file
molikd commented 8 years ago

TEtranscripts_out_class.txt TEtranscripts_out_family.txt TEtranscripts_out_gene.txt TEtranscripts_out_name.txt

Paired End Testing

TEcount -b ~/Downloads/test_data_PE_control.bam --GTF ~/Downloads/hg19_refGene.gtf --TE ~/Downloads/hg19_rmsk_TE.gtf --sortByPos
INFO  @ Tue, 24 May 2016 15:37:22: 
# ARGUMENTS LIST:
# name = TEtranscripts_out
# treatment files = /Users/dmolik/Downloads/test_data_PE_control.bam
# control files = 
# GTF file = /Users/dmolik/Downloads/hg19_refGene.gtf 
# TE file = /Users/dmolik/Downloads/hg19_rmsk_TE.gtf 
# multi-mapper mode = multi 
# stranded = yes 
# normalization = NA (rpm: Reads Per Million mapped; quant: Quantile normalization)
# FDR cutoff = 1.00e+00
# fold-change cutoff =  0.00
# read count cutoff = 0
# number of iteration = 100
# Alignments grouped by read ID = False

INFO  @ Tue, 24 May 2016 15:37:22: Processing GTF files ...

INFO  @ Tue, 24 May 2016 15:37:22: Building gene index .......

100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
INFO  @ Tue, 24 May 2016 15:38:24: Done building gene index ......

INFO  @ Tue, 24 May 2016 15:39:12: 
Building TE index .......

INFO  @ Tue, 24 May 2016 15:47:05: Done building TE index ......

INFO  @ Tue, 24 May 2016 15:47:05: 
Reading sample file ...

uniq te counts = 49550.0 
.......start iterative optimization ..........
multi-reads = 3096 total means = 21.7195799547
after normalization toatl means0 = 1.0
SQUAREM iteraton [1]
1/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 24.8356495679
after normalization toatl means = 1.0
2/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.2051061849
after normalization toatl means = 1.0
alpha = 1.0, SQUAREM iteraton [2]
1/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.3284359587
after normalization toatl means = 1.0
2/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.38994795
after normalization toatl means = 1.0
alpha = 1.57748488324.
 Performing a stabilization step.
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.4639373967
after normalization toatl means = 1.0
alpha = 1.57748488324, SQUAREM iteraton [3]
1/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.4824604822
after normalization toatl means = 1.0
2/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.4983297922
after normalization toatl means = 1.0
alpha = 3.79797985887.
 Performing a stabilization step.
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.5749867705
after normalization toatl means = 1.0
alpha = 3.79797985887, SQUAREM iteraton [4]
1/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.5819864319
after normalization toatl means = 1.0
2/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.5881841176
after normalization toatl means = 1.0
alpha = 3.83639201827.
 Performing a stabilization step.
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.6214249589
after normalization toatl means = 1.0
alpha = 3.83639201827, SQUAREM iteraton [5]
1/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.6247821151
after normalization toatl means = 1.0
2/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.6277413671
after normalization toatl means = 1.0
alpha = 3.69343873282.
 Performing a stabilization step.
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.643106715
after normalization toatl means = 1.0
alpha = 3.69343873282, SQUAREM iteraton [6]
1/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.6447658284
after normalization toatl means = 1.0
2/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.6462186578
after normalization toatl means = 1.0
rNome = OPT_TOL 
converge at iteration 6
num of multi reads = 3096
total multi counts = 2786.0
TE counts total 52336.0
Gene counts total 393763.474791

In library /Users/dmolik/Downloads/test_data_PE_control.bam:
Total annotated reads = 446099.474791
Total non-uniquely mapped reads = 308767
Total unannotated reads = 164671

INFO  @ Tue, 24 May 2016 15:50:15: Finished processing sample file 
jiajiayb commented 6 years ago

Hi, sorry for asking a dumb question. May I know how to install TEcount in this case? Thanks!

dpryan79 commented 6 years ago

@jiajiayb You'd need to clone my fork (git clone https://github.com/dpryan79/tetoolkit), switch to the countOnly branch (cd tetoolkit; git checkout countOnly) and then follow the normal installation instructions. Having said that, I haven't followed development of this package in the 1.5 years since I made this pull request, so I don't know if any major advancements would then be missing (one of the authors would have to comment on that).

moskalenko commented 6 years ago

Was this PR considered for inclusion?

varsh1090 commented 6 years ago

Hi is there a way to specify output folder or file names in command line for TEcount? Thank you!

dpryan79 commented 6 years ago

There's a --prefix option that can be used for this purpose.

olivertam commented 6 years ago

Dear @dpryan79, Thank you for your input on the TEcount program. We have decided to implement a version of TEcount that more closely mimics the TEtranscripts output (e.g. single output file). Thank you again for all your help.