Closed dpryan79 closed 6 years ago
BTW, this effectively abbrogates #1 from the RNAseq side.
Thanks for your contribution! We have several test tests we'd like to run before we merge your code into the test branch.
TEtranscripts_out_class.txt TEtranscripts_out_family.txt TEtranscripts_out_gene.txt TEtranscripts_out_name.txt
Test Results:
13:45:13 dmolik@maritimus.cshl {~/tetoolkit}$ TEcount -b ~/Downloads/test_data_SE_control.bam --GTF ~/Downloads/dm3_refGene.gtf --TE ~/Downloads/dm3_rmsk_TE.gtf
INFO @ Tue, 24 May 2016 13:45:34:
# ARGUMENTS LIST:
# name = TEtranscripts_out
# treatment files = /Users/dmolik/Downloads/test_data_SE_control.bam
# control files =
# GTF file = /Users/dmolik/Downloads/dm3_refGene.gtf
# TE file = /Users/dmolik/Downloads/dm3_rmsk_TE.gtf
# multi-mapper mode = multi
# stranded = yes
# normalization = NA (rpm: Reads Per Million mapped; quant: Quantile normalization)
# FDR cutoff = 1.00e+00
# fold-change cutoff = 0.00
# read count cutoff = 0
# number of iteration = 100
# Alignments grouped by read ID = True
INFO @ Tue, 24 May 2016 13:45:34: Processing GTF files ...
INFO @ Tue, 24 May 2016 13:45:34: Building gene index .......
100000 GTF lines processed.
INFO @ Tue, 24 May 2016 13:45:52: Done building gene index ......
INFO @ Tue, 24 May 2016 13:45:53:
Building TE index .......
INFO @ Tue, 24 May 2016 13:46:02: Done building TE index ......
INFO @ Tue, 24 May 2016 13:46:02:
Reading sample file ...
1000000 alignments processed.
uniq te counts = 487199.0
.......start iterative optimization ..........
multi-reads = 364 total means = 1.33146462749
after normalization toatl means0 = 1.0
SQUAREM iteraton [1]
1/3
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.67217302262
after normalization toatl means = 1.0
2/3
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.82868343589
after normalization toatl means = 1.0
alpha = 1.0, SQUAREM iteraton [2]
1/3
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.90789144035
after normalization toatl means = 1.0
2/3
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.94688592856
after normalization toatl means = 1.0
alpha = 1.45717813976.
Performing a stabilization step.
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.97948495991
after normalization toatl means = 1.0
alpha = 1.45717813976, SQUAREM iteraton [3]
1/3
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.98253528257
after normalization toatl means = 1.0
2/3
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.98422404596
after normalization toatl means = 1.0
alpha = 1.42019425179.
Performing a stabilization step.
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.98587656212
after normalization toatl means = 1.0
alpha = 1.42019425179, SQUAREM iteraton [4]
1/3
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.98618675712
after normalization toatl means = 1.0
2/3
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.98640178912
after normalization toatl means = 1.0
alpha = 2.13636845923.
Performing a stabilization step.
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.98684179272
after normalization toatl means = 1.0
alpha = 2.13636845923, SQUAREM iteraton [5]
1/3
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.98690797474
after normalization toatl means = 1.0
2/3
num of multi reads = 364
total multi counts = 361.0
total multi counts = 361.0
total means = 1.98696791851
after normalization toatl means = 1.0
rNome = OPT_TOL
converge at iteration 5
num of multi reads = 364
total multi counts = 361.0
TE counts total 487560.0
Gene counts total 212935.571825
In library /Users/dmolik/Downloads/test_data_SE_control.bam:
Total annotated reads = 700495.571825
Total non-uniquely mapped reads = 1970
Total unannotated reads = 741078
INFO @ Tue, 24 May 2016 13:46:39: Finished processing sample file
TEtranscripts_out_class.txt TEtranscripts_out_family.txt TEtranscripts_out_gene.txt TEtranscripts_out_name.txt
Paired End Testing
TEcount -b ~/Downloads/test_data_PE_control.bam --GTF ~/Downloads/hg19_refGene.gtf --TE ~/Downloads/hg19_rmsk_TE.gtf --sortByPos
INFO @ Tue, 24 May 2016 15:37:22:
# ARGUMENTS LIST:
# name = TEtranscripts_out
# treatment files = /Users/dmolik/Downloads/test_data_PE_control.bam
# control files =
# GTF file = /Users/dmolik/Downloads/hg19_refGene.gtf
# TE file = /Users/dmolik/Downloads/hg19_rmsk_TE.gtf
# multi-mapper mode = multi
# stranded = yes
# normalization = NA (rpm: Reads Per Million mapped; quant: Quantile normalization)
# FDR cutoff = 1.00e+00
# fold-change cutoff = 0.00
# read count cutoff = 0
# number of iteration = 100
# Alignments grouped by read ID = False
INFO @ Tue, 24 May 2016 15:37:22: Processing GTF files ...
INFO @ Tue, 24 May 2016 15:37:22: Building gene index .......
100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
INFO @ Tue, 24 May 2016 15:38:24: Done building gene index ......
INFO @ Tue, 24 May 2016 15:39:12:
Building TE index .......
INFO @ Tue, 24 May 2016 15:47:05: Done building TE index ......
INFO @ Tue, 24 May 2016 15:47:05:
Reading sample file ...
uniq te counts = 49550.0
.......start iterative optimization ..........
multi-reads = 3096 total means = 21.7195799547
after normalization toatl means0 = 1.0
SQUAREM iteraton [1]
1/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 24.8356495679
after normalization toatl means = 1.0
2/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.2051061849
after normalization toatl means = 1.0
alpha = 1.0, SQUAREM iteraton [2]
1/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.3284359587
after normalization toatl means = 1.0
2/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.38994795
after normalization toatl means = 1.0
alpha = 1.57748488324.
Performing a stabilization step.
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.4639373967
after normalization toatl means = 1.0
alpha = 1.57748488324, SQUAREM iteraton [3]
1/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.4824604822
after normalization toatl means = 1.0
2/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.4983297922
after normalization toatl means = 1.0
alpha = 3.79797985887.
Performing a stabilization step.
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.5749867705
after normalization toatl means = 1.0
alpha = 3.79797985887, SQUAREM iteraton [4]
1/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.5819864319
after normalization toatl means = 1.0
2/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.5881841176
after normalization toatl means = 1.0
alpha = 3.83639201827.
Performing a stabilization step.
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.6214249589
after normalization toatl means = 1.0
alpha = 3.83639201827, SQUAREM iteraton [5]
1/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.6247821151
after normalization toatl means = 1.0
2/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.6277413671
after normalization toatl means = 1.0
alpha = 3.69343873282.
Performing a stabilization step.
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.643106715
after normalization toatl means = 1.0
alpha = 3.69343873282, SQUAREM iteraton [6]
1/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.6447658284
after normalization toatl means = 1.0
2/3
num of multi reads = 3096
total multi counts = 2786.0
total multi counts = 2786.0
total means = 25.6462186578
after normalization toatl means = 1.0
rNome = OPT_TOL
converge at iteration 6
num of multi reads = 3096
total multi counts = 2786.0
TE counts total 52336.0
Gene counts total 393763.474791
In library /Users/dmolik/Downloads/test_data_PE_control.bam:
Total annotated reads = 446099.474791
Total non-uniquely mapped reads = 308767
Total unannotated reads = 164671
INFO @ Tue, 24 May 2016 15:50:15: Finished processing sample file
Hi, sorry for asking a dumb question. May I know how to install TEcount in this case? Thanks!
@jiajiayb You'd need to clone my fork (git clone https://github.com/dpryan79/tetoolkit
), switch to the countOnly
branch (cd tetoolkit; git checkout countOnly
) and then follow the normal installation instructions. Having said that, I haven't followed development of this package in the 1.5 years since I made this pull request, so I don't know if any major advancements would then be missing (one of the authors would have to comment on that).
Was this PR considered for inclusion?
Hi is there a way to specify output folder or file names in command line for TEcount? Thank you!
There's a --prefix
option that can be used for this purpose.
Dear @dpryan79, Thank you for your input on the TEcount program. We have decided to implement a version of TEcount that more closely mimics the TEtranscripts output (e.g. single output file). Thank you again for all your help.
This PR adds
TEcount
, a modified version of TEtranscripts that only generates count tables and doesn't require multiple samples. This is a huge time saver, since this can then be run in parallel on each sample.Another difference with
TEcount
, that you may or may not find useful, is that it presplits the counts into multiple files. So you get one file of gene counts, another of TE families, and so on. These can be used in DESeq2 conveniently withDESeqDatasetFromHTSeqCount()
.