mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

error #88

Closed amitpande74 closed 3 years ago

amitpande74 commented 3 years ago

Hi, I got an error when running the tool:

TEtranscripts --sortByPos --format BAM --mode multi -t /media/amit/Elements/MDC_project/preeclampsia/PE/PE1/sorted.PE1.bam -c /media/amit/Elements/MDC_project/preeclampsia/control/contrl1/sorted.control1.bam --GTF refseq_hg19.gtf --TE GRCh37_Ensembl_rmsk_TE.gtf --project sample_sorted_test
INFO  @ Fri, 09 Apr 2021 14:09:33: 
# ARGUMENTS LIST:
# name = sample_sorted_test
# treatment files = ['/media/amit/Elements/MDC_project/preeclampsia/PE/PE1/sorted.PE1.bam']
# control files = ['/media/amit/Elements/MDC_project/preeclampsia/control/contrl1/sorted.control1.bam']
# GTF file = refseq_hg19.gtf 
# TE file = GRCh37_Ensembl_rmsk_TE.gtf 
# multi-mapper mode = multi 
# stranded = no
# differential analysis using DESeq2
# normalization = DESeq2_default
# FDR cutoff = 5.00e-02
# fold-change cutoff =  1.00
# read count cutoff = 1
# number of iteration = 100
# Alignments grouped by read ID = False

INFO  @ Fri, 09 Apr 2021 14:09:33: Processing GTF files ...

INFO  @ Fri, 09 Apr 2021 14:09:33: Building gene index ....... 

100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
500000 GTF lines processed.
600000 GTF lines processed.
700000 GTF lines processed.
INFO  @ Fri, 09 Apr 2021 14:17:36: Done building gene index ...... 

INFO  @ Fri, 09 Apr 2021 14:17:40: Building TE index ....... 

INFO  @ Fri, 09 Apr 2021 14:20:50: Done building TE index ...... 

INFO  @ Fri, 09 Apr 2021 14:20:50: 
Reading sample files ... 

[E::idx_find_and_load] Could not retrieve index file for '.1617970850.172773.bam'
uniq te counts = 0 
.......start iterative optimization .......... 
TE counts total 0.0 
Gene counts total 28234279.4444442 

In library /media/amit/Elements/MDC_project/preeclampsia/PE/PE1/sorted.PE1.bam: 
Total annotated reads = 28234279 
Total non-uniquely mapped reads = 1540541 
Total unannotated reads = 6209346 

[E::idx_find_and_load] Could not retrieve index file for '.1617973229.090331.bam'
1000000 alignments processed. 
2000000 alignments processed. 
3000000 alignments processed. 
4000000 alignments processed. 
5000000 alignments processed. 
6000000 alignments processed. 
7000000 alignments processed. 
8000000 alignments processed. 
9000000 alignments processed. 
10000000 alignments processed. 
11000000 alignments processed. 
12000000 alignments processed. 
13000000 alignments processed. 
14000000 alignments processed. 
15000000 alignments processed. 
16000000 alignments processed. 
17000000 alignments processed. 
18000000 alignments processed. 
19000000 alignments processed. 
20000000 alignments processed. 
21000000 alignments processed. 
22000000 alignments processed. 
23000000 alignments processed. 
24000000 alignments processed. 
25000000 alignments processed. 
26000000 alignments processed. 
27000000 alignments processed. 
28000000 alignments processed. 
29000000 alignments processed. 
30000000 alignments processed. 
31000000 alignments processed. 
32000000 alignments processed. 
34000000 alignments processed. 
35000000 alignments processed. 
36000000 alignments processed. 
37000000 alignments processed. 
38000000 alignments processed. 
39000000 alignments processed. 
40000000 alignments processed. 
41000000 alignments processed. 
uniq te counts = 0 
.......start iterative optimization .......... 
TE counts total 0.0 
Gene counts total 30887051.014284983 

In library /media/amit/Elements/MDC_project/preeclampsia/control/contrl1/sorted.control1.bam: 
Total annotated reads = 30887051 
Total non-uniquely mapped reads = 2432217 
Total unannotated reads = 6138537 

INFO  @ Fri, 09 Apr 2021 15:24:47: Finished processing sample files 
INFO  @ Fri, 09 Apr 2021 15:24:47: Generating counts table 
INFO  @ Fri, 09 Apr 2021 15:24:47: Performing differential analysis ...

estimating size factors
estimating dispersions
Error in checkForExperimentalReplicates(object, modelMatrix) : 

  The design matrix has the same number of samples and coefficients to fit,
  so estimation of dispersion is not possible. Treating samples
  as replicates was deprecated in v1.20 and no longer supported since v1.22.

Calls: DESeq ... estimateDispersions -> .local -> checkForExperimentalReplicates
Execution halted
INFO  @ Fri, 09 Apr 2021 15:24:54: Done 

Kindly help.

olivertam commented 3 years ago

Hi,

Thank you for your interest in the software.

There are a few things that could be contributing to your error:

1) There are no replicates in your experiment. As DESeq2 indicated in its error message, it can no longer estimate dispersion when there are no replicates. You would need at least one replicate (either treatment or control, optimally both) for DESeq2 to work. However, you can take the count table file (sample_sorted_test.cntTable) and use other DE software that can handle no-replicate experiments.
2) I noticed that you're using hg19 refseq for your gene GTF, but GRCh37 Ensembl for your TE GTF. Please note that although the genome builds are equivalent, they use very different nomenclature for their chromosomes (e.g. chr1 vs 1). Your log shows that no TEs were counted, and this difference in chromosome nomenclature could explain that. If you mapped to hg19, we recommend using the hg19 TE GTF instead.
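One quick way to check the second point is to compare the sequence names in the first column of the two GTF files. The snippet below is only a minimal sketch and not part of TEtranscripts; the function names are made up for illustration, and the two toy GTF lines stand in for the real files, which you would pass as open file handles instead.

```python
def seqnames(gtf_lines):
    """Collect the distinct sequence names (column 1) from an iterable of GTF lines."""
    names = set()
    for line in gtf_lines:
        # Skip comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def naming_style(names):
    """Classify sequence names as 'chr-prefixed', 'bare', 'mixed', or 'empty'."""
    if not names:
        return "empty"
    prefixed = sum(1 for n in names if n.startswith("chr"))
    if prefixed == len(names):
        return "chr-prefixed"
    if prefixed == 0:
        return "bare"
    return "mixed"


def compare_gtfs(gene_lines, te_lines):
    """Summarize the naming style of each GTF and the sequence names they share."""
    gene_names = seqnames(gene_lines)
    te_names = seqnames(te_lines)
    return {
        "gene_style": naming_style(gene_names),
        "te_style": naming_style(te_names),
        "shared": sorted(gene_names & te_names),
    }


# Toy lines for illustration only (not taken from the real annotation files).
toy_gene = ['chr1\trefseq\texon\t100\t200\t.\t+\t.\tgene_id "A";\n']
toy_te = ['1\trmsk\texon\t300\t400\t.\t+\t.\tgene_id "B";\n']
print(compare_gtfs(toy_gene, toy_te))
```

With real files, something like `compare_gtfs(open("refseq_hg19.gtf"), open("GRCh37_Ensembl_rmsk_TE.gtf"))` should work. If the two styles differ and the shared list is empty, the TE annotations cannot match the chromosome names used in the BAM file.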

Also, the "[E::idx_find_and_load] Could not retrieve index file for '.1617970850.172773.bam'" error can be safely ignored. See #82 for more details.

Please let us know if you have other questions.

Thanks.

amitpande74 commented 3 years ago

Dear @olivertam, I have paired-end reads (8 control replicates and 10 diseased). I tried running 1 replicate each (diseased and control). I will try to run all the replicates of the diseased and control datasets together and get back to you. Thanks for the 2nd point, though this was deliberate, as I was too eager to use the tool and see the output first. But thanks for the mentoring and the quick response.

warm regards, Amit.

olivertam commented 3 years ago

Hi Amit,

Thanks for the clarification. Given how many samples you have, you can either run TEtranscripts on all of them (one step, though will take some time because it will process each library in series), or run TEcount (included with TEtranscripts) on each of them separately (thus parallelizing), then join the output from each run together to make a combined count table, and then run differential analysis (e.g. DESeq2) independently. The latter is better if you have lots of libraries (since you can quantify in parallel), but it would require you to manually create the combined count table (e.g. using join), and set up your own differential analysis script (rather than have it generated by TEtranscripts). Just wanted to give you some options if you want to quickly look at your data.
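As a rough illustration of the "join" step, here is a minimal Python sketch that merges per-sample count tables into one. It assumes each TEcount output is a two-column, tab-delimited table with a header row (feature name, then the sample name); please check that against your own .cntTable files. The function names are hypothetical, and filling a missing feature with 0 is a choice made for this sketch.

```python
import csv


def read_counts(lines):
    """Parse one two-column count table; return (sample_name, {feature: count})."""
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    sample = header[1]
    # Counts are kept as strings so they are written back unchanged.
    counts = {row[0]: row[1] for row in reader if row}
    return sample, counts


def merge_counts(tables):
    """Combine (sample, counts) pairs into rows: a header plus one row per feature."""
    samples = [sample for sample, _ in tables]
    features = sorted(set().union(*(counts.keys() for _, counts in tables)))
    rows = [["gene/TE"] + samples]
    for feature in features:
        # Assumption: a feature absent from one table is reported as 0.
        rows.append([feature] + [counts.get(feature, "0") for _, counts in tables])
    return rows


def write_table(rows, path):
    """Write the combined rows as a tab-delimited file."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(rows)
```

Usage would look something like `write_table(merge_counts([read_counts(open(p)) for p in paths]), "combined.cntTable")`, where `paths` is your list of per-sample count tables. The combined table can then be loaded into your own DESeq2 script.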

Thanks.