mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

Error in EM step due to GTF #89

Closed pillailab closed 3 years ago

pillailab commented 3 years ago

Hello, I have previously run TEtranscripts succesfully on our cluster using TE GTF from your website. I am now trying with a new gtf generated from dfam (generared by the makeTEgtf.pl script. However, it is throwing the following error:

INFO  @ Mon, 17 May 2021 17:19:46: 
# ARGUMENTS LIST:
# name = /home/mp758/scratch60/SF3B1_PRC2_1/April_2021/Drip/Drip_Dfam.TE
# treatment files = ['/home/mp758/scratch60/SF3B1_PRC2_1/April_2021/Drip/Sample_MPG86/MPG86_TE_Aligned.sortedByCoord.out.bam', '/home/mp758/scratch60/SF3B1_PRC2_1/April_2021/Drip/Sample_MPG89/MPG89_TE_Aligned.sortedByCoord.out.bam']
# control files = ['/home/mp758/scratch60/SF3B1_PRC2_1/April_2021/Drip/Sample_MPG80/MPG80_TE_Aligned.sortedByCoord.out.bam', '/home/mp758/scratch60/SF3B1_PRC2_1/April_2021/Drip/Sample_MPG83/MPG83_TE_Aligned.sortedByCoord.out.bam']
# GTF file = /home/mp758/project/GenomeBuilds/Human/GRCh38/gencode.v29.annotation.gtf 
# TE file = /home/mp758/project/GenomeBuilds/Human/Dfam/hg38_dfam.nrph.gtf 
# multi-mapper mode = multi 
# stranded = yes
# differential analysis using DESeq2
# normalization = DESeq2_default
# FDR cutoff = 5.00e-02
# fold-change cutoff =  1.00
# read count cutoff = 1
# number of iteration = 100
# Alignments grouped by read ID = False

INFO  @ Mon, 17 May 2021 17:19:48: Processing GTF files ...

INFO  @ Mon, 17 May 2021 17:19:48: Building gene index .......

100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
500000 GTF lines processed.
600000 GTF lines processed.
700000 GTF lines processed.
800000 GTF lines processed.
900000 GTF lines processed.
1000000 GTF lines processed.
1100000 GTF lines processed.
1200000 GTF lines processed.
INFO  @ Mon, 17 May 2021 17:29:30: Done building gene index ......

INFO  @ Mon, 17 May 2021 17:29:54: 
Building TE index .......

INFO  @ Mon, 17 May 2021 17:34:41: Done building TE index ......

INFO  @ Mon, 17 May 2021 17:34:41: 
Reading sample files ...

uniq te counts = 5291568.0 
.......start iterative optimization ..........
multi-reads = 120631 Error in optimization: the TE does not exist!
Error in optimization
Error occurred when assigning read gene/TE annotations in file /home/mp758/scratch60/SF3B1_PRC2_1/April_2021/Drip/Sample_MPG86/MPG86_TE_Aligned.sortedByCoord.out.bam.

Thank your for your help.

olivertam commented 3 years ago

Hi,

Thank you for your interest in the software. Based on the error, it appears that there might be an issue with the TE GTF file, where it cannot obtain the length of one (or more) of the elements. Would you be willing to share the custom TE GTF? We can see if we could find the potential error.

Thanks.

pillailab commented 3 years ago

Thanks Oliver, how would you like to share it ( the gzip file is 105 Mb) ? Here are the first few lines: chr1 DFAM_hg38 exon 10955 10464 . - . gene_id "TAR1"; transcript_id "TAR1"; family_id "TAR1"; class_id "TAR1"; chr1 DFAM_hg38 exon 11464 10826 . - . gene_id "TAR1"; transcript_id "TAR1_dup1"; family_id "TAR1"; class_id "TAR1"; chr1 DFAM_hg38 exon 11677 11502 . - . gene_id "L1MC4a_3end"; transcript_id "L1MC4a_3end"; family_id "L1MC4a_3end"; class_id "L1MC4a_3end"; chr1 DFAM_hg38 exon 11781 11677 . - . gene_id "MER5B"; transcript_id "MER5B"; family_id "MER5B"; class_id "MER5B"; chr1 DFAM_hg38 exon 15354 15265 . - . gene_id "MIR1_Amn"; transcript_id "MIR1_Amn"; family_id "MIR1_Amn"; class_id "MIR1_Amn"; chr1 DFAM_hg38 exon 16460 16362 . - . gene_id "Charlie15a"; transcript_id "Charlie15a"; family_id "Charlie15a"; class_id "Charlie15a"; chr1 DFAM_hg38 exon 18419 18649 . + . gene_id "L1M2b_5end"; transcript_id "L1M2b_5end"; family_id "L1M2b_5end"; class_id "L1M2b_5end"; chr1 DFAM_hg38 exon 18909 19049 . + . gene_id "L2b_3end"; transcript_id "L2b_3end"; family_id "L2b_3end"; class_id "L2b_3end"; chr1 DFAM_hg38 exon 18958 19048 . + . gene_id "L2a_3end"; transcript_id "L2a_3end"; family_id "L2a_3end"; class_id "L2a_3end"; chr1 DFAM_hg38 exon 19973 20727 . + . gene_id "L3"; transcript_id "L3"; family_id "L3"; class_id "L3"; chr1 DFAM_hg38 exon 21951 22344 . + . gene_id "MLT1K"; transcript_id "MLT1K"; family_id "MLT1K"; class_id "MLT1K"; chr1 DFAM_hg38 exon 23362 23121 . - . gene_id "MIR"; transcript_id "MIR"; family_id "MIR"; class_id "MIR"; chr1 DFAM_hg38 exon 23838 24010 . + . gene_id "L2b_3end"; transcript_id "L2b_3end_dup1"; family_id "L2b_3end"; class_id "L2b_3end"; chr1 DFAM_hg38 exon 24074 24249 . + . gene_id "MIR"; transcript_id "MIR_dup1"; family_id "MIR"; class_id "MIR"; chr1 DFAM_hg38 exon 24262 24447 . + . gene_id "L2a_3end"; transcript_id "L2a_3end_dup1"; family_id "L2a_3end"; class_id "L2a_3end"; chr1 DFAM_hg38 exon 25285 25182 . - . gene_id "MIRb"; transcript_id "MIRb"; family_id "MIRb"; class_id "MIRb"; chr1 DFAM_hg38 exon 25893 25959 . + . gene_id "MIR"; transcript_id "MIR_dup2"; family_id "MIR"; class_id "MIR"; chr1 DFAM_hg38 exon 26358 26428 . + . gene_id "MIR1_Amn"; transcript_id "MIR1_Amn_dup1"; family_id "MIR1_Amn"; class_id "MIR1_Amn"; chr1 DFAM_hg38 exon 26775 26574 . - . gene_id "L2c_3end"; transcript_id "L2c_3end"; family_id "L2c_3end"; class_id "L2c_3end"; chr1 DFAM_hg38 exon 26792 27062 . + . gene_id "AluSp"; transcript_id "AluSp"; family_id "AluSp"; class_id "AluSp"; chr1 DFAM_hg38 exon 27221 27058 . - . gene_id "L2c_3end"; transcript_id "L2c_3end_dup1"; family_id "L2c_3end"; class_id "L2c_3end"; chr1 DFAM_hg38 exon 27283 27541 . + . gene_id "MER33"; transcript_id "MER33"; family_id "MER33"; class_id "MER33"; chr1 DFAM_hg38 exon 27987 27833 . - . gene_id "MIRc"; transcript_id "MIRc"; family_id "MIRc"; class_id "MIRc"; chr1 DFAM_hg38 exon 28301 28151 . - . gene_id "MIRb"; transcript_id "MIRb_dup1"; family_id "MIRb"; class_id "MIRb"; chr1 DFAM_hg38 exon 29902 30196 . + . gene_id "L1MB3_3end"; transcript_id "L1MB3_3end"; family_id "L1MB3_3end"; class_id "L1MB3_3end"; chr1 DFAM_hg38 exon 30533 30343 . - . gene_id "MER53"; transcript_id "MER53"; family_id "MER53"; class_id "MER53";

olivertam commented 3 years ago

Hi,

Thanks for sharing the excerpt. The first thing that I noticed is that for all the annotations on the - strand, the start position is larger than the stop position. I'm sure that this was the output from Dfam, but given that there is an expectation that the stop position is larger than the start position (at least based on my understanding of GTF), the length calculation would now give negative values. This is what is causing the error. I would recommend trying to fix that, and see if that resolves the error.

Thanks.

pillailab commented 3 years ago

Thank you, that indeed was the error, i had not checked the stop/ start from the gtf since it was generated by the perl script. Flipping the coordinates for the negative strands fixed it. Manoj