Error in EM step due to GTF

pillailab commented 3 years ago

Hello, I have previously run TEtranscripts succesfully on our cluster using TE GTF from your website. I am now trying with a new gtf generated from dfam (generared by the makeTEgtf.pl script. However, it is throwing the following error:

INFO  @ Mon, 17 May 2021 17:19:46: 
# ARGUMENTS LIST:
# name = /home/mp758/scratch60/SF3B1_PRC2_1/April_2021/Drip/Drip_Dfam.TE
# treatment files = ['/home/mp758/scratch60/SF3B1_PRC2_1/April_2021/Drip/Sample_MPG86/MPG86_TE_Aligned.sortedByCoord.out.bam', '/home/mp758/scratch60/SF3B1_PRC2_1/April_2021/Drip/Sample_MPG89/MPG89_TE_Aligned.sortedByCoord.out.bam']
# control files = ['/home/mp758/scratch60/SF3B1_PRC2_1/April_2021/Drip/Sample_MPG80/MPG80_TE_Aligned.sortedByCoord.out.bam', '/home/mp758/scratch60/SF3B1_PRC2_1/April_2021/Drip/Sample_MPG83/MPG83_TE_Aligned.sortedByCoord.out.bam']
# GTF file = /home/mp758/project/GenomeBuilds/Human/GRCh38/gencode.v29.annotation.gtf 
# TE file = /home/mp758/project/GenomeBuilds/Human/Dfam/hg38_dfam.nrph.gtf 
# multi-mapper mode = multi 
# stranded = yes
# differential analysis using DESeq2
# normalization = DESeq2_default
# FDR cutoff = 5.00e-02
# fold-change cutoff =  1.00
# read count cutoff = 1
# number of iteration = 100
# Alignments grouped by read ID = False

INFO  @ Mon, 17 May 2021 17:19:48: Processing GTF files ...

INFO  @ Mon, 17 May 2021 17:19:48: Building gene index .......

100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
400000 GTF lines processed.
500000 GTF lines processed.
600000 GTF lines processed.
700000 GTF lines processed.
800000 GTF lines processed.
900000 GTF lines processed.
1000000 GTF lines processed.
1100000 GTF lines processed.
1200000 GTF lines processed.
INFO  @ Mon, 17 May 2021 17:29:30: Done building gene index ......

INFO  @ Mon, 17 May 2021 17:29:54: 
Building TE index .......

INFO  @ Mon, 17 May 2021 17:34:41: Done building TE index ......

INFO  @ Mon, 17 May 2021 17:34:41: 
Reading sample files ...

uniq te counts = 5291568.0 
.......start iterative optimization ..........
multi-reads = 120631 Error in optimization: the TE does not exist!
Error in optimization
Error occurred when assigning read gene/TE annotations in file /home/mp758/scratch60/SF3B1_PRC2_1/April_2021/Drip/Sample_MPG86/MPG86_TE_Aligned.sortedByCoord.out.bam.

Thank your for your help.

olivertam commented 3 years ago

Hi,

Thank you for your interest in the software. Based on the error, it appears that there might be an issue with the TE GTF file, where it cannot obtain the length of one (or more) of the elements. Would you be willing to share the custom TE GTF? We can see if we could find the potential error.

Thanks.

pillailab commented 3 years ago

Thanks Oliver, chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 chr1 DFAM_hg38 how would you like to share it ( the gzip file is 105 Mb) ? Here are the first few lines: exon 10955 10464 . - . gene_id "TAR1"; transcript_id "TAR1"; family_id "TAR1"; class_id "TAR1"; exon 11464 10826 . - . gene_id "TAR1"; transcript_id "TAR1_dup1"; family_id "TAR1"; class_id "TAR1"; exon 11677 11502 . - . gene_id "L1MC4a_3end"; transcript_id "L1MC4a_3end"; family_id "L1MC4a_3end"; class_id "L1MC4a_3end"; exon 11781 11677 . - . gene_id "MER5B"; transcript_id "MER5B"; family_id "MER5B"; class_id "MER5B"; exon 15354 15265 . - . gene_id "MIR1_Amn"; transcript_id "MIR1_Amn"; family_id "MIR1_Amn"; class_id "MIR1_Amn"; exon 16460 16362 . - . gene_id "Charlie15a"; transcript_id "Charlie15a"; family_id "Charlie15a"; class_id "Charlie15a"; exon 18419 18649 . + . gene_id "L1M2b_5end"; transcript_id "L1M2b_5end"; family_id "L1M2b_5end"; class_id "L1M2b_5end"; exon 18909 19049 . + . gene_id "L2b_3end"; transcript_id "L2b_3end"; family_id "L2b_3end"; class_id "L2b_3end"; exon 18958 19048 . + . gene_id "L2a_3end"; transcript_id "L2a_3end"; family_id "L2a_3end"; class_id "L2a_3end"; exon 19973 20727 . + . gene_id "L3"; transcript_id "L3"; family_id "L3"; class_id "L3"; exon 21951 22344 . + . gene_id "MLT1K"; transcript_id "MLT1K"; family_id "MLT1K"; class_id "MLT1K"; exon 23362 23121 . - . gene_id "MIR"; transcript_id "MIR"; family_id "MIR"; class_id "MIR"; exon 23838 24010 . + . gene_id "L2b_3end"; transcript_id "L2b_3end_dup1"; family_id "L2b_3end"; class_id "L2b_3end"; exon 24074 24249 . + . gene_id "MIR"; transcript_id "MIR_dup1"; family_id "MIR"; class_id "MIR"; exon 24262 24447 . + . gene_id "L2a_3end"; transcript_id "L2a_3end_dup1"; family_id "L2a_3end"; class_id "L2a_3end"; exon 25285 25182 . - . gene_id "MIRb"; transcript_id "MIRb"; family_id "MIRb"; class_id "MIRb"; exon 25893 25959 . + . gene_id "MIR"; transcript_id "MIR_dup2"; family_id "MIR"; class_id "MIR"; exon 26358 26428 . + . gene_id "MIR1_Amn"; transcript_id "MIR1_Amn_dup1"; family_id "MIR1_Amn"; class_id "MIR1_Amn"; exon 26775 26574 . - . gene_id "L2c_3end"; transcript_id "L2c_3end"; family_id "L2c_3end"; class_id "L2c_3end"; exon 26792 27062 . + . gene_id "AluSp"; transcript_id "AluSp"; family_id "AluSp"; class_id "AluSp"; exon 27221 27058 . - . gene_id "L2c_3end"; transcript_id "L2c_3end_dup1"; family_id "L2c_3end"; class_id "L2c_3end"; exon 27283 27541 . + . gene_id "MER33"; transcript_id "MER33"; family_id "MER33"; class_id "MER33"; exon 27987 27833 . - . gene_id "MIRc"; transcript_id "MIRc"; family_id "MIRc"; class_id "MIRc"; exon 28301 28151 . - . gene_id "MIRb"; transcript_id "MIRb_dup1"; family_id "MIRb"; class_id "MIRb"; exon 29902 30196 . + . gene_id "L1MB3_3end"; transcript_id "L1MB3_3end"; family_id "L1MB3_3end"; class_id "L1MB3_3end"; exon 30533 30343 . - . gene_id "MER53"; transcript_id "MER53"; family_id "MER53"; class_id "MER53";

olivertam commented 3 years ago

Hi,

Thanks for sharing the excerpt. The first thing that I noticed is that for all the annotations on the - strand, the start position is larger than the stop position. I'm sure that this was the output from Dfam, but given that there is an expectation that the stop position is larger than the start position (at least based on my understanding of GTF), the length calculation would now give negative values. This is what is causing the error. I would recommend trying to fix that, and see if that resolves the error.

Thanks.

pillailab commented 3 years ago

Thank you, that indeed was the error, i had not checked the stop/ start from the gtf since it was generated by the perl script. Flipping the coordinates for the negative strands fixed it. Manoj

mhammell-laboratory / TEtranscripts

Error in EM step due to GTF #89