mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0
206 stars 29 forks source link

TE inconsistency #154

Closed ONAgaganb closed 7 months ago

ONAgaganb commented 8 months ago

Hello, I feel honored to have access to such a great tool! I'm having a bit of a problem when I'm running it, but I don't know how to fix it, and I'm getting a bit of a message repeated in my nohup log: TE inconsistency! 12537667..12544609:B03:12537667..12544609:Gypsy:LTR Error: 1 [Exception type: SystemExit, raised in TEindex.py:380]

I looked at some places where I thought there might be a problem, such as my TE GTF file, and I found the following: B01 EDTA exon 12873 13061 388 - . gene_id "TE_00000334"; transcript_id "TE_00000334"; family_id "Helitron"; class_id "DNA"; gene_name "TE_00000334:TE"; B01 EDTA exon 25753 25891 515 - . gene_id "TE_00000605"; transcript_id "TE_00000605"; family_id "Helitron"; class_id "DNA"; gene_name "TE_00000605:TE"; B01 EDTA exon 31176 31383 1438 - . gene_id "TE_00000424"; transcript_id "TE_00000424"; family_id "Helitron"; class_id "DNA"; gene_name "TE_00000424:TE"; B01 EDTA exon 96334 96417 357 - . gene_id "TE_00000341"; transcript_id "TE_00000341"; family_id "DTC"; class_id "DNA"; gene_name "TE_00000341:TE";

I thought that the presence of ":" in my "gene_name" section might have caused a syntax error, so I removed that section and it became the following: B01 EDTA exon 12873 13061 388 - . gene_id "TE_00000334"; transcript_id "TE_00000334"; family_id "Helitron"; class_id "DNA"; B01 EDTA exon 25753 25891 515 - . gene_id "TE_00000605"; transcript_id "TE_00000605"; family_id "Helitron"; class_id "DNA"; B01 EDTA exon 31176 31383 1438 - . gene_id "TE_00000424"; transcript_id "TE_00000424"; family_id "Helitron"; class_id "DNA"; B01 EDTA exon 96334 96417 357 - . gene_id "TE_00000341"; transcript_id "TE_00000341"; family_id "DTC"; class_id "DNA"; But when I re-run it, the error message is still the same, can you help me to see what the reason is, thank you very much for your help!

Below are my parameters and all the files entered: nohup TEtranscripts -t out_3H_I_FB_rep1.sam -c out_3H_C_FB_rep1.sam --TE /80T/cly/data/for_turn/output_test1013.gtf --GTF /80T/cly/data/ref _data/Bombus_terrestris.Bter_1.0.57.gtf &

ONAgaganb commented 8 months ago

test_1013.gtf.gz mynohup.out.gz

olivertam commented 8 months ago

Hi,

I just checked your GTF, and you have overlapping annotations:

B03     EDTA    exon    12537667        12538304        .       -       .      gene_id "Parent=repeat_region_1"; transcript_id "Parent=repeat_region_1_dup1"; family_id "B03:12537667..12544609"; class_id "B03:12537667..12544609";
B03     EDTA    exon    12537667        12544609        .       -       .      gene_id "Parent=repeat_region_1"; transcript_id "Parent=repeat_region_1_dup2"; family_id "B03:12537667..12544609"; class_id "B03:12537667..12544609";

You can see that the second annotation completely envelopes the first. Are these annotations where one repeat is embedded in another? In that case, you would want to make them non-overlapping, as it is assumed that a genomic region should not belong to two independent TE insertions. You might want to check your original GFF3 file and select entries where column 3 is exon, and see if there are still overlapping annotations.

Hope this helps.

Thanks.

Cheers, Oliver

ONAgaganb commented 8 months ago

Thank you very much for your help, I checked my original GFF3 file and removed all the overlapping annotations, but now there is a new reported error.

TE inconsistency! 635970..633052:B04:635970..633052:Helitron:DNA 
Error: 1 
[Exception type: SystemExit, raised in TEindex.py:380] 
INFO  @ Fri, 20 Oct 2023 15:14:14: 
# ARGUMENTS LIST:
# name = TEtranscripts_out
# treatment files = ['out_3H_I_FB_rep1.sam']
# control files = ['out_3H_C_FB_rep1.sam']
# GTF file = /80T/cly/data/ref_data/Bombus_terrestris.Bter_1.0.57.gtf 
# TE file = /80T/cly/data/for_turn/Bombus_terrestris.fa.mod.EDTA.TEanno.gtf 
# multi-mapper mode = multi 
# stranded = no
# differential analysis using DESeq2
# normalization = DESeq2_default
# FDR cutoff = 5.00e-02
# fold-change cutoff =  1.00
# read count cutoff = 1
# number of iteration = 100
# Alignments grouped by read ID = True

INFO  @ Fri, 20 Oct 2023 15:14:14: Processing GTF files ...

INFO  @ Fri, 20 Oct 2023 15:14:14: Building gene index ....... 

100000 GTF lines processed.
200000 GTF lines processed.
INFO  @ Fri, 20 Oct 2023 15:14:32: Done building gene index ...... 

INFO  @ Fri, 20 Oct 2023 15:14:32: Building TE index ....... 

Error in building TE index 

The previous log file stopped at the "TE inconsistency" part, this time a similar message appeared, but the program continued to work until "Error in building TE index". So I searched the new gtf file for the line it indicated "635970..633052:B04:635970..633052:Helitron:DNA", but it didn't come up, and the next error also left me clueless. Your help would be greatly appreciated, thank you.

my new log file nohup.txt my new GFF3 file processed_Bombus_terrestris.fa.mod.EDTA.TEanno .zip

olivertam commented 8 months ago

Hi,

It appears that your GTF file has a lot of errors in it. There are lines that were warnings from the makeTEgtf.pl script:

Strand information on line 3451 (.) is not recognized. Only "+", "-" or "C" are accepted
Line 3451 is skipped

Unfortunately, these were printed within the actual entries, and thus breaking them up. I'm suspecting that you might be putting both the standard output (STDOUT) and standard error (STDERR) into the same output file (either because of nohup or setting the HPC logs to the same file. I would recommend splitting them up so that you're not accidentally breaking up lines of output with warning messages.

Thanks.

github-actions[bot] commented 7 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days