mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

Error in building TE index #201

Closed jsnoliva closed 3 weeks ago

jsnoliva commented 2 months ago

INFO @ Wed, 11 Sep 2024 16:26:23: Processing GTF files ...

INFO @ Wed, 11 Sep 2024 16:26:23: Building gene index .......

100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
INFO @ Wed, 11 Sep 2024 16:27:32: Done building gene index ......

INFO @ Wed, 11 Sep 2024 16:27:33: Building TE index .......

Chr1 EDTA helitron 1133 1242 406 + . gene_id "TE_homo_0"; transcript_id "TE_homo_0"; Name "Os1968"; classification "DNAnona/Helitron"; identity "0.824"; method "homology"; sequence_ontology "SO:0000544";
TE GTF format error! There is no annotation at line 1.
Error in building TE index

olivertam commented 2 months ago

Hi,

Thank you for your interest in the software. It is failing because TEtranscripts expects the following two fields in the INFO portion (column 9): family_id and class_id. They don't have to contain meaningful information (e.g. they could all be "Unknown"), but you could also convert your classification information into family_id and class_id if you like.

Thanks.

jsnoliva commented 2 months ago

Thanks for your response. I changed classification to family_id, but I am still getting the same error.

nohup: ignoring input
INFO  @ Tue, 17 Sep 2024 11:09:09: 
# ARGUMENTS LIST:
# name = TEtranscripts_out
# treatment files = ['SRR17151206_sorted.bam', 'SRR17151214_sorted.bam', 'SRR17151215_sorted.bam']
# control files = ['SRR17151213_sorted.bam', 'SRR17151216_sorted.bam', 'SRR17151217_sorted.bam']
# GTF file = all.gtf 
# TE file = fixed_OS_Rice_MSU7.fasta.mod.EDTA.TEanno.gtf 
# multi-mapper mode = multi 
# stranded = no
# differential analysis using DESeq2
# normalization = DESeq2_default
# FDR cutoff = 5.00e-02
# fold-change cutoff =  1.00
# read count cutoff = 1
# number of iteration = 100
# Alignments grouped by read ID = False

INFO  @ Tue, 17 Sep 2024 11:09:09: Processing GTF files ...

INFO  @ Tue, 17 Sep 2024 11:09:09: Building gene index ....... 

100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
INFO  @ Tue, 17 Sep 2024 11:10:05: Done building gene index ...... 

INFO  @ Tue, 17 Sep 2024 11:10:06: Building TE index ....... 

Chr1    EDTA    helitron        1133    1242    406     +       .       gene_id "TE_homo_0"; ID "TE_homo_0"; Name "Os1968"; family_id "DNAnona/Helitron"; identity "0.824"; method "homology"; sequence_ontology "SO:0000544"; 
TE GTF format error! There is no annotation at line 1. 
Error in building TE index 

This is what the beginning of my .gtf file looks like:

##gtf-version X
# GFF-like GTF i.e. not checked against any GTF specification. Conversion based on GFF input, standardised by AGAT.
##date Fri Aug 23 10:51:26 AM EDT 2024
##This file contains repeats annotated by EDTA v2.2.2 with both structural and homology methods. Repeats can be overlapping due to nested insertions.
##This file follows the ENSEMBL standard: https://useast.ensembl.org/info/website/upload/gff3.html
##Column 3: Sequence Ontology of repeat features. Please refer to the SO database for more details: http://www.sequenceontology.org/. In cases where the SO database does not have the repeat feature, tentative SO names are used, with a full list included in EDTA/bin/TE_Sequence_Ontology.txt (Enhancement notes), and the sequence_ontology in Column 9 uses the closest parent SO.
##Column 7: The Smith-Waterman score generated by RepeatMasker, only available for homology entries.
##Column 9: 
##      ID: unique ID for this feature in the genome.
##      classification: Same as Column 3 but formatted following the RepeatMasker naming convention.
##      sequence_ontology: Sequence Ontology ID of the feature.
##      identity: Sequence identity (0-1) between the library sequence and the target region.
##      ltr_identity: Sequence identity (0-1) between the left and right LTR regions for structurally annotated LTR elements.
##      Name: Repeat family name. Some may be shown as coordinates, which are single-copy and structrually identified elements that are not included in the repeat library.
##      method: Indicate if this entry is produced by structural annotation or homology annotation.
##      motif/TSD/TIR: structural features of structurally annotated LTR and TIR elements.
##For more details about this file, please refer to the EDTA wiki: https://github.com/oushujun/EDTA/wiki/Making-sense-of-EDTA-usage-and-outputs---Q&A
##seqid source sequence_ontology start end score strand phase attributes
ChrSy   AGAT    gene    4       1181    .       +       .       gene_id "agat-gene-677"; ID "agat-gene-677"; Name "Os0376_LTR"; family_id "LTR/Gypsy"; identity "0.776"; method "homology"; sequence_ontology "SO:0002265";
ChrSy   EDTA    Gypsy_LTR_retrotransposon       4       1181    3021    +       .       gene_id "agat-gene-677"; transcript_id "TE_homo_320354"; ID "TE_homo_320354"; Name "Os0376_LTR"; Parent "agat-gene-677"; family_id "LTR/Gypsy"; identity "0.776"; method "homology"; sequence_ontology "SO:0002265";
ChrSy   AGAT    gene    1254    1598    .       +       .       gene_id "agat-gene-678"; ID "agat-gene-678"; Name "Os1598_LTR"; family_id "LTR/Gypsy"; identity "0.871"; method "homology"; sequence_ontology "SO:0002265";
ChrSy   EDTA    Gypsy_LTR_retrotransposon       1254    1598    2014    +       .       gene_id "agat-gene-678"; transcript_id "TE_homo_320355"; ID "TE_homo_320355"; Name "Os1598_LTR"; Parent "agat-gene-678"; family_id "LTR/Gypsy"; identity "0.871"; method "homology"; sequence_ontology "SO:0002265";

olivertam commented 2 months ago

Hi,

You still need the class_id field in the last column. You can either just use a placeholder (e.g. class_id "TE"), or try to split the classification to two entries (e.g. classification "LTR/Gypsy" to family_id "Gypsy"; class_id "LTR")
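
The split described above could be sketched in Python as follows. This is only an illustration, not part of TEtranscripts: the helper name `split_classification` is hypothetical, and reusing the same value for both fields when there is no slash is an arbitrary assumption.

```python
import re


def split_classification(attributes: str) -> str:
    """Rewrite `classification "A/B";` as `family_id "B"; class_id "A";`.

    Hypothetical helper for illustration. If the value contains no slash,
    the same value is used for both fields (an arbitrary assumption).
    """
    def replace(match: re.Match) -> str:
        class_id, _, family_id = match.group(1).partition("/")
        if not family_id:
            family_id = class_id
        return f'family_id "{family_id}"; class_id "{class_id}";'

    return re.sub(r'classification "([^"]*)";', replace, attributes)


# Column 9 of one record, shortened from the file shown earlier in the thread.
column9 = 'gene_id "agat-gene-677"; Name "Os0376_LTR"; classification "LTR/Gypsy";'
print(split_classification(column9))
```

Only the `classification` attribute is touched; every other attribute in column 9 is passed through unchanged.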

Thanks.

jsnoliva commented 2 months ago

I tried both methods, and now it seems to be a different line issue:

100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
INFO  @ Tue, 17 Sep 2024 13:08:22: Done building gene index ...... 

INFO  @ Tue, 17 Sep 2024 13:08:24: Building TE index ....... 

Chr1    EDTA    helitron        1133    1242    406     +       .       gene_id "TE_homo_0"; ID "TE_homo_0"; Name "Os1968"; class_id "DNAnona"; family_id "Helitron"; identity "0.824"; method "homology"; sequence_ontology "SO:0000544"; 
TE GTF format error! There is no annotation at line 1. 
Error in building TE index 
100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
INFO  @ Tue, 17 Sep 2024 13:12:34: Done building gene index ...... 

INFO  @ Tue, 17 Sep 2024 13:12:35: Building TE index ....... 

Chr1    EDTA    helitron        1133    1242    406     +       .       gene_id "TE_homo_0"; class_id "TE"; ID "TE_homo_0"; Name "Os1968"; family_id "DNAnona/Helitron"; identity "0.824"; method "homology"; sequence_ontology "SO:0000544"; 
TE GTF format error! There is no annotation at line 1. 
Error in building TE index 

olivertam commented 2 months ago

Hi,

Sorry, I just realized another issue. The third column should be exon, since that's the entry that is recognized by TEtranscripts to be used for annotation.
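
As a rough sketch of that change, the third column of each record could be rewritten in Python like this. The function name `set_feature_to_exon` is hypothetical, and the sketch assumes a tab-delimited file with exactly 9 columns per record.

```python
def set_feature_to_exon(gtf_line: str) -> str:
    """Return the GTF record with column 3 (feature type) set to "exon".

    Hypothetical helper for illustration. Comment lines (starting with "#")
    and blank lines are returned unchanged.
    """
    if gtf_line.startswith("#") or not gtf_line.strip():
        return gtf_line
    newline = "\n" if gtf_line.endswith("\n") else ""
    columns = gtf_line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-delimited columns, found {len(columns)}")
    columns[2] = "exon"
    return "\t".join(columns) + newline


record = 'Chr1\tEDTA\thelitron\t1133\t1242\t406\t+\t.\tgene_id "TE_homo_0";'
print(set_feature_to_exon(record))
```

Raising an error on a wrong column count is a deliberate choice here, so that a space-delimited file is noticed early instead of being passed through silently.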

Apologies.

jsnoliva commented 1 month ago

That seems to have solved the "TE GTF format error", but the "Error in building TE index" message remains.

INFO  @ Tue, 24 Sep 2024 11:50:40: Processing GTF files ...

INFO  @ Tue, 24 Sep 2024 11:50:40: Building gene index ....... 

100000 GTF lines processed.
200000 GTF lines processed.
300000 GTF lines processed.
INFO  @ Tue, 24 Sep 2024 11:51:37: Done building gene index ...... 

INFO  @ Tue, 24 Sep 2024 11:51:37: Building TE index ....... 

Error in building TE index 

First lines in my file

Chr1 EDTA exon 1133 1242 406 + . gene_id "TE_homo_0"; family_id "Os1968"; class_id "DNAnona/Helitron";
Chr1 EDTA exon 1282 1422 352 + . gene_id "TE_homo_1"; family_id "TE_00001024"; class_id "DNA/Helitron";
Chr1 EDTA exon 1444 1780 919 + . gene_id "TE_homo_2"; family_id "TE_00006580"; class_id "DNA/Helitron";
Chr1 EDTA exon 1855 2027 843 - . gene_id "TE_homo_3"; family_id "TE_00006580"; class_id "DNA/Helitron";
Chr1 EDTA exon 1986 2199 1121 + . gene_id "TE_homo_4"; family_id "TE_00001024"; class_id "DNA/Helitron";
Chr1 EDTA exon 2297 2472 1332 - . gene_id "TE_homo_5"; family_id "Os0073"; class_id "DNAnona/unknown";
Chr1 EDTA exon 2536 2924 . . . gene_id "TE_struc_279"; family_id "Os1667"; class_id "MITE/DTH;  
Chr1 EDTA exon 4579 4700 742 - . gene_id "TE_homo_6"; family_id "TE_00005294"; class_id "DNA/Helitron";
Chr1 EDTA exon 4794 5030 1083 - . gene_id "TE_homo_7"; family_id "TE_00005294"; class_id "DNA/Helitron";
Chr1 EDTA exon 5684 5886 . . . gene_id "TE_struc_280"; family_id "Os2924"; class_id "MITE/DTT;  
Chr1 EDTA exon 8877 9129 693 - . gene_id "TE_homo_8"; family_id "TE_00000098"; class_id "DNA/DTT";
Chr1 EDTA exon 9034 9162 461 + . gene_id "TE_homo_9"; family_id "TE_00006547"; class_id "DNA/DTA";
Chr1 EDTA exon 11019 11107 461 - . gene_id "TE_homo_10"; family_id "Os2076"; class_id "DNAnona/MULE";

olivertam commented 1 month ago

Hi,

Here are some common issues:

  1. Check that your GTF file is tab-delimited for 9 columns, with all the fields in column 9 space-delimited.
  2. Check that you have the following fields in column 9: gene_id, transcript_id (that can be any unique value), family_id and class_id
  3. Check that there is no scientific notation in your start (column 4), end (column 5) and score (column 6).
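
The three checks above can be sketched as a small Python function that reports problems for one record. This is not how TEtranscripts itself validates the file; the name `check_te_gtf_line` is hypothetical, and the unbalanced-quote check at the end is an extra sanity check that is not in the list above.

```python
import re

REQUIRED_FIELDS = ("gene_id", "transcript_id", "family_id", "class_id")
# Matches attributes of the form: key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def check_te_gtf_line(line: str) -> list:
    """Return a list of problems found in one TE GTF record (empty if none)."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        return [f"expected 9 tab-delimited columns, found {len(columns)}"]

    problems = []
    # Check 3: start and end must be plain integers, not scientific notation.
    for name, value in (("start", columns[3]), ("end", columns[4])):
        if not value.isdigit():
            problems.append(f"{name} is not a plain integer: {value!r}")
    if "e" in columns[5].lower():
        problems.append(f"score looks like scientific notation: {columns[5]!r}")

    # Check 2: required fields in column 9.
    fields = dict(ATTRIBUTE.findall(columns[8]))
    for key in REQUIRED_FIELDS:
        if key not in fields:
            problems.append(f"missing {key} in column 9")

    # Extra check (an assumption, not from the list above): balanced quotes.
    if columns[8].count('"') % 2 != 0:
        problems.append("unbalanced double quotes in column 9")
    return problems


record = ('Chr1\tEDTA\texon\t2536\t2924\t.\t.\t.\t'
          'gene_id "TE_struc_279"; family_id "Os1667"; class_id "MITE/DTH;')
for problem in check_te_gtf_line(record):
    print(problem)
```

Running it over every non-comment line of the TE GTF and printing the line number alongside each problem would narrow down which records to inspect.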

If you are still having issues, feel free to share the GTF file and we can troubleshoot it further.

Thanks.

github-actions[bot] commented 4 weeks ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days