mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

TE GTF format error #180

Closed: conery closed this issue 4 months ago

conery commented 4 months ago

I used the makeTEgtf.pl script you wrote (thank you very much for sending it) to create a GTF file for my transposon records. I got a few warning messages, but the resulting GTF file looks OK to me.

When I run TEtranscripts I get an error message:

INFO  @ Mon, 05 Feb 2024 12:28:27: Building TE index ....... 

N2_chrI reasonaTE   exon    921649  921779  .   +   .   gene_id "MITE"; transcript_id "MITE_dup6"; family_id "DNATransposon"; class_id ""; gene_name "MITE:TE"; 
TE GTF format error! There is no annotation at line 405. 
Error in building TE index 

The exon record shown above is not on line 405 in the file (or anywhere near it). It also looks just like several other MITE records. The record that is on line 405 also looks OK. Any idea what's going on?

I can send the complete GTF file, or the CSV file I used as input to the Perl script, if that would help.

olivertam commented 4 months ago

Hi,

Thank you for your interest in the software. The TE GTF is re-sorted when the TE index is generated, so the line number reported in the error won't match the line number in the GTF file you supplied as input. If you take a close look at that annotation, you will notice that there is no value for the class_id attribute. I don't know if this is common to all the MITE entries, but you might have to modify the input file to ensure that those annotations have a class_id value (even if it's not especially meaningful).
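A quick way to locate records like this in the original (unsorted) file is to scan column 9 for attributes with empty values. The following is only a minimal sketch, not part of TEtranscripts: the function name `find_empty_attributes` is hypothetical, it assumes tab-separated GTF columns, and the second sample record is made up for illustration.

```python
import re

# Matches key "value" pairs in GTF column 9, including empty values such as class_id "".
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def find_empty_attributes(lines):
    """Yield (line_number, key) for every attribute whose value is empty.

    Line numbers refer to the input as given, i.e. before any re-sorting.
    """
    for lineno, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        for key, value in ATTR_RE.findall(fields[8]):
            if value == "":
                yield lineno, key

# First record is the one from the error message; second is a hypothetical valid record.
sample = [
    'N2_chrI\treasonaTE\texon\t921649\t921779\t.\t+\t.\t'
    'gene_id "MITE"; transcript_id "MITE_dup6"; family_id "DNATransposon"; '
    'class_id ""; gene_name "MITE:TE";\n',
    'N2_chrI\treasonaTE\texon\t100\t200\t.\t+\t.\t'
    'gene_id "A"; transcript_id "A_dup1"; family_id "F"; class_id "DNA";\n',
]
empty = list(find_empty_attributes(sample))
print(empty)
```

To check a real file, pass an open file handle instead of the `sample` list, e.g. `find_empty_attributes(open("my_TE.gtf"))` (file name hypothetical).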

If you're still having issues, I can take a look at your CSV input to see if there's a quick fix.

Thanks.

conery commented 4 months ago

Ah, that explains the line number business. And yes, all my MITE records are missing a class. I'll go add them and let you know how it goes.

conery commented 4 months ago

That was it! Thanks again, really appreciate the quick replies.

mobilegenome commented 4 months ago

Hi @olivertam,

Sorry for using this closed issue, but I was wondering if you could make the makeTEgtf.pl script mentioned above available, or specify the format requirements for the TE GTF? I couldn't find it in this repository. We're working with a non-model organism and would like to use your software.

Thanks!

olivertam commented 4 months ago

Hi,

Thank you for your interest in the software. The script is available here. The usage information is as follows:

 Usage: makeTEgtf.pl -c [chrom column] -s [start column] -e [stop/end column] 
                     -o [strand column] -n [source] -t [TE name column] 
                     (-f [TE family column] -C [TE class column] -1)
                     [INFILE]
        makeTEgtf.pl -U [UCSC rmsk table output]
        makeTEgtf.pl -R [RepeatMasker raw output]

 Output is printed to STDOUT

 Preset parameters
   -U                   -    Use settings for UCSC rmsk table output
   -R                   -    Use settings for RepeatMasker raw output

 Required parameters:
  -c [chrom column]     -    Column containing chromosome name
  -s [start column]     -    Column containing feature start position
  -e [stop/end column]  -    Column containing feature stop/end position
  -o [strand column]    -    Column containing strand information (+ or -)
  -t [TE name column]   -    Column containing TE name
  [INFILE]              -    File name to be processed into GTF

 Optional parameters:
  -n [source]           -    Source of the TE information 
                             (e.g. mm9_rmsk for RepeatMasker track from
                              mm9 mouse genome)
                             Defaults to "user-provided" if not specified
  -f [TE family column] -    Column containing TE family name. 
                             Defaults to TE name if not specified
  -C [TE class column]  -    Column containing TE class name. 
                             Defaults to TE family name if not specified
  -S [score column]     -    Column containing the score of the TE prediction
                             (e.g. score from RepeatMasker)
  -1                    -    Input coordinates uses 1-based indexing
                             This should be used if the input file uses
                             1-based coordinates. This should be invoked
                             if the genomic coordinates are obtained from
                             a GFF3, GTF, SAM or VCF file
                             Default: off if using BED, BAM or UCSC rmsk
                                      input files
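To illustrate why the -1 flag matters: GTF coordinates are 1-based and inclusive, while BED and UCSC table coordinates are 0-based and half-open, so only the start position differs between the two. The sketch below shows that general convention in Python. It is not taken from makeTEgtf.pl, and the helper name `to_gtf_coords` is hypothetical.

```python
def to_gtf_coords(start, end, one_based):
    """Return (start, end) as 1-based inclusive GTF coordinates.

    one_based=True  : input is already 1-based inclusive (e.g. GFF3, GTF).
    one_based=False : input is 0-based half-open (e.g. BED, UCSC tables),
                      so only the start needs shifting by one.
    """
    if one_based:
        return start, end
    return start + 1, end

# The same feature expressed in both conventions maps to the same GTF interval.
from_bed = to_gtf_coords(921648, 921779, one_based=False)
from_gtf = to_gtf_coords(921649, 921779, one_based=True)
print(from_bed, from_gtf)
```

Getting this wrong shifts every TE start by one base, which is easy to miss because the output still looks like a valid GTF.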

In brief, the TE GTF format follows the standard GTF format for the first 8 columns. However, TEtranscripts only processes "exon" entries (column 3 is "exon"), so each TE insertion must be entered as an exon. For the INFO field (column 9), TEtranscripts requires the following entries:

gene_id "[TE name]"; transcript_id "[TE copy name]"; family_id "[TE family]"; class_id "[TE class]";

You can put other fields here, but the four above must exist with values.
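If you are generating the GTF yourself rather than using the Perl script, something along these lines could serve as a starting point. This is a rough sketch under assumptions: the four required attributes are taken to be gene_id, transcript_id, family_id and class_id (as in the example record earlier in this thread, with gene_name treated as an optional extra), and the function name `make_te_gtf_line` is hypothetical.

```python
# Assumed required attributes, based on the example record in this thread.
REQUIRED = ("gene_id", "transcript_id", "family_id", "class_id")

def make_te_gtf_line(chrom, source, start, end, strand, attrs, score="."):
    """Format one TE insertion as a tab-separated GTF "exon" record.

    Raises ValueError if any required attribute is missing or empty,
    which is the situation that triggered the error in this issue.
    """
    missing = [key for key in REQUIRED if not attrs.get(key)]
    if missing:
        raise ValueError("missing or empty attribute(s): " + ", ".join(missing))
    info = " ".join(f'{key} "{value}";' for key, value in attrs.items())
    cols = [chrom, source, "exon", str(start), str(end), str(score), strand, ".", info]
    return "\t".join(cols)

# Class value "DNA" is a placeholder chosen for illustration.
line = make_te_gtf_line(
    "N2_chrI", "reasonaTE", 921649, 921779, "+",
    {"gene_id": "MITE", "transcript_id": "MITE_dup6",
     "family_id": "DNATransposon", "class_id": "DNA"},
)
print(line)
```

Failing early on an empty value means the problem is reported against your own input, before TEtranscripts re-sorts the file.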

Please let us know if you're having difficulty generating the TE GTF, and we can help troubleshoot.

Thanks.

mobilegenome commented 4 months ago

Awesome, thank you for the quick response :+1: