Closed conery closed 4 months ago
Hi,
Thank you for your interest in the software.
The TE GTF is re-sorted when the TE index is generated, which is why the line number of the problematic exon record does not match its line number in the GTF that was used as input.
If you take a close look at that annotation, you will notice that there is no value for the class_id attribute. I don't know if this is common to all the MITE entries, but you might have to modify the input file to ensure that those annotations have a class_id value (even if it's something not ultra meaningful).
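As a rough illustration of that fix, here is a minimal Python sketch that gives every record a non-empty class_id. It assumes attributes are written as key "value"; pairs, and the fallback choices (reuse family_id, otherwise a placeholder string) are assumptions for illustration, not something TEtranscripts prescribes. The function name fill_class_id is hypothetical.

```python
import re

# Matches GTF attributes written as: key "value";
ATTR_RE = re.compile(r'(\w+) "([^"]*)";')

def fill_class_id(gtf_line, placeholder="Unknown"):
    """Return the GTF line with a non-empty class_id attribute.

    Falls back to family_id, then to a placeholder (both are assumptions).
    Note: attributes not written as key "value"; are dropped on rewrite.
    """
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return gtf_line  # leave comments and malformed lines untouched
    attrs = dict(ATTR_RE.findall(fields[8]))
    if attrs.get("class_id"):
        return gtf_line  # already has a value, nothing to do
    attrs["class_id"] = attrs.get("family_id") or placeholder
    fields[8] = " ".join(f'{key} "{value}";' for key, value in attrs.items())
    return "\t".join(fields) + "\n"
```

Applying this to each line of the GTF and writing the results to a new file leaves valid records unchanged and only rewrites column 9 of records whose class_id is absent or empty.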
If you're still having issues, I can take a look at your CSV input to see if there's a quick fix.
Thanks.
Ah, that explains the line number business. And yes, all my MITE records are missing a class. I'll go add them and let you know how it goes.
That was it! Thanks again, really appreciate the quick replies.
Hi @olivertam,
sorry for using this closed issue, but I was wondering if you could make the mentioned makeTEgtf.pl script available or if you could specify the format requirements for the TE GTF? I couldn't find it as part of this repository. We're using a non-model organism and would like to use your software.
Thanks!
Hi,
Thank you for your interest in the software. The script is available here. The usage information is as follows:
Usage: makeTEgtf.pl -c [chrom column] -s [start column] -e [stop/end column]
                    -o [strand column] -n [source] -t [TE name column]
                    (-f [TE family column] -C [TE class column] -1)
                    [INFILE]
       makeTEgtf.pl -U [UCSC rmsk table output]
       makeTEgtf.pl -R [RepeatMasker raw output]

Output is printed to STDOUT

Preset parameters
    -U                     - Use settings for UCSC rmsk table output
    -R                     - Use settings for RepeatMasker raw output

Required parameters:
    -c [chrom column]      - Column containing chromosome name
    -s [start column]      - Column containing feature start position
    -e [stop/end column]   - Column containing feature stop/end position
    -o [strand column]     - Column containing strand information (+ or -)
    -t [TE name column]    - Column containing TE name
    [INFILE]               - File name to be processed into GTF

Optional parameters:
    -n [source]            - Source of the TE information
                             (e.g. mm9_rmsk for RepeatMasker track from
                             mm9 mouse genome)
                             Defaults to "user-provided" if not specified
    -f [TE family column]  - Column containing TE family name.
                             Defaults to TE name if not specified
    -C [TE class column]   - Column containing TE class name.
                             Defaults to TE family name if not specified
    -S [score column]      - Column containing the score of the TE prediction
                             (e.g. score from RepeatMasker)
    -1                     - Input coordinates uses 1-based indexing
                             This should be used if the input file uses
                             1-based coordinates. This should be invoked
                             if the genomic coordinates are obtained from
                             a GFF3, GTF, SAM or VCF file
                             Default: off if using BED, BAM or UCSC rmsk
                             input files
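To make the -1 option more concrete: GTF coordinates are 1-based and inclusive, whereas BED-style coordinates are 0-based and half-open, so only the start position shifts when converting. The short Python sketch below illustrates that convention only. It is not taken from makeTEgtf.pl, and the function name to_gtf_coords is hypothetical.

```python
def to_gtf_coords(start, end, one_based=False):
    """Convert an input interval to GTF coordinates (1-based, inclusive).

    one_based=False assumes BED-style input: 0-based start, exclusive end.
    one_based=True assumes the input is already 1-based and inclusive.
    """
    if one_based:
        return start, end
    # A 0-based start becomes 1-based; an exclusive end equals the inclusive end.
    return start + 1, end
```

For example, the first 100 bases of a chromosome are 0..100 in BED terms and 1..100 in GTF terms.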
In brief, the TE GTF format follows the standard GTF format for the first 8 columns. However, TEtranscripts will only process "exon" entries (column 3 is exon), and so each TE insertion is an entry designated as an exon.
For the INFO fields (column 9), TEtranscripts requires the following entries:
gene_id - typically the subfamily/TE name from a repeat database
transcript_id - a unique identifier. The script typically generates one based on the gene_id
family_id - the TE family name. It can be the same as gene_id or class_id if none is available.
class_id - the TE class name. Again, it can be shared with gene_id or family_id if none is available.
You can put other fields here, but the four above must exist with values.
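The requirements above can be checked before running TEtranscripts. The following Python sketch flags lines that are not "exon" entries or that lack a value for any of the four attributes. It assumes tab-separated columns and attributes written as key "value"; pairs. The function names are hypothetical, and this is not how TEtranscripts itself validates its input.

```python
import re

REQUIRED = ("gene_id", "transcript_id", "family_id", "class_id")
# Matches GTF attributes written as: key "value";
ATTR_RE = re.compile(r'(\w+) "([^"]*)";')

def check_te_gtf_line(line):
    """Return a list of problems found in one TE GTF line (empty if none)."""
    if line.startswith("#") or not line.strip():
        return []  # skip comments and blank lines
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return [f"expected 9 tab-separated columns, found {len(fields)}"]
    problems = []
    if fields[2] != "exon":
        problems.append(f'column 3 is "{fields[2]}", not "exon"')
    attrs = dict(ATTR_RE.findall(fields[8]))
    for key in REQUIRED:
        if not attrs.get(key):  # absent or empty string
            problems.append(f"{key} is missing or empty")
    return problems

def check_te_gtf_file(path):
    """Yield (line_number, problem) pairs for every problem in a GTF file."""
    with open(path) as handle:
        for number, line in enumerate(handle, start=1):
            for problem in check_te_gtf_line(line):
                yield number, problem
```

Running check_te_gtf_file on the GTF you pass to TEtranscripts reports line numbers in that input file, which may be easier to follow than positions in the re-sorted index.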
Please let us know if you're having difficulty generating the TE GTF, and we can help troubleshoot.
Thanks.
Awesome, thank you for the quick response :+1:
I used the makeTEgtf.pl script you wrote (thank you very much for sending it) to create a GTF file for my transposon records. I got a few warning messages, but the resulting GTF file looks OK to me.
When I run TEtranscripts I get an error message:
The exon record shown above is not on line 405 in the file (or anywhere near it). It also looks just like several other MITE records. The record that is on line 405 also looks OK. Any idea what's going on?
I can send the complete GTF file, or the CSV file I used as input to the Perl script, if that would help.