mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

Use the GFF3 format file to get the TE GTF format needed for TEtranscript #152

Closed ONAgaganb closed 7 months ago

ONAgaganb commented 8 months ago

Hello, I feel honored to be using such a remarkable tool, but I have run into a problem that is causing me a lot of frustration. I used EDTA to produce a TE annotation file in GFF3 format (Bombus_terrestris.fa.mod.EDTA.TEanno.gff3), and I would like to use the makeTEgtf.pl script you provide to convert it into the GTF format that TEtranscripts requires for TE differential expression analysis. However, the conversion went wrong: every attribute in the resulting GTF file contains the entire column 9 of the GFF3 record, for example:

B11 2 exon 16357640 16357868 1822 + . gene_id "ID=TE_homo_32101;Name=TE_00000590;Classification=DNA/Helitron;Sequence_ontology=SO:0000544;Identity=0.934;Method=homology"; transcript_id "ID=TE_homo_32101;Name=TE_00000590;Classification=DNA/Helitron;Sequence_ontology=SO:0000544;Identity=0.934;Method=homology"; family_id "ID=TE_homo_32101;Name=TE_00000590;Classification=DNA/Helitron;Sequence_ontology=SO:0000544;Identity=0.934;Method=homology"; class_id "ID=TE_homo_32101;Name=TE_00000590;Classification=DNA/Helitron;Sequence_ontology=SO:0000544;Identity=0.934;Method=homology"; gene_name "ID=TE_homo_32101;Name=TE_00000590;Classification=DNA/Helitron;Sequence_ontology=SO:0000544;Identity=0.934;Method=homology:TE";

Obviously this is not right, but I have no idea how to fix it. Below are the command and parameters I used, and my GFF3 file is attached. I look forward to your reply and am very grateful for your help!

The command and parameters I used: nohup perl makeTEgtf.pl -c 1 -s 4 -e 5 -o 7 -n 2 -t 9 -f 9 -C 9 -S 6 -1 Bombus_terrestris.fa.mod.EDTA.TEanno.gff3 > log_makeGTF & [1] 21091


ONAgaganb commented 8 months ago

Bombus_terrestris.fa.mod.EDTA.TEanno.gff3.gz log_makeGTF.gz

olivertam commented 8 months ago

Hi,

Thank you for your interest in the software.

Unfortunately, our Perl script is not designed for raw GFF3, as the attributes field (column 9) is too variable to parse easily. You need to preprocess column 9 so that the TE name, class and family end up in separate columns:

# \t is read as a tab by GNU sed; with other seds, type a literal tab instead.
# cut splits on tabs by default, so no -d option is needed.
$ sed 's/;/\t/g' Bombus_terrestris.fa.mod.EDTA.TEanno.gff3 | \
     cut -f 1-8,10,11 | \
     sed 's/Name=//;s/Classification=//;s/\//\t/' | \
     awk 'BEGIN{FS="\t";OFS="\t"}; $1~/^#/; $1!~/^#/ && NF==11; $1!~/^#/ && NF<11{$11=$10;print}' \
     > preprocessed.txt

$ perl makeTEgtf.pl  -c 1 -s 4 -e 5 -o 7 -n EDTA -t 9 -f 11 -C 10 -S 6 -1 preprocessed.txt \
     > Bombus_terrestris.fa.mod.EDTA.TEanno.gtf
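
The pipeline above picks the name and classification by position, so it relies on Name= and Classification= being the second and third attributes in column 9. If the attribute order ever differs, a key-based extraction may be more robust. The following is only a sketch under that assumption, not part of makeTEgtf.pl: the function name gff3_to_te_table is hypothetical, and unlike the pipeline above it drops comment lines instead of passing them through. It writes columns 1-8 followed by name (9), class (10) and family (11), which matches the -t 9 -C 10 -f 11 options used above.

```shell
# Hypothetical helper (not part of TEtranscripts): extract Name= and
# Classification= from GFF3 column 9 by key rather than by position.
gff3_to_te_table() {
    awk 'BEGIN { FS = "\t"; OFS = "\t" }
    /^#/ { next }                               # drop comment/header lines
    NF >= 9 {
        name = ""; cls = ""
        n = split($9, attrs, ";")
        for (i = 1; i <= n; i++) {
            if (attrs[i] ~ /^Name=/)           name = substr(attrs[i], 6)
            if (attrs[i] ~ /^Classification=/) cls  = substr(attrs[i], 16)
        }
        if (name == "" || cls == "") next       # skip records missing either key
        slash = index(cls, "/")
        if (slash > 0) {
            te_class  = substr(cls, 1, slash - 1)
            te_family = substr(cls, slash + 1)
        } else {
            te_class  = cls                     # no "/": reuse the class as family
            te_family = cls
        }
        print $1, $2, $3, $4, $5, $6, $7, $8, name, te_class, te_family
    }' "$1"
}
```

Usage would be along the lines of: gff3_to_te_table Bombus_terrestris.fa.mod.EDTA.TEanno.gff3 > preprocessed.txt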

You will get multiple warnings about skipped lines. Those entries have no strand information (. instead of either + or -), which would confuse the software when it handles stranded RNA-seq libraries.
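
To get a rough idea of how many entries are affected before running the conversion, the unstranded records can be counted directly. This is a minimal sketch, assuming the strand is in column 7 (as passed to -o 7 above); count_unstranded is a hypothetical helper name, and the count may not match the number of warnings exactly if the script skips lines for other reasons.

```shell
# Hypothetical helper: count non-comment records whose strand (column 7)
# is neither "+" nor "-".
count_unstranded() {
    awk 'BEGIN { FS = "\t" }
         $1 !~ /^#/ && NF >= 7 && $7 != "+" && $7 != "-" { c++ }
         END { print c + 0 }' "$1"
}
```

For example: count_unstranded preprocessed.txt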

I have attached the preprocessed text file and the GTF here.

Thanks

ONAgaganb commented 8 months ago

Thank you very much for your help! It works successfully now, thanks again!

github-actions[bot] commented 7 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days