Personal GTF file error

AlBaarS commented 3 years ago

Hello

I am using a custom GTF file for a fungus, but whatever I do, I cannot get TEtranscripts to accept the file.

I have read issues #21 and #6, and I followed the instructions in issue #21, adding class_id and family_id with a Python script, as those fields were not present at first. However, even after adding them, I still get the error.

Here is the first line of my (final) GTF file:

HiC_scaffold_1 EDTA transposon 3172 3272 . . . gene_id "LTR/Copia"; ID "851"; Name "98"; thickStart "LTR/Copia"; family_id "LTR/Copia"; class_id "LTR/Copia";

I had to do several conversion steps, as my original output was a BED file, which I converted (with AGAT) to GFF3 and then to GTF, before finally adding the class_id and family_id.

Here is the same line of my (original) BED file:

HiC_scaffold_1 3171 3272 98 C RLX-incomp-chim_C7_LTR LTR/Copia

And the converted GFF3 file:

HiC_scaffold_1 EDTA transposon 3172 3272 . TE_00000160_LTR . ID=851;Name=98;thickStart=LTR/Copia

Across the conversion/modification steps (3 in total), I don't see any information disappearing; however, AGAT inflates some info. Does this cause the error? Or is something else wrong? Do you have any tips for fixing it?

Thanks in advance :)

Alejandro

EDIT: While it looks like spaces, it is tab separated (triple-checked)!

olivertam commented 3 years ago

Hi Alejandro,

Thank you for your interest in the software. I noticed that the third column is transposon. It should be exon, as this is the feature that would be recognized by the software to use as annotation. It is also missing the transcript_id field. It is also interesting that there is no standard strand information (e.g. + or -) for your TE, though I suspect that your BED file column 5 (with the C) probably refers to Watson/Crick strand.
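
For anyone hitting the same error, here is a minimal Python sketch that checks a GTF line for the problems described above. It is not part of TEtranscripts: the function name `check_te_gtf_line` and the `REQUIRED_ATTRS` list are hypothetical, and the list is assumed from the fields mentioned in this thread (gene_id, transcript_id, family_id, class_id). The strand check follows the remark above about the missing + or - and may be stricter than the software itself.

```python
import re

# Assumption: these are the attributes a TE GTF needs, as mentioned in this thread.
REQUIRED_ATTRS = ("gene_id", "transcript_id", "family_id", "class_id")


def check_te_gtf_line(line):
    """Return a list of problems found in one tab-separated GTF line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return [f"expected 9 tab-separated columns, found {len(cols)}"]
    problems = []
    if cols[2] != "exon":
        problems.append(f"column 3 is '{cols[2]}', expected 'exon'")
    if cols[6] not in ("+", "-"):
        problems.append(f"column 7 (strand) is '{cols[6]}', expected '+' or '-'")
    # Parse attributes written as: key "value";
    attrs = dict(re.findall(r'(\S+) "([^"]*)"', cols[8]))
    for name in REQUIRED_ATTRS:
        if name not in attrs:
            problems.append(f"attribute '{name}' is missing")
    return problems


# The first line of the GTF from the original post, rebuilt with real tabs.
line = "\t".join([
    "HiC_scaffold_1", "EDTA", "transposon", "3172", "3272", ".", ".", ".",
    'gene_id "LTR/Copia"; ID "851"; Name "98"; thickStart "LTR/Copia"; '
    'family_id "LTR/Copia"; class_id "LTR/Copia";',
])
for problem in check_te_gtf_line(line):
    print(problem)
```

Run on the line from the original post, this reports the feature type, the strand, and the missing transcript_id.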

In your case, I would recommend starting from your BED file and generating your TE GTF with the following information:

GTF column 1 = BED column 1 (chrom)
GTF column 2 = source of your TE annotation/prediction (can be same on every line)
GTF column 3 = exon
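GTF column 4 = BED column 2 (start). Standard BED starts are 0-based while GTF starts are 1-based, so add 1 if your BED follows that convention (your GFF3 conversion turned 3171 into 3172).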
GTF column 5 = BED column 3 (end)
GTF column 6 = your column 4? (score)
GTF column 7 = - if BED column 5 = C, and + otherwise (I can't tell whether they use W or a dot (.) for the other strand)
GTF column 8 = . (frame)
GTF column 9
- gene_id = BED column 6 (e.g. RLX-incomp-chim_C7_LTR)
- transcript_id = a unique ID (we typically add _copyX to the gene_id value, where X could be the line number of the file)
- family_id = BED column 7 (e.g. LTR/Copia). Ideally, we would prefer to use Copia for family_id, but unless you know that there are always two values in this column, separated by a /, then it might be safer to keep this combined.
- class_id = BED column 7. Again, we would prefer to use LTR for class_id, but safer to keep it combined.
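
The mapping above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the script used by the TEtranscripts authors: the function name `bed_to_te_gtf` and the default source value "EDTA" are hypothetical, and the input is assumed to be a 7-column, tab-separated BED like the one in the original post. It departs from the list in one place: it adds 1 to the start, assuming a standard 0-based BED start, which matches the 3171 to 3172 shift seen in the GFF3 conversion earlier in the thread.

```python
def bed_to_te_gtf(bed_lines, source="EDTA"):
    """Yield GTF lines built from 7-column BED lines, following the mapping above."""
    for line_number, line in enumerate(bed_lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        chrom, start, end, score, strand, te_name, te_class = line.split("\t")[:7]
        # Assumption: the BED start is 0-based (standard BED), so add 1 for GTF.
        gtf_start = str(int(start) + 1)
        # Assumption: "C" marks the minus strand; anything else is treated as plus.
        gtf_strand = "-" if strand == "C" else "+"
        # family_id and class_id both keep the combined value (e.g. LTR/Copia).
        attributes = (
            f'gene_id "{te_name}"; '
            f'transcript_id "{te_name}_copy{line_number}"; '
            f'family_id "{te_class}"; '
            f'class_id "{te_class}";'
        )
        yield "\t".join([chrom, source, "exon", gtf_start, end, score,
                         gtf_strand, ".", attributes])


# The BED line from the original post, rebuilt with real tabs.
bed = ["HiC_scaffold_1\t3171\t3272\t98\tC\tRLX-incomp-chim_C7_LTR\tLTR/Copia"]
gtf = list(bed_to_te_gtf(bed))
print(gtf[0])
```

To convert a whole file, pass an open file handle as `bed_lines` and write each yielded line, followed by a newline, to the output GTF.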

That should be sufficient to make it work.

You can also try using our TE GTF generating script (unzip to use):

Change values in BED column 5 to + or -

Run the TE GTF script using the following parameters:

$ perl makeTEgtf.pl -c 1 -s 2 -e 3 -o 5 -t 6 -f 7 -C 7 [your BED file] > [TE GTF]
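
The first step (changing BED column 5 to + or -) could be done with a small Python helper like the one below before running the Perl script. The function name `normalise_strand_column` is hypothetical, and the rule is an assumption taken from this thread: C becomes -, existing + or - values are kept, and anything else (e.g. W or a dot) becomes +.

```python
def normalise_strand_column(bed_line, strand_index=4):
    """Return the BED line with its strand column rewritten as '+' or '-'."""
    cols = bed_line.rstrip("\n").split("\t")
    value = cols[strand_index]
    if value not in ("+", "-"):
        # Assumption: "C" means minus strand; every other value means plus.
        cols[strand_index] = "-" if value == "C" else "+"
    return "\t".join(cols)


fixed = normalise_strand_column(
    "HiC_scaffold_1\t3171\t3272\t98\tC\tRLX-incomp-chim_C7_LTR\tLTR/Copia"
)
print(fixed)
```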

Please let us know if you encounter any more issues. Thanks.

AlBaarS commented 3 years ago

Hi Oliver,

Thank you for the response. Using your Perl script to create my GTF helped; TEtranscripts now appears to be running fine. Thank you for sharing the script. I would like to suggest adding it as is to the toolkit, so that others who encounter the same issue in the future can solve it immediately :)

olivertam commented 3 years ago

Hi Alejandro,

Thank you for your suggestion. The script is technically in beta (and in another language), so we decided not to include it with the toolkit for now. We prefer to recommend it on a case-by-case basis, as it is heavily dependent on what is used as the input.

Thanks.

mhammell-laboratory / TEtranscripts

Personal GTF file error #90