jorvis / biocode

Bioinformatics code libraries and scripts
MIT License

Formatting Issue? #72

Open remiketchum opened 3 years ago

remiketchum commented 3 years ago

Hi,

I am currently trying to convert a gtf or gff3 file output from Augustus to a format that will be read by PASA. However, when I try to run this command:

convert_augustus_to_gff3.py -i augustus.hints.gtf -o test.gff3

I consistently get an error code like this one:

File "/users/rketchu1/.local/bin/convert_augustus_to_gff3.py", line 133, in main
    raise Exception("ERROR: GTF detected but gene row has bad 9th column format: {0}".format(cols[8]))
Exception: ERROR: GTF detected but gene row has bad 9th column format: jg33035

I have rerun Augustus with the -gff3 flag and tried the same command on the resulting gff3 file, but I get the following error:

Traceback (most recent call last):
  File "/users/rketchu1/.local/bin/convert_augustus_to_gff3.py", line 189, in <module>
    main()
  File "/users/rketchu1/.local/bin/convert_augustus_to_gff3.py", line 173, in main
    raise Exception("ERROR: Found CDS column with parent ({0}) mRNA not yet in the file".format(parent_id))
Exception: ERROR: Found CDS column with parent (jg33035.t1) mRNA not yet in the file

I'm not entirely sure what is wrong with the file formatting. I am running Augustus 3.4.0 through Braker 2.1.5.

jorvis commented 3 years ago

Hmm, the documentation of the convert_augustus_to_gff3.py script includes an example gene block from Augustus. Can you see how yours compares with that structure? Maybe the Augustus output format has been updated. Regarding the second error, after you used the -gff3 flag, it seems the features are out of order, with some being referenced before they are actually defined. This may be another Augustus issue, but it can possibly be corrected with this script:

biocode/gff/correct_gff_feature_order.pl

Could you possibly copy/paste a block representing an entire gene's rows here into the ticket?
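
A quick way to check for the ordering problem described above is to scan the GFF3 file and report any row whose Parent has not yet appeared as an ID. The following is only a minimal sketch under assumptions: the function names are hypothetical, it is not part of biocode, and it does not reproduce what correct_gff_feature_order.pl does. It assumes tab-delimited rows with nine columns and skips anything else.

```python
def parse_gff3_attributes(column9):
    """Split a GFF3 9th column such as 'ID=a;Parent=b;' into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def find_undefined_parents(lines):
    """Return (line_number, feature_type, parent_id) for every row that
    names a Parent whose ID has not been seen on an earlier row."""
    seen_ids = set()
    problems = []
    for line_no, line in enumerate(lines, start=1):
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = parse_gff3_attributes(cols[8])
        # A feature may list several parents separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent and parent not in seen_ids:
                problems.append((line_no, cols[2], parent))
        if "ID" in attrs:
            seen_ids.add(attrs["ID"])
    return problems
```

To run it on a file, something like `find_undefined_parents(open("augustus.hints.gff3"))` would do (the file name here is a placeholder). An empty list means every Parent was defined before it was used.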

remiketchum commented 3 years ago

The first GTF that I get an error on (I put the - in front of the # just to copy paste below):

-# overlap start --------------------------------------------------------------------------------
-# this overlap has 1 different transcripts
-# This transcript jg33035.t1 is derived from g24141.t1 from the input file /scratch/rketchu1/Dovetail_Genome_EM/ANNOTATION/BRAKER3/augustus.Ppri5.gtf
-# It is supported by 0 other predicted genes
-# the core of this joined transcript has priority 2
spez_2 AUGUSTUS gene 44188381 44210443 . + . jg33035
spez_2 AUGUSTUS transcript 44188381 44210443 . + . transcript_id "jg33035.t1"; gene_id "jg33035"
spez_2 AUGUSTUS start_codon 44188381 44188383 . + 0 transcript_id "jg33035.t1"; gene_id "jg33035";
spez_2 AUGUSTUS CDS 44188381 44188722 1 + 0 transcript_id "jg33035.t1"; gene_id "jg33035";
spez_2 AUGUSTUS exon 44188381 44188722 . + . transcript_id "jg33035.t1"; gene_id "jg33035";
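
The first error message prints the gene row's 9th column as jg33035, a bare ID, while the other rows in this block use key "value" pairs. To see which style each feature type uses across a whole file, a rough sketch like the one below may help. The names are hypothetical and this is not how convert_augustus_to_gff3.py parses its input; it only labels the 9th column as a bare ID, GTF-style pairs, or GFF3-style key=value fields, assuming tab-delimited rows.

```python
import re

# One GTF-style attribute: a key followed by a double-quoted value.
GTF_PAIR = re.compile(r'^\s*(\w+)\s+"([^"]*)"\s*$')


def classify_column9(column9):
    """Label a 9th column as 'gff3', 'gtf', 'bare_id', 'empty' or 'unknown'."""
    fields = [f for f in column9.strip().split(";") if f.strip()]
    if not fields:
        return "empty"
    if all("=" in f for f in fields):
        return "gff3"
    if all(GTF_PAIR.match(f) for f in fields):
        return "gtf"
    if len(fields) == 1 and " " not in fields[0].strip():
        return "bare_id"
    return "unknown"


def column9_styles_by_feature(lines):
    """Map each feature type (column 3) to the set of 9th-column styles seen."""
    styles = {}
    for line in lines:
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        styles.setdefault(cols[2], set()).add(classify_column9(cols[8]))
    return styles
```

Comparing the result for this file against the example gene block in the script's documentation should show which feature types differ.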

The gff3 file looks like this:

spez_2 AUGUSTUS gene 44188381 44210443 . + . ID=jg33035;
spez_2 AUGUSTUS mRNA 44188381 44210443 . + . ID=jg33035.t1;Parent=jg33035;
spez_2 AUGUSTUS start_codon 44188381 44188383 . + 0 ID=jg33035.t1.start1;Parent=jg33035.t1;
spez_2 AUGUSTUS CDS 44188381 44188722 1 + 0 ID=jg33035.t1.CDS1;Parent=jg33035.t1;
spez_2 AUGUSTUS exon 44188381 44188722 . + . ID=jg33035.t1.exon1;Parent=jg33035.t1;
spez_2 AUGUSTUS intron 44188723 44189267 . + . ID=jg33035.t1.intron1;Parent=jg33035.t1;
spez_2 AUGUSTUS CDS 44189268 44189333 1 + 0 ID=jg33035.t1.CDS2;Parent=jg33035.t1;
spez_2 AUGUSTUS exon 44189268 44189333 . + . ID=jg33035.t1.exon2;Parent=jg33035.t1;
spez_2 AUGUSTUS intron 44189334 44189619 . + . ID=jg33035.t1.intron2;Parent=jg33035.t1;
spez_2 AUGUSTUS CDS 44189620 44189671 1 + 0 ID=jg33035.t1.CDS3;Parent=jg33035.t1;

I definitely see some differences, but I'm not sure how to resolve the issue.

I am trying to locate correct_gff_feature_order.pl, but it looks like my install of biocode is missing this script.