lh3 / miniprot

Align proteins to genomes with splicing and frameshift
https://lh3.github.io/miniprot/
MIT License
310 stars, 16 forks

GFF3 output is malformed for stop_codon features #55

Closed carsonhh closed 4 months ago

carsonhh commented 5 months ago

When run with --gff, miniprot emits CDS and stop_codon features, but the stop_codon coordinates fall outside the CDS instead of overlapping it. The GFF3 spec and the Sequence Ontology both define the stop codon as part of the CDS, so a separate stop_codon feature, when present, is expected to overlap the CDS coordinates: http://sequenceontology.org/browser/release_2.5.3/term/SO:0000316

This is the current output:

Chr01	miniprot	mRNA	132416083	132416229	216	+	.	ID=MP000276
Chr07	miniprot	CDS	132416083	132416226	216	+	0	Parent=MP000276
Chr07	miniprot	stop_codon	132416227	132416229	0	+	0	Parent=MP000276

The correct output would be as follows:

Chr01	miniprot	mRNA	132416083	132416229	216	+	.	ID=MP000276
Chr07	miniprot	CDS	132416083	132416229	216	+	0	Parent=MP000276
Chr07	miniprot	stop_codon	132416227	132416229	0	+	0	Parent=MP000276

Note that the stop codon is now part of the CDS (the CDS end moves from 132416226 to 132416229), and the separate stop_codon line overlaps the CDS for the three bases of the codon. GTF and GFF3 differ in this respect: in GTF the stop codon is excluded from the CDS, whereas in GFF3 it is included, because GFF3 follows the Sequence Ontology's rules for feature organization.
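For GFF3 files already produced with the old behavior, the adjustment described above could be applied as a post-processing step. The following is only a rough sketch, not part of miniprot: the function names `parent_of` and `include_stop_codon_in_cds` are made up for illustration, and it assumes tab-separated 9-column records, a `Parent=` attribute on both CDS and stop_codon lines, and a stop codon that directly abuts one CDS segment (a stop codon split by an intron is left untouched).

```python
def parent_of(attributes):
    """Return the value of the Parent attribute from GFF3 column 9, or None."""
    for field in attributes.split(";"):
        key, _, value = field.partition("=")
        if key == "Parent":
            return value
    return None


def include_stop_codon_in_cds(gff_lines):
    """Extend each CDS that directly abuts a stop_codon with the same Parent.

    Hypothetical helper: stop_codon lines are kept, so afterwards they
    overlap the CDS they were merged into.
    """
    rows = [line.rstrip("\n").split("\t") for line in gff_lines]

    def is_feature(fields):
        # Skip comments/directives and anything that is not a 9-column record.
        return len(fields) == 9 and not fields[0].startswith("#")

    # Map (seqid, strand, parent) -> (start, end) of the stop codon.
    stops = {}
    for f in rows:
        if is_feature(f) and f[2] == "stop_codon":
            stops[(f[0], f[6], parent_of(f[8]))] = (int(f[3]), int(f[4]))

    out = []
    for f in rows:
        if is_feature(f) and f[2] == "CDS":
            stop = stops.get((f[0], f[6], parent_of(f[8])))
            if stop is not None:
                start, end = int(f[3]), int(f[4])
                if f[6] == "+" and stop[0] == end + 1:
                    f[4] = str(stop[1])  # stop codon follows the CDS end
                elif f[6] == "-" and stop[1] == start - 1:
                    f[3] = str(stop[0])  # stop codon precedes the CDS start
        out.append("\t".join(f))
    return out
```

Passing the three lines of the current output through `include_stop_codon_in_cds` should change only the CDS end coordinate; the mRNA and stop_codon lines are returned unchanged.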

lh3 commented 4 months ago

Thanks. Fixed on GitHub HEAD.