oschwengers / bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids
GNU General Public License v3.0

Revise truncated pseudo attributes #333

Closed oschwengers closed 1 month ago

oschwengers commented 1 month ago

Bakta and the various standard output formats (Genbank, EMBL, GFF3) use slightly different terms and approaches for declaring truncated genes and pseudogenes.

In Bakta, a feature is declared as truncated if a downstream analysis tool, e.g. Pyrodigal or Infernal, reports it as such.

Besides these, Bakta accepts true pseudogenes from tRNAscan-SE and from its own internal CDS workflow.

To strictly follow the INSDC specs for the Genbank, EMBL and GFF3 output files (#330), Bakta now declares all truncated features as pseudo, reflecting technical issues like sequencing and assembly errors, on the one side, and true pseudogenes, emerging from biological pseudogenization events like InDels and mutations, on the other side.

Internally, Bakta uses truncated and pseudogene attributes to reflect the different states. In the human-readable TSV output file (meant for a quick glimpse), Bakta adds the feature product prefixes (pseudo), (truncated), (5' truncated) and (3' truncated).
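
To illustrate the prefix logic described above, here is a minimal sketch. It is not Bakta's actual code: the function name `tsv_product`, the attribute values `5-prime`, `3-prime` and `both`, and the rule that a pseudogene takes precedence over a truncation are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical values for the internal 'truncated' attribute.
# Bakta's real encoding may differ; only the prefixes come from this issue.
TRUNCATED_PREFIXES = {
    "5-prime": "(5' truncated)",
    "3-prime": "(3' truncated)",
    "both": "(truncated)",
}


def tsv_product(product: str,
                truncated: Optional[str] = None,
                pseudogene: bool = False) -> str:
    """Return the product string with a state prefix for the TSV output."""
    if pseudogene:
        # Assumption: a true pseudogene takes precedence over a truncation.
        return f"(pseudo) {product}"
    if truncated is not None:
        return f"{TRUNCATED_PREFIXES[truncated]} {product}"
    return product


print(tsv_product("DNA polymerase III subunit alpha", truncated="5-prime"))
print(tsv_product("transposase", pseudogene=True))
print(tsv_product("16S ribosomal RNA", truncated="both"))
```

Keeping the two internal states separate and only deriving the prefix at output time means the TSV file can stay descriptive while the Genbank, EMBL and GFF3 writers apply their own pseudo declaration.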