nigyta / dfast_core

DDBJ Fast Annotation and Submission Tool

inflated partial gene detection in unknown genomes #51

Open xvazquezc opened 2 months ago

xvazquezc commented 2 months ago

Hi there, I've been running some annotation tests with DFAST on a collection of MAGs, and I noticed that in some cases a huge number of partial pseudogenes are detected, sometimes close to 20% of all called CDSs! Most of the MAGs in the collection don't belong to any known species or genus (even within the GTDB), so I tested one MAG from a known genus, adding just two known relatives, and the number of partial genes more than halved (589 vs 242)! Even though these are MAGs, the number of partial genes at the ends of contigs is relatively small (16 in this specific genome).

Is there a way to separate the detection of pseudogenes caused by frameshifts or internal stop codons from the detection of partial genes? I ask because pseudogene detection appears as a single process in the config file. I think detecting translation exceptions (selenocysteine/pyrrolysine) or frameshifts could still be globally useful, but partial gene detection seems like a slippery slope when applied to MAGs or genomes of poorly characterised lineages.

Cheers,

nigyta commented 2 months ago

Currently, the logic for pseudogene annotation and the detection of translation exceptions are closely tied to each other, so they cannot be separated.