ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline

Annotation Discrepancy #116

Closed ChinPok23 closed 3 years ago

ChinPok23 commented 3 years ago

Hi, PGAP is an amazing tool! However, at least for me, it is sometimes hard to understand the reasoning behind its decisions. In MDS42, an older annotation listed two neighboring, associated CDSs as separate features; now the region is annotated as a single feature with a frameshift in it. Why did the annotation change, what is the reasoning behind it, and can it be changed with a different input option when running PGAP? For comparison, the old annotation: predicted 6-phospho-beta-glucosidase (pseudogene) / predicted protein (pseudogene)

[Image: Old_Annotation]

New Annotation: inactive 6-phospho-alpha-glucosidase

[Image: New_Annotation]

thibaudnis commented 3 years ago

Thank you for the feedback. I am happy to hear that PGAP is helpful. Small changes from one version of the software to the next are expected. No automated annotation pipeline is perfect, and we are constantly making incremental improvements to the software. Our subject matter experts also modify the data used to support the annotation process (in particular, the proteins in the custom BLAST databases and the HMMs), so it is not surprising that you see changes over time.

For this reason, we recommend that you use the same version of PGAP for all the genomes you will be comparing to each other. If there is one version of PGAP that you particularly like, you can add --use-version <release>. For example, add --use-version 2020-07-09.build4716 to the pgap.py command to use release 2020-07-09.build4716. Our software and data are getting better over time, so we don't recommend using older versions unless you really need the stability.

That said, I will inquire about the change in the 6-phospho-alpha-glucosidase that you found.
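The advice to pin one release for every genome in a comparison can be sketched as a small helper that builds the pgap.py command line. Only the --use-version flag and the release string come from the comment above. The input file names, the output directory names, the use of -o, and the positional input file are assumptions for illustration, so check `pgap.py --help` for the options your release actually accepts.

```python
import shlex

# Release string quoted in the comment above.
PINNED_RELEASE = "2020-07-09.build4716"


def build_pgap_command(release, input_yaml, output_dir, script="./pgap.py"):
    """Return the argument list for one PGAP run pinned to `release`.

    Assumption: the output directory is given with -o and the input
    YAML is positional. Verify both against `pgap.py --help`.
    """
    return [script, "--use-version", release, "-o", output_dir, input_yaml]


# Hypothetical genomes to annotate with the same release.
genomes = ["genomeA", "genomeB"]
commands = [
    build_pgap_command(PINNED_RELEASE, f"{name}/input.yaml", f"{name}_results")
    for name in genomes
]

# Print the commands instead of running them, so they can be reviewed first.
for cmd in commands:
    print(shlex.join(cmd))
```

Building every command from one constant keeps the release identical across all genomes in the comparison.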

danielhhaft commented 3 years ago

The current GenBank annotation of the comparable region of the Escherichia coli str. K-12 substr. MG1655 genome

https://www.ncbi.nlm.nih.gov/nuccore/U00096.3?from=3861000&to=3862200&strand=2

shows that glvG and ysdC are treated as separate features, and AYC08251.1, equivalent to UniProtKB entry P31450 (GLVG_ECOLI), has a protein accession number suggesting treatment as a protein.

PGAP handles the situation differently. Reasoning by homology shows the whole region with glvG and ysdC aligns to intact proteins such as NP_312649.2 from Escherichia coli O157:H7 str. Sakai, leading PGAP to create a single pseudogene feature. We view the prior treatment of the region as having two genes (pseudogenes) as an artifact from looking at open reading frames as distinct features even though they represent two regions of what we consider a single pseudogene.
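The homology argument above can be illustrated with a toy example. This is not PGAP's actual algorithm, only a minimal sketch of the idea: if neighboring open reading frames align to consecutive parts of the same intact protein, they are grouped into one pseudogene. The gene names and the NP_312649.2 accession come from the comment above. The alignment coordinates, the reading frames, the `max_gap` threshold, and the unrelated ORF `orfX` with subject `OTHER_1` are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrfHit:
    orf_id: str      # label of the open reading frame
    frame: int       # reading frame (0, 1, or 2) on the coding strand
    subject: str     # accession of the intact homolog the ORF aligns to
    subj_start: int  # first aligned residue on the homolog (1-based)
    subj_end: int    # last aligned residue on the homolog


def merge_into_pseudogenes(hits, max_gap=30):
    """Group neighboring ORFs that tile the same intact homolog.

    `hits` must be in genomic order along the coding strand.
    `max_gap` (hypothetical) is the largest allowed jump, in residues,
    between the end of one alignment and the start of the next.
    """
    groups = []
    for hit in hits:
        prev = groups[-1][-1] if groups else None
        continues_same_protein = (
            prev is not None
            and hit.subject == prev.subject
            and hit.subj_start > prev.subj_start
            and hit.subj_start - prev.subj_end <= max_gap
        )
        if continues_same_protein:
            groups[-1].append(hit)
        else:
            groups.append([hit])
    return groups


def describe(group):
    """Summarize one group as a separate feature or a single pseudogene."""
    ids = "+".join(h.orf_id for h in group)
    if len(group) == 1:
        return f"{ids}: separate feature"
    frames = {h.frame for h in group}
    reason = "frameshift" if len(frames) > 1 else "internal stop"
    return f"{ids}: single pseudogene ({reason})"


# Invented coordinates: two ORFs cover consecutive parts of one homolog.
hits = [
    OrfHit("glvG", 0, "NP_312649.2", 1, 210),
    OrfHit("ysdC", 1, "NP_312649.2", 215, 440),
    OrfHit("orfX", 2, "OTHER_1", 1, 120),
]
groups = merge_into_pseudogenes(hits)
summaries = [describe(g) for g in groups]
for line in summaries:
    print(line)
```

Looking only at open reading frames would report glvG and ysdC as two features. Grouping by the shared homolog is what turns them into one feature with a frameshift.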

Over time, changes both in PGAP's algorithm and in the collection of proteins used as evidence by homology to assist structural annotation can lead to changes in program output. Small changes should be expected to occur regularly, and we expect the majority of changes to be in the direction of improved structural and functional annotation.

Please let us know if you have further questions.

ChinPok23 commented 3 years ago

Thank you very much for this in-depth explanation. It makes complete sense! Nothing to add really, except that PGAP is great, and that a feature that makes it easier to understand changes in annotation from one PGAP version to the next might be worth including. Thank you very much.
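Until such a feature exists, a user could compare two annotation outputs directly. The following is a minimal sketch, not part of PGAP: it reads CDS rows from two GFF3 texts and, for each feature that appears only in the newer annotation, lists the older features it overlaps on the same strand. The product names are taken from this thread. The sequence ID, coordinates, and IDs are invented for illustration.

```python
def read_features(gff_text, feature_type="CDS"):
    """Parse GFF3 text into (seqid, start, end, strand, product) tuples."""
    feats = []
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        attrs = dict(
            pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
        )
        feats.append(
            (cols[0], int(cols[3]), int(cols[4]), cols[6], attrs.get("product", ""))
        )
    return feats


def report_changes(old, new):
    """Pair each feature found only in `new` with the overlapping features
    found only in `old` (same sequence and strand)."""
    old_only = set(old) - set(new)
    new_only = set(new) - set(old)
    changes = []
    for feat in sorted(new_only):
        seqid, start, end, strand, _product = feat
        overlapped = sorted(
            o for o in old_only
            if o[0] == seqid and o[3] == strand and o[1] <= end and start <= o[2]
        )
        changes.append((feat, overlapped))
    return changes


# Invented coordinates; product names are the ones quoted in this thread.
OLD_GFF = (
    "##gff-version 3\n"
    "contig1\tPGAP\tCDS\t100\t700\t.\t-\t0\t"
    "ID=cds1;product=predicted 6-phospho-beta-glucosidase\n"
    "contig1\tPGAP\tCDS\t705\t1300\t.\t-\t0\tID=cds2;product=predicted protein\n"
)
NEW_GFF = (
    "##gff-version 3\n"
    "contig1\tPGAP\tCDS\t100\t1300\t.\t-\t0\t"
    "ID=cds1;product=inactive 6-phospho-alpha-glucosidase\n"
)

changes = report_changes(read_features(OLD_GFF), read_features(NEW_GFF))
for new_feat, old_feats in changes:
    print("new:", new_feat)
    for old_feat in old_feats:
        print("  replaces:", old_feat)
```

A report like this shows where two annotations differ, but not why. The reasoning still has to come from the evidence used by each release.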

Edit: actually, yes, I am interested in one more feature. In many NCBI records, yjbL ends at the first stop codon, but the new PGAP predicts a frameshift. I would like to understand the concept behind that and, more generally, on what evidence frameshifts are predicted. Thanks very much.