ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline

[BUG] dnaA truncated by PGAP gene calling #252

Closed tuspjo closed 1 year ago

tuspjo commented 1 year ago

Describe the bug

While depositing some bacterial genomes, I noticed that 9 of them came back with a /pseudo tag on the dnaA gene after PGAP annotation. They all have this field: /note="incomplete; partial on complete genome; missing N-terminus; Derived by automated computational analysis using gene prediction method: Protein Homology." However, these genes were not truncated in a Prodigal annotation of the same genomes, and the closest database reference determined by autoMLST also has the full dnaA sequence (see AA alignment below).

[image: AA alignment]

For most of the strains with the /pseudo tag, Streptomyces niveus strains have the highest %ANI from autoMLST (Streptomyces_niveus_GCF_002009175, Streptomyces_niveus_NCIMB_11891_GCF_000497425).

The submitted genomes are not yet publicly available, but I can supply you with the GenBank files if necessary.

Since this annotation was performed at NCBI, I don't have the log files and software versions requested in the bug report form. The issue is also not consistent: it affects 9 genomes, while many of the other Streptomyces genomes I deposited do not have the /pseudo tag on dnaA.
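For anyone who wants to check their own returned annotation for the same symptom, here is a minimal sketch in standard-library Python that scans GenBank flat-file text for CDS features carrying both /gene="dnaA" and /pseudo. The function names `iter_features` and `flag_pseudo_cds` are made up for this illustration and are not part of PGAP. The parser is deliberately simple: it assumes the usual flat-file layout (feature key at column 6, qualifiers on indented lines) and is not a substitute for a full parser.

```python
import re

# A feature line has exactly five leading spaces, then the key, then the location.
FEATURE_LINE = re.compile(r"^ {5}(\S+)\s+(\S.*)$")

def iter_features(genbank_text):
    """Yield (key, location, qualifier_lines) for each feature in the text."""
    key, location, quals = None, None, []
    in_table = False
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_table = True
            continue
        if not in_table:
            continue
        match = FEATURE_LINE.match(line)
        if match or not line.startswith(" "):
            # A new feature or the end of the table closes the current feature.
            if key is not None:
                yield key, location, quals
            key, location, quals = None, None, []
            if match:
                key, location = match.group(1), match.group(2).strip()
            else:
                in_table = False
        else:
            # Qualifier line (or a continuation of the previous one).
            quals.append(line.strip())
    if key is not None:
        yield key, location, quals

def flag_pseudo_cds(genbank_text, gene="dnaA"):
    """Return locations of CDS features for `gene` that carry /pseudo."""
    hits = []
    for key, location, quals in iter_features(genbank_text):
        if key != "CDS":
            continue
        if f'/gene="{gene}"' in quals and "/pseudo" in quals:
            hits.append(location)
    return hits
```

Usage would be along the lines of `flag_pseudo_cds(open("my_genome.gbk").read())`, where the file name is a placeholder. An empty list means no dnaA CDS with /pseudo was found in that file.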

Best

Tue Sparholt Jørgensen

azat-badretdin commented 1 year ago

Thank you, Tue, for your report. We will investigate this issue in an internal ticket.

tuspjo commented 1 year ago

Great, please let me know if I can be of any help. I'm thinking this could be relevant information: several more genomes have a similarly suspicious gene calling, without the /pseudo tag:

[image]
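Gene calls that look shortened but carry no /pseudo tag can be screened by comparing the protein length implied by the CDS location with the length of a trusted reference dnaA. The sketch below is illustrative only: `cds_length_aa` and `looks_truncated` are hypothetical helper names, the reference length must be supplied by the user (for example from the closest reference genome), and the 0.9 cutoff is an arbitrary assumption rather than anything PGAP uses.

```python
import re

# Matches simple locations such as "123..1500" or "complement(<123..1500)".
SIMPLE_LOCATION = re.compile(r"(?:complement\()?[<>]?(\d+)\.\.[<>]?(\d+)\)?")

def cds_length_aa(location):
    """Protein length implied by a simple CDS location.

    Assumes a complete CDS that ends in a stop codon; joined
    locations (join(...)) are not handled in this sketch.
    """
    match = SIMPLE_LOCATION.fullmatch(location)
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    start, end = int(match.group(1)), int(match.group(2))
    nucleotides = end - start + 1
    return nucleotides // 3 - 1  # subtract the stop codon

def looks_truncated(location, reference_aa, min_fraction=0.9):
    """True if the CDS is shorter than min_fraction of the reference length."""
    return cds_length_aa(location) < min_fraction * reference_aa
```

A call such as `looks_truncated("1..903", reference_aa=600)` flags a 300-residue call against a 600-residue reference; both numbers here are invented for the example.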

azat-badretdin commented 1 year ago

I'm thinking this could be relevant information

Agreed. Thanks!

azat-badretdin commented 1 year ago

Could you please post some of the input genomes?

tuspjo commented 1 year ago

I can't post them here unfortunately, as they are "embargoed" but I can send a safe download link to your email address? Do you want only the ones with /pseudo or also some of the ones with the same gene calling but not /pseudo on dnaA?

azat-badretdin commented 1 year ago

as they are "embargoed" but I can send a safe download link to your email address?

Sure.

Do you want only the ones with /pseudo or also some of the ones with the same gene calling but not /pseudo on dnaA?

The more examples the better.

Thanks!

tuspjo commented 1 year ago

Great, I'll collect the genomes and send you a link tomorrow (Wednesday). Best, Tue

azat-badretdin commented 1 year ago

Looking forward to it, Tue!

azat-badretdin commented 1 year ago

I got the genomes, Tue, thanks!

azat-badretdin commented 1 year ago

Unfortunately, these are output data, not the input data. We need the input FASTA files.

tuspjo commented 1 year ago

Dear Azat, I've sent the input genomes in FASTA format. Did you receive them? Best, Tue

azat-badretdin commented 1 year ago

Thanks, Tue! Not yet. So far I got only the original package from 4/5. The data goes through a different group, they will notify us when it comes.

tuspjo commented 1 year ago

OK. I sent a link to the FASTA files on Tuesday, so hopefully they will make their way to you soon.

azat-badretdin commented 1 year ago

Tue, judging by the output you sent us in the first tarball, it looks like you did not use standalone PGAP for these annotations, but the GenBank submission service. Could you please confirm?

tuspjo commented 1 year ago

Yes, that is correct. The annotation was performed at NCBI, not by the standalone PGAP CLI.

azat-badretdin commented 1 year ago

Thank you for confirming, Tue. That explains the confusion.

tuspjo commented 1 year ago

Hi again,

The genomes I sent in PGAP input format (FASTA with info in the header) weren't received. Do you want me to re-upload them, and how do I get the download link to you if genomes@ncbi.nlm.nih.gov is not a good channel? Best,

Tue

azat-badretdin commented 1 year ago

Since they were submitted via GenBank, we have the input data already. Thanks!

tuspjo commented 9 months ago

A quick follow-up on this, in case anyone stumbles on this bug report: the dnaA gene calling was modified/improved, which resolved all the observed issues. The complete dnaA genes are now identified rather than partial genes.

azat-badretdin commented 9 months ago

Thank you, Tue!