ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline

Running pgap on a novel genus #164

Closed tgurbich closed 3 years ago

tgurbich commented 3 years ago

Hello! I was wondering if there is a way to use pgap to annotate a novel genus that is not present in the NCBI taxonomy? I understand that one way would be to first have the taxon registered but wanted to see whether pgap can be used without this step, perhaps using a higher taxonomic rank instead?

azat-badretdin commented 3 years ago

> perhaps using a higher taxonomic rank instead?

Feel free to try it. We use taxonomy in a couple of places: QC of the input genome (too small/too large), determination of the genetic code (11 or 4), which is important for protein products, etc.

thibaudnis commented 3 years ago

You can pick a genus that you think may be related and run pgap.py with the --taxcheck-only flag first, to run the Average Nucleotide Identity (ANI) tool. ANI will hopefully be able to make a real taxonomic assignment for the assembly, based on the type strain assemblies in GenBank (see more details at https://github.com/ncbi/pgap/wiki/Taxonomy-Check).
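As a rough sketch of the suggestion above, the snippet below assembles a taxonomy-check-only command line. Only `pgap.py` and `--taxcheck-only` come from this thread; the `-o` output option, the positional input YAML, and the file names are assumptions to verify against `pgap.py --help` for your version.

```python
import shlex


def build_taxcheck_command(input_yaml, output_dir, pgap_script="./pgap.py"):
    """Return the argument list for a taxonomy-check-only PGAP run."""
    return [
        pgap_script,
        "--taxcheck-only",  # flag named in this thread
        "-o", output_dir,   # assumption: output directory option
        input_yaml,         # assumption: input YAML passed positionally
    ]


# Hypothetical file and directory names, for illustration only.
cmd = build_taxcheck_command("input.yaml", "taxcheck_results")
print(shlex.join(cmd))
# To launch it for real: subprocess.run(cmd, check=True)
```

Building the list first and printing it makes it easy to review the exact invocation before running it on many assemblies.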

tgurbich commented 3 years ago

Thank you very much for the suggestions. I am trying the --taxcheck-only flag and will test how well this works for these cases. Do you have any advice on how to fill out the yaml file if a real taxonomic assignment is not possible? Would the best option be to proceed with the closest known genus?

thibaudnis commented 3 years ago

Yes, the closest genus would be your best option. The taxonomic assignment is needed for choosing two things: the genetic code, and the set of proteins to align. Specifying the genus is sufficient for both.
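For illustration, here is a minimal sketch that renders the two YAML files with only a genus filled in. The field names (`fasta`, `submol`, `organism`, `genus_species`, `strain`) follow the layout of the PGAP example inputs as I understand them and should be checked against the PGAP wiki; the genus, strain, and file names are stand-ins, not recommendations.

```python
def render_pgap_yaml(fasta_name, genus, strain="my_strain"):
    """Render input.yaml and submol.yaml text for a genus-level assignment.

    Assumption: key names mirror the PGAP example input files; verify them
    against the PGAP wiki before use.
    """
    input_yaml = (
        "fasta:\n"
        "  class: File\n"
        f"  location: {fasta_name}\n"
        "submol:\n"
        "  class: File\n"
        "  location: submol.yaml\n"
    )
    submol_yaml = (
        "organism:\n"
        f"  genus_species: '{genus}'\n"  # closest known genus, no species
        f"  strain: '{strain}'\n"
    )
    return input_yaml, submol_yaml


# "Escherichia" is only a placeholder for whichever genus is closest.
input_text, submol_text = render_pgap_yaml("genome.fasta", "Escherichia")
print(input_text)
print(submol_text)
```

The two strings can then be written to `input.yaml` and `submol.yaml` next to the FASTA file.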

tgurbich commented 3 years ago

Ok, I will try that. Thank you very much for your help!

hdore commented 3 years ago

Hello,

I have an additional question regarding this issue. I understand that pgap needs a genus assignment, and that it is needed for choosing the genetic code, and the set of proteins to align. In my case, I have a circular MAG (this is a polished assembly of long reads). I have tried to give it a taxonomic assignment using gtdbtk, which could only assign it to the class level using pplacer (there was no reference close enough in their database to compute ANI and aligned fraction): d__Bacteria;p__Calditrichota;c__Calditrichia;o__;f__;g__;s__
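When automating this over many MAGs, it helps to find the lowest rank that actually received a name. The sketch below parses a GTDB-style lineage string, where each field is a one-letter rank prefix, a double underscore, and an optional name; the function names are made up for this example.

```python
# GTDB rank prefixes, from domain down to species.
RANKS = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def parse_gtdb_lineage(lineage):
    """Map each rank to its name, or None when the rank is unassigned."""
    result = {}
    for field in lineage.split(";"):
        prefix, _, name = field.strip().partition("__")
        if prefix in RANKS:
            result[RANKS[prefix]] = name or None
    return result


def lowest_assigned_rank(lineage):
    """Return (rank, name) for the most specific named rank, or None."""
    assigned = [(r, n) for r, n in parse_gtdb_lineage(lineage).items() if n]
    return assigned[-1] if assigned else None


lineage = "d__Bacteria;p__Calditrichota;c__Calditrichia;o__;f__;g__;s__"
print(lowest_assigned_rank(lineage))  # ('class', 'Calditrichia')
```

A result above genus level signals that the genus passed to pgap will be a stand-in and should be recorded as such.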

I could randomly choose a genus within this whole class, but since this would change the set of proteins to align, I'm worried that it could change the output annotation.

My question is: how would you proceed at NCBI to annotate such a genome if it was submitted? What would be the set of proteins chosen?

Thank you,

Hugo Doré

azat-badretdin commented 3 years ago

Hi, Hugo.

The most important parameters are the genetic code, superkingdom, and order. If your arbitrary genus assignment matches the desired set, the job is done.

hdore commented 3 years ago

Hello,

Thank you for your quick reply. Would it not make sense to lower the constraint of providing a genus then?

I also tried to use taxcheck-only on some other genomes, and it gave me unexpectedly low (Query coverage, Subject coverage) values (around 1% or below 1%), where I would expect much higher values. But maybe this belongs to another ticket. In addition, even if the status of taxcheck is "inconclusive", it outputs a "Predicted organism" and seems to choose the one with highest ANI even if the "coverage" value is very low, which I find surprising.

azat-badretdin commented 3 years ago

> Would it not make sense to lower the constraint of providing a genus then?

For taxonomic checks it's important.

> But maybe this belongs to another ticket.

That would help us to keep Issues more organized, yes. Thanks!

thibaudnis commented 3 years ago

The genetic code varies by genus. Different genera in the same family may have different genetic codes. That said, the most common genetic code is 11. If you selected a genus in the order you have pinpointed (can you do that?), chances are high that the annotation would turn out fine.

Did you obtain these ANI results with MAGs or with fully characterized genomes?

hdore commented 3 years ago

Thank you for the additional information/comment on genetic code. I can select a genus in the same class as my genome's. For cases where gtdbtk indicates the NCBI assembly number of the closest reference, I can use that. But for cases like the one above, where gtdbtk does not give the closest reference, this will be a fairly arbitrary genus, since I'm trying to automate the process for a number of MAGs.

I'll open a new ticket for the issues regarding taxcheck (these are MAGs, but very closely related to some NCBI genomes).

silvtal commented 2 years ago

Hi, I'm not sure if this is the best place to ask about this but I felt that it didn't need its own thread

So, most tools use only 2 or 3 genetic codes - GMS2 uses 3, RAST uses 2... But according to for example this page, there are many more. For example, the Candidate Division one includes a difference in stop codons.

In this thread you talk about only two codes too, 11 and 4. So I was wondering - what happens if I want to annotate a Gracilibacteria genome? Which steps are affected by the genetic code choice, and how?

azat-badretdin commented 2 years ago

The best way to answer your question about the Candidatus Gracilibacteria phylum is to try one of the taxa under it.

As for the steps: the version of tRNAscan that we use does not recognize code 25 right now, and it will default to one of the standard codes (1 or 11).
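To see why the code choice matters for protein products, here is a small self-contained sketch, not PGAP's own implementation, comparing how translation tables 11, 4, and 25 read the codon TGA: a stop in 11, tryptophan in 4, glycine in 25. Only that one difference is modeled; start-codon differences between tables are ignored, and the input sequence is invented.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard code, with codons ordered TTT, TTC, TTA, ...
STANDARD_AAS = (
    "FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG"
)
BASE_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), STANDARD_AAS)
}

# How each translation table reads TGA (the only difference modeled here).
TGA_MEANING = {11: "*", 4: "W", 25: "G"}


def translate(seq, code):
    """Translate seq in frame 0, stopping at the first stop codon."""
    table = dict(BASE_TABLE, TGA=TGA_MEANING[code])
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = table[seq[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


seq = "ATGAAATGAGGCTAA"  # invented example: ATG AAA TGA GGC TAA
for code in (11, 4, 25):
    print(code, translate(seq, code))
# 11 MK
# 4 MKWG
# 25 MKGG
```

Under code 11 the protein is truncated at the internal TGA, which is how a wrong code assignment can shorten or split predicted genes.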

silvtal commented 2 years ago

I see, I'll keep it in mind then. Thank you for answering!