Closed tgurbich closed 3 years ago
Feel free to try it. We use taxonomy in a couple of places: QC of the input genome (too small or too large), determination of the genetic code (11 or 4), which is important for protein products, etc.
You can pick a genus that you think may be related and run pgap.py with the --taxcheck-only flag first, to run the Average Nucleotide Identity (ANI) tool. ANI will hopefully be able to make a real taxonomic assignment for the assembly, based on the type strain assemblies in GenBank (see more details at https://github.com/ncbi/pgap/wiki/Taxonomy-Check).
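As a rough illustration of the suggestion above, the following Python sketch assembles such a command line without running it. Only --taxcheck-only comes from this thread; the -o, -g and -s options are recalled from the PGAP quick start and should be checked against pgap.py --help for your release. The file name, output directory and the genus "Bacillus" are placeholders.

```python
import shlex

def taxcheck_command(fasta, organism, outdir):
    """Build (but do not run) a pgap.py call that performs only the taxonomy check."""
    return [
        "./pgap.py",
        "--taxcheck-only",  # flag named in this thread
        "-o", outdir,       # output directory (assumed option, verify with --help)
        "-g", fasta,        # genome FASTA (assumed option, verify with --help)
        "-s", organism,     # organism name; here only a candidate genus
    ]

# Placeholder inputs, purely for illustration.
cmd = taxcheck_command("my_mag.fasta", "Bacillus", "taxcheck_out")
print(shlex.join(cmd))
```

The list form can be passed to subprocess.run unchanged once the options have been confirmed.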
Thank you very much for the suggestions. I am trying the --taxcheck-only flag and will test how well this works for these cases. Do you have any advice on how to fill out the YAML file if a real taxonomic assignment is not possible? Would the best option be to proceed with the closest known genus?
Yes, the closest genus would be your best option. The taxonomic assignment is needed for choosing two things: the genetic code, and the set of proteins to align. Specifying the genus is sufficient for both.
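For readers wondering what "specifying the genus" looks like in practice, here is a minimal Python sketch that writes out the two YAML inputs as text. The key names (organism, genus_species, fasta, submol, class, location) are recalled from the PGAP wiki rather than taken from this thread, so treat them as assumptions and compare with the current documentation. The genus and file names are placeholders.

```python
def submol_yaml(genus):
    """Return text for a minimal submol.yaml that names only a genus (assumed layout)."""
    return (
        "organism:\n"
        f"    genus_species: '{genus}'\n"
    )

def input_yaml(fasta, submol="submol.yaml"):
    """Return text for an input.yaml pointing at the FASTA and submol files (assumed layout)."""
    return (
        "fasta:\n"
        "    class: File\n"
        f"    location: {fasta}\n"
        "submol:\n"
        "    class: File\n"
        f"    location: {submol}\n"
    )

# Placeholder values, purely for illustration.
print(submol_yaml("Bacillus"))
print(input_yaml("my_mag.fasta"))
```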
Ok, I will try that. Thank you very much for your help!
Hello,
I have an additional question regarding this issue. I understand that pgap needs a genus assignment, and that it is needed for choosing the genetic code, and the set of proteins to align. In my case, I have a circular MAG (this is a polished assembly of long reads). I have tried to give it a taxonomic assignment using gtdbtk, which could only assign it to the class level using pplacer (there was no reference close enough in their database to compute ANI and aligned fraction): d__Bacteria;p__Calditrichota;c__Calditrichia;o__;f__;g__;s__
I could pick a genus at random within this class, but since this would change the set of proteins to align, I'm worried that it could change the output annotation.
My question is: how would you proceed at NCBI to annotate such a genome if it was submitted? What would be the set of proteins chosen?
Thank you,
Hugo Doré
Hi, Hugo.
The most important parameters are the genetic code, superkingdom and order. If your arbitrary genus assignment matches the desired set for these, the job is done.
Hello,
Thank you for your quick reply. Would it not make sense to lower the constraint of providing a genus then?
I also tried to use taxcheck-only on some other genomes, and it gave me unexpectedly low (Query coverage, Subject coverage) values (around 1% or below 1%), where I would expect much higher values. But maybe this belongs to another ticket. In addition, even if the status of taxcheck is "inconclusive", it outputs a "Predicted organism" and seems to choose the one with highest ANI even if the "coverage" value is very low, which I find surprising.
Would it not make sense to lower the constraint of providing a genus then?
For taxonomic checks it's important.
But maybe this belongs to another ticket.
That would help us to keep Issues more organized, yes. Thanks!
The genetic code varies by genus. Different genera in the same family may have different genetic codes. That said, the most common genetic code is 11. If you selected a genus in the order you have pinpointed (can you do that?), chances are high that the annotation would turn out fine.
Did you obtain these ANI results with MAGs or with fully characterized genomes?
Thank you for the additional information and the comment on genetic code. I can select a genus in the same class as my genome's. For cases where gtdbtk indicates the NCBI assembly accession of the closest reference, I can use that. But for cases such as the one above, where gtdbtk does not give the closest reference, this will be a fairly random genus, since I'm trying to automate the process for a number of MAGs.
I'll open a new ticket for the issues regarding taxcheck (these are MAGs, but very closely related to some of NCBI genomes).
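When automating this over many MAGs, a useful first step is to find the deepest rank that GTDB-Tk actually assigned, so the candidate genus can be drawn from that clade. The sketch below assumes the usual GTDB lineage format with rank prefixes (d__, p__, c__, o__, f__, g__, s__) separated by semicolons; the function name is hypothetical, and choosing the genus itself is left out because it depends on the reference taxonomy you use.

```python
RANKS = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}

def lowest_assigned_rank(lineage):
    """Return (rank_name, taxon) for the deepest non-empty rank, or None if there is none."""
    best = None
    for field in lineage.split(";"):
        prefix, sep, name = field.strip().partition("__")
        # Keep only well-formed fields that carry a taxon name.
        if sep and prefix in RANKS and name:
            best = (RANKS[prefix], name)
    return best

lineage = "d__Bacteria;p__Calditrichota;c__Calditrichia;o__;f__;g__;s__"
print(lowest_assigned_rank(lineage))  # ('class', 'Calditrichia')
```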
Hi, I'm not sure if this is the best place to ask about this, but I felt that it didn't need its own thread.
So, most tools use only 2 or 3 genetic codes: GMS2 uses 3, RAST uses 2... But according to, for example, this page, there are many more. For example, the Candidate Division one includes a difference in stop codons.
In this thread you talk about only two codes too, 11 and 4. So I was wondering: what happens if I want to annotate a Gracilibacteria genome? Which steps are affected by the genetic code choice, and how?
The best way to answer your question about the Candidatus Gracilibacteria phylum is to try one of the taxa under it.
As for the steps: the version of tRNAscan that we use does not recognize genetic code 25 right now, and it will default to one of the standard codes (1 or 11).
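To make concrete why the choice of genetic code matters for protein products, here is a toy Python sketch showing how a single TGA codon is read under NCBI translation tables 11, 4 and 25 (stop, tryptophan and glycine respectively). The codon table is deliberately tiny and the sequence is made up; this is not how PGAP or tRNAscan implement translation.

```python
# Codons shared by all three tables in this toy example.
BASE = {"ATG": "M", "AAA": "K", "TGG": "W", "TAA": "*", "TAG": "*"}

# Meaning of TGA under each NCBI translation table.
TGA = {11: "*", 4: "W", 25: "G"}

def translate(seq, code):
    """Translate seq in frame 0 with the toy table, stopping at the first stop codon."""
    table = dict(BASE, TGA=TGA[code])
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = table[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

seq = "ATGTGAAAATAA"  # made-up sequence: ATG TGA AAA TAA
for code in (11, 4, 25):
    print(code, translate(seq, code))
# 11 M
# 4 MWK
# 25 MGK
```

Under code 11 the protein is cut short at the TGA, which is why annotating a code 4 or code 25 genome with the wrong table tends to produce truncated gene models.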
I see, I'll keep it in mind then. Thank you for answering!
Hello! I was wondering if there is a way to use pgap to annotate a novel genus that is not present in the NCBI taxonomy? I understand that one way would be to first have the taxon registered but wanted to see whether pgap can be used without this step, perhaps using a higher taxonomic rank instead?