ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline

Running PGAP with Metagenomic Assembled Genome #275

Closed kevinmyers closed 4 months ago

kevinmyers commented 8 months ago

Describe the bug: I have a set of MAGs that I would like to annotate with PGAP. We have done this previously when submitting the MAGs to NCBI, but we would like to run it locally on the command line. I have successfully run PGAP (version 2023-10-03.build7061) with the test genome. When I try to run one of my MAGs, PGAP fails.

To Reproduce: I am able to successfully run PGAP with the test genome. I am attaching the FASTA file used.

The command used was:

python pgap.py -r -o 3300062455_11_results -g 3300062455_11_cleaned.fasta -s 'Eggerthellales'

I have also tried this with the following command and get the same error:

python pgap.py -r -o 3300062455_11_results2 -g 3300062455_11_cleaned.fasta -s 'Eggerthellales' --taxcheck --auto-correct-tax

Expected behavior: I expected PGAP to run. I am not sure whether the failure is due to the species specification, because I do not have classification down to the species level for many of these MAGs.

Software versions (please complete the following information):

Log Files: I am attaching the compressed version of the tmp-outdir directory. I'm also attaching the FASTA file used.

tmp-outdir.tar.gz
3300062455_11_cleaned.fasta.txt

azat-badretdin commented 8 months ago

Thank you for your report, Kevin! We opened an internal investigation and we will proceed with this case ASAP

thibaudnis commented 8 months ago

Hi Kevin - Eggerthellales is a taxonomic order. To produce an annotation, PGAP needs to be provided with the binomial name of the species, or with a genus. Please try the command with --auto-correct-tax again, but this time pass a 'best guess' for the genus to -s. If you still don't get results, please post the top 10 lines of the output file ani-tax-report.txt.
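For anyone scripting this over several MAGs, the suggested invocation can be assembled programmatically. This is only a minimal sketch: the helper name `build_pgap_command` is hypothetical, it uses only the flags that appear in this thread (`-r`, `-o`, `-g`, `-s`, `--taxcheck`, `--auto-correct-tax`), and it cannot check that the name passed to `-s` really is a genus or a binomial.

```python
import shlex


def build_pgap_command(fasta, outdir, organism, taxcheck=True, auto_correct=True):
    """Assemble a pgap.py command line from the flags shown in this thread.

    Hypothetical helper: it does not verify the taxonomic rank of `organism`.
    """
    cmd = ["python", "pgap.py", "-r", "-o", outdir, "-g", fasta, "-s", organism]
    if taxcheck:
        cmd.append("--taxcheck")
    if auto_correct:
        cmd.append("--auto-correct-tax")
    return cmd


cmd = build_pgap_command(
    "3300062455_11_cleaned.fasta",  # FASTA file from the report above
    "3300062455_11_results3",       # output directory
    "Eggerthella",                  # best-guess genus
)
# Print a copy-pasteable shell command; subprocess.run(cmd) would execute it.
print(shlex.join(cmd))
```

Keeping the command as a list avoids shell-quoting problems if an organism name contains a space, as a binomial name would.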

kevinmyers commented 8 months ago

Thanks for the suggestion. The closest match to the MAG I am using is an "uncultured bacterium", so I picked another genus within the family for the -s option. I used the following command:

python pgap.py -r -o 3300062455_11_results3 -g 3300062455_11_cleaned.fasta --taxcheck --auto-correct-tax -s "Eggerthella"

And got the following error on the command line:

TAXCHECK completed successfully.
DEBUG: args.output = 3300062455_11_results3
DEBUG: params.outputdir = /3300062455_11_results3
ERROR: taxcheck failed to assign a species with high confidence, thus PGAP will not execute. See 3300062455_11_results3/ani-tax-report.txt

I am attaching the file ani-tax-report.txt; its first 10 lines are:

ANI report for assembly: 3300062455_11_cleaned.fasta
Submitted organism: Eggerthella (taxid = 84111, rank = genus, lineage = Bacteria; Actinomycetota; Coriobacteriia; Eggerthellales; Eggerthellaceae)
Best match: (none)
Submitted organism has type: No
Status: INCONCLUSIVE
Confidence: LOW
Table legend:
ANI  : ANI value between this assembly and the type listed in this row
(Coverages) : query-coverage and subject-coverage of this assembly (query) and the type (subject)
NewSeq : the count of bases best assigned to this type assembly

It appears it failed due to low ANI match. Is there a way to change the ANI threshold for MAGs?

ani-tax-report.txt
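When checking many MAGs, the header fields of the report can be pulled out with a few lines of Python. This is a rough sketch rather than anything shipped with PGAP: the function name `parse_ani_header` is hypothetical, and it assumes the header always uses the `Key: value` layout seen in the excerpt from this thread.

```python
import re

# Header keys of interest, taken from the report excerpt in this thread.
WANTED = {"Submitted organism", "Best match", "Status", "Confidence"}


def parse_ani_header(text):
    """Return selected header fields of an ani-tax-report.txt as a dict."""
    fields = {}
    for raw in text.splitlines():
        # Tolerate a leading line number such as the ones `cat -n` adds.
        line = re.sub(r"^\s*\d+\s+", "", raw).strip()
        key, sep, value = line.partition(":")
        if sep and key.strip() in WANTED:
            fields[key.strip()] = value.strip()
    # The submitted rank is embedded in the "Submitted organism" value.
    rank = re.search(r"rank = (\w+)", fields.get("Submitted organism", ""))
    fields["rank"] = rank.group(1) if rank else None
    return fields


SAMPLE = """\
ANI report for assembly: 3300062455_11_cleaned.fasta
Submitted organism: Eggerthella (taxid = 84111, rank = genus, lineage = Bacteria; Actinomycetota; Coriobacteriia; Eggerthellales; Eggerthellaceae)
Best match: (none)
Submitted organism has type: No
Status: INCONCLUSIVE
Confidence: LOW
"""

report = parse_ani_header(SAMPLE)
print(report["Status"], report["Confidence"], report["rank"])
```

For a real run, the text would come from reading the `ani-tax-report.txt` file in the output directory instead of `SAMPLE`.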

azat-badretdin commented 8 months ago

ANI thresholds are specified in the file ANI_cutoff.xml in our reference data package. You are of course welcome to hack it by supplying different values for your taxid, but we won't be able to provide support, since we have not tested our software for this sort of use.

kevinmyers commented 8 months ago

Thanks. I don't really want to try to hack the .xml file.

But would that be the only way to run PGAP locally with these MAGs?

azat-badretdin commented 8 months ago

One of our internal conditions for annotation is taxonomic confidence in what it is that we are annotating. We need to know, for example, whether to use genetic code (gcode) 4 or 11, and taxonomy is the way to know that. We also need to pick taxonomically relevant reference data.
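To illustrate why the choice of genetic code matters: in NCBI translation table 4 the codon TGA encodes tryptophan, whereas table 11 reads it as a stop. The sketch below is a toy translator, not PGAP code; the names `codon_map` and `translate` and the example sequence are made up for illustration, and start-codon differences between the tables are ignored.

```python
BASES = "TCAG"  # base order used by the NCBI genetic code listings

# Amino acids for the 64 codons in TCAG order; the tables differ only at TGA.
AA_TABLE_11 = (
    "FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR" "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG"
)
AA_TABLE_4 = (
    "FFLLSSSSYY**CCWW" "LLLLPPPPHHQQRRRR" "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG"
)


def codon_map(amino_acids):
    """Map each codon (in TCAG order) to its one-letter amino acid."""
    codons = [a + b + c for a in BASES for b in BASES for c in BASES]
    return dict(zip(codons, amino_acids))


GCODE_11 = codon_map(AA_TABLE_11)
GCODE_4 = codon_map(AA_TABLE_4)


def translate(dna, table):
    """Translate from the first base and stop at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = table[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


seq = "ATGTGAGCTTAA"  # made-up sequence with an in-frame TGA
print(translate(seq, GCODE_11))  # truncated at TGA
print(translate(seq, GCODE_4))   # reads through TGA as tryptophan (W)
```

Annotating a gcode 4 genome with gcode 11 would truncate every protein that contains an in-frame TGA, which is why the pipeline wants the taxonomy settled first.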

kevinmyers commented 8 months ago

Thanks for the information.

I know that when I submit the MAGs to NCBI (for publication), they are able to run PGAP even on MAGs for which we have little specific taxonomic classification at the genus or species level. I guess I am wondering how NCBI can run PGAP on these MAGs, and whether I can run PGAP on them locally.

azat-badretdin commented 8 months ago

I am not sure. Any public examples of such assemblies? Have you tried to run them through PGAP yourself?

kevinmyers commented 8 months ago

Here are some examples of assemblies that we submitted to NCBI and were successfully annotated with PGAP.

https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_022482765.1/ (order classification)
https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_022486885.1/ (family classification)
https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_022651085.1/ (order classification)

I have not tried to run them through PGAP locally.

azat-badretdin commented 8 months ago

Thank you, Kevin!

thibaudnis commented 8 months ago

Stand-alone PGAP is more conservative than the PGAP used on submitted assemblies. It requires that there be a genus in the lineage of the organism that the user provides or that ANI assigns, which is not the case for GCA_022482765.1, mentioned in a comment above, for example. When you submit an assembly to GenBank and request PGAP annotation, there is no such constraint: PGAP will use the genetic code in NCBI Taxonomy for the annotation. However, there is no guarantee that this is the correct genetic code to use. This is not an ideal situation. We would like to develop, or use an existing, method to determine the genetic code with confidence based exclusively on the genomic sequence. We have a few ideas on how that could be done, but I don't have a timeline for when this work will be completed.