merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
423 stars 144 forks source link

[BUG] anvi-meta-pan-genome does not allow me to use gene calls not from prodigal #2308

Closed Louis-MG closed 1 month ago

Louis-MG commented 1 month ago

Short description of the problem

anvi-meta-pan-genome yields config error 'no gene calls' because the gene caller is not prodigal but I cannot tell anvio to change the gene-caller.

anvi'o version

Anvi'o .......................................: marie (v8)
Python .......................................: 3.10.14

Profile database .............................: 38
Contigs database .............................: 21
Pan database .................................: 16
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2

System info

Anvio is installed following the conda (mamba) installation instructions.

Detailed description of the issue

Hi ! I am reaching the end of my workflow (following https://dutter.github.io/projects/oral_metapan, particularily step 4) and this is bugging me. I have fungi genomes, thus using augustus as a gene caller. I have obtained profile DB's and PAN DB's per environment. My command for site in dry sebaceous moist; do anvi-meta-pan-genome -i "$site"-internal-genomes-table.txt -g fungi-"$site"-GENOMES.db -p fungi-"$site"-PAN/*PAN.db; done yields the error message:

Config Error: None of your genomes seem to have a gene call, which is a typical error you get
              if you are working with contigs databases with external gene calls. You can    
              solve it by looking at the output of the program `anvi-db-info` for a given    
              contigs database in your collection, and use one of the gene caller sources    
              listed in the output using the `--gene-caller` parameter. 

I checked using anvi-db-info fungi_contigs-db.db :

DB Info (no touch)
===============================================
Database Path ................................: fungi_contigs-db.db
description ..................................: [Not found, but it's OK]
db_type ......................................: contigs (variant: unknown)
version ......................................: 21

DB Info (no touch also)
===============================================
project_name .................................: Fungi
contigs_db_hash ..............................: hash0fa2cb47
split_length .................................: 9223372036854775807
kmer_size ....................................: 4
num_contigs ..................................: 72
total_length .................................: 16324625
num_splits ...................................: 72
gene_level_taxonomy_source ...................: None
genes_are_called .............................: 1
external_gene_calls ..........................: 1
external_gene_amino_acid_seqs ................: 1
skip_predict_frame ...........................: 0
splits_consider_gene_calls ...................: 1
scg_taxonomy_was_run .........................: 0
scg_taxonomy_database_version ................: None
trna_taxonomy_was_run ........................: 0
trna_taxonomy_database_version ...............: None
creation_date ................................: 1720035642.02313
gene_function_sources ........................: augustus

* Please remember that it is never a good idea to change these values. But in some
  cases it may be absolutely necessary to update something here, and a
  programmer may ask you to run this program and do it. But even then, you
  should be extremely careful.

AVAILABLE GENE CALLERS
===============================================
* 'augustus' (6,317 gene calls)

AVAILABLE FUNCTIONAL ANNOTATION SOURCES
===============================================
* augustus (6,317 annotations)

AVAILABLE HMM SOURCES
===============================================
* This contigs db does not have HMMs :/

So there are gene calls. I checked the options for anvi-meta-pan-genome and anvi-pan-genome but none of them are --gene-caller as the config error message would suggest. Still tried, failed.

Files / commands to reproduce the issue

https://drive.google.com/file/d/1gNJVQkwQ8jX7C5Liso5ospzNhgFMg28C/view?usp=sharing contains contig, genome, pan DBs for one environment.

ivagljiva commented 1 month ago

@Louis-MG , you should specify the gene caller when you create the genomes storage database that you provide as input to this program. anvi-gen-genomes-storage accepts the --gene-caller flag.

Louis-MG commented 1 month ago

Which I did, using anvi-gen-genomes-storage -i moist-internal-genomes-table.txt -o fungi-moist-GENOMES.db --gene-caller augustus. Here is the DB info:

DB Info (no touch)
===============================================
Database Path ................................: fungi-moist-GENOMES.db
db_type ......................................: genomestorage (variant: None)
version ......................................: 7

DB Info (no touch also)
===============================================
creation_date ................................: 1721225060.69279
functions_are_available ......................: 1
gene_function_sources ........................: augustus
hash .........................................: hash82f57bcd

* Please remember that it is never a good idea to change these values. But in some
  cases it may be absolutely necessary to update something here, and a
  programmer may ask you to run this program and do it. But even then, you
  should be extremely careful.

GENOMES IN STORAGE (n=2)
===============================================
Malassezia_globosa, Malassezia_restricta
Louis-MG commented 1 month ago

Juste re-ran the command to give you the output:

WARNING
===============================================
Good news! Anvi'o found all these functions that are common to all of your
genomes and will use them for downstream analyses and is very proud of you:
'augustus'.

Internal genomes .............................: 2 have been initialized.                                                                                                                                                                                                                                                                              
External genomes .............................: 0 found.                                                                                                                                                                                                                                                                                              

WARNING
===============================================
The contigs databases you are using for this analysis are missing HMMs for
single-copy core genes. Maybe you haven't run `anvi-run-hmms` on your contigs
database, or they didn't contain any hits. It is perfectly legal to have anvi'o
contigs databases without HMMs or SCGs for things to work, but we wanted to give
you heads up so you can have your 'aha' moment if you see funny things in the
interface.

* Malassezia_globosa is stored with 3,105 genes (9 of which were partial)
* Malassezia_restricta is stored with 3,210 genes (0 of which were partial)                                                                                                                                                                                                                                                                           

The new genomes storage ......................: fungi-moist-GENOMES.db (v7, signature: hashbffe03ca)
Number of genomes ............................: 2 (internal: 2, external: 0)
Number of gene calls .........................: 6,315
Number of partial gene calls .................: 9
ivagljiva commented 1 month ago

Thanks for the additional context. In that case, it is certainly a bug, and it should be fixable by simply adding the --gene-caller flag to those programs.

The gene caller sanity check is being run by the GenomeDescriptions class (see it in the codebase here) which is called from the MetaPangenome class here. The GenomeDescriptions class already accepts the gene-caller parameter, so the fix should simply be a matter of including that parameter in anvi-meta-pan-genome (probably also anvi-pan-genome but I haven't checked that one yet). The GenomeDescriptions instance is initialized with the same arguments object as the MetaPangenome object so no need to do anything extra to pass that parameter along.

ivagljiva commented 1 month ago

As far as I can tell, we shouldn't have to change anvi-pan-genome at all, because that program simply uses whichever genes are present in the genomes storage database to create the pangenome (and anvi-gen-genomes-storage accepts the --gene-caller flag). Which makes sense, since you didn't get an error at that step complaining about having no genes to make a pangenome out of.

So, in theory, the only thing to be done is to allow anvi-meta-pan-genome to accept that flag so you can get past the sanity check. Which I have already implemented, but working on testing it, just to make sure that using a different gene caller doesn't break something else downstream.

Louis-MG commented 1 month ago

Excellent news ! I'm curious as to why the gene caller used is not the one present when there is only one. Si there something under the hood that does something by default for prodigal but has to change for other gene callers ?

ivagljiva commented 1 month ago

No, nothing is done differently for prodigal. Gene calls are stored in the same fashion in the databases regardless of source. And while we could have implemented something smart like that, to select the only gene calls available regardless of their source, we didn't do that 😂

Instead, we implemented GenomeDescriptions so that it could accept the gene caller flag if provided, and if not provided, the gene caller is set to the default (which is prodigal at the moment). Since the MetaPangenome class doesn't accept the gene caller flag (yet), when it initializes GenomeDescriptions that class doesn't get that argument, so it defaults to prodigal. And in your case, the sanity check then kicks in and puts a stop to things.

ivagljiva commented 1 month ago

So, I added the --gene-caller flag to anvi-meta-pan-genome with commit b62613321d870d5261592c5ae3f1e381fefb2fa2, and it seems to have fixed the issue without requiring other changes.

This is how I tested it. I downloaded the Prochlorococcus isolates used in our metapangenome study, migrated everything to the lastest db versions, and slightly changed the gene call source:

tar -xvf ANVIOPROFILESFORPROCHLOROCOCCUSISOLATESACROSSTARA.tar.gz
cd ANVIO-PROFILES-FOR-PROCHLOROCOCCUS-ISOLATES-ACROSS-TARA/
anvi-migrate --migrate-safely *.db
cp CONTIGS.db CONTIGS_NEW_GENE_CALLER.db

# change the name of the gene caller
sqlite3 CONTIGS_NEW_GENE_CALLER.db "update genes_in_contigs set source = replace( source, 'prodigal', 'pyrodigal')"

Then I imported the collection of isolate genomes, and replicated the Prochlorococcus pangenome, but using the non-default --gene-caller option:

anvi-import-collection anvio-collection-for-pro-isolates.txt -c CONTIGS_NEW_GENE_CALLER.db -p PROFILE.db -C Genomes
anvi-script-gen-genomes-file -c CONTIGS_NEW_GENE_CALLER.db -p PROFILE.db -C Genomes -o internal-genomes.txt
anvi-gen-genomes-storage -i internal-genomes.txt                          -o Prochlorococcus-PAN-GENOMES.db --gene-caller pyrodigal
anvi-pan-genome -g Prochlorococcus-PAN-GENOMES.db \
                --minbit 0.5 \
                --mcl-inflation 10 \
                --project-name Prochlorococcus-PAN \
                --num-threads 15

Finally I ran anvi-meta-pan-genome with the new flag:

anvi-meta-pan-genome -g Prochlorococcus-PAN-GENOMES.db \
           -p  Prochlorococcus-PAN/Prochlorococcus-PAN-PAN.db \
           -i internal-genomes.txt \
           --gene-caller pyrodigal

It finished successfully :) And I can display the metapangenome and see all the layers (though it looks very ugly and busy without any polishing):

image

@Louis-MG , if you could switch to the development version of anvi'o, test it out on your data (you should only have to run the anvi-meta-pan-genome command with the new flag), and let us know how it goes, we would appreciate it :)

Louis-MG commented 1 month ago

I'm going to try it this afternoon, i'll keep you updated.

Louis-MG commented 1 month ago

I installed anvio dev. Updating the BDs:

$ anvi-migrate --migrate-safely fungi-moist-GENOMES.db

NEW MIGRATION TASK
===============================================
Input file path ..............................: fungi-moist-GENOMES.db
Input file type ..............................: genomestorage

* Your genomestorage looks good and needs no updatin' ☘️
anvi-migrate --migrate-safely fungi-moist-GENOMES.db

NEW MIGRATION TASK
===============================================
Input file path ..............................: fungi-moist-GENOMES.db
Input file type ..............................: genomestorage

* Your genomestorage looks good and needs no updatin' ☘️
anvi-migrate --migrate-safely fungi-moist-GENOMES.db

NEW MIGRATION TASK
===============================================
Input file path ..............................: fungi-moist-GENOMES.db
Input file type ..............................: genomestorage

* Your genomestorage looks good and needs no updatin' ☘️

$ anvi-migrate --migrate-safely fungi-moist-PAN/fungi-moist-PAN.db
NEW MIGRATION TASK
===============================================
Input file path ..............................: fungi-moist-PAN/fungi-moist-PAN.db
Input file type ..............................: pan
Current Version ..............................: 16
Target Version ...............................: 17
Migration mode ...............................: Safe

SQLite Version ...............................: 3.46.0

* Congratulations! Your pan database is now version 17, which means it now
  contains two new empty tables. These tables are not as boring as they first
  appear, because you can now run `anvi-reaction-network` to store a network of
  the metabolic reactions that may be encoded by gene clusters. A metabolic
  model representing the network can be exported from the database using `anvi-
  get-metabolic-model-file`.

$ anvi-migrate --migrate-safely /mnt/projects_tn03/metapangenome/DATA/results/pangenomeDB/fungi_pangenome/bowtie/moist/moist-MERGED/PROFILE.db

NEW MIGRATION TASK
===============================================
Input file path ..............................: /mnt/projects_tn03/metapangenome/DATA/results/pangenomeDB/fungi_pangenome/bowtie/moist/moist-MERGED/PROFILE.db
Input file type ..............................: profile
Current Version ..............................: 38
Target Version ...............................: 40
Migration mode ...............................: Safe

SQLite Version ...............................: 3.46.0

* The profile database is now 39. We recently added a parameter for ancient DNA-
  friendly profiling of SNVs, SCVs, and SAAVs. If you are interested to learn
  more, see https://github.com/merenlab/anvio/pull/2081. Version bump was
  necessary to make sure all anvi'o profile-dbs have a means to track whether an
  additional parameter was used by the user to skip the edges of short reads
  during profiling.

* Congratulations! Your profile database is now version 40, which means it now
  contains two new empty tables. Now you can run `anvi-import-protein-profile`
  and `anvi-import-metabolite-profile` to store protein and metabolite
  abundances across samples.

$ anvi-migrate --migrate-safely /mnt/projects_tn03/metapangenome/DATA/results/pangenomeDB/fungi_pangenome/fungi_contigs-db.db

NEW MIGRATION TASK
===============================================
Input file path ..............................: /mnt/projects_tn03/metapangenome/DATA/results/pangenomeDB/fungi_pangenome/fungi_contigs-db.db
Input file type ..............................: contigs
Current Version ..............................: 21
Target Version ...............................: 23
Migration mode ...............................: Safe

SQLite Version ...............................: 3.46.0

* Congratulations! Your contigs database is now version 22. This update fixes a
  bug that prevented construction and storage of a metabolic reaction network in
  version 21 databases that were created from scratch and not migrated from
  version 20. You can blame the author of this script for the oversight
  responsible for the bug and this annoying migration. If the version 21
  database was created from scratch, then placeholders for three 'metavariables'
  should now have been added to the database; if the version 21 database
  migrated from version 20, then nothing at all should have changed in the
  database except the version number.

* Your contigs database is now version 23. Sadly this update removed all SCG
  taxonomy data in this contigs-db due to a change in the set of SCGs anvi'o now
  uses for taxonomy estimation. As a result, you will need to re-run anvi-run-
  scg-taxonomy command on this contigs-db :/ If you would like to learn why this
  was necessary, please visit https://github.com/merenlab/anvio/issues/2211. We
  thank you for your patience!

$ anvi-meta-pan-genome -i moist-internal-genomes-table.txt  -g fungi-moist-GENOMES.db -p fungi-moist-PAN/*PAN.db --gene-caller augustus

That last command yields a bunch of text that concludes with Config Error: The requested genome storage hash ('hash82f57bcd') does not match with the one read from the database ('hash5399655c'). I guess redoing the genome DB to check the gene caller broke the hash correspondance, So let's build it again :

anvi-gen-genomes-storage -i moist-internal-genomes-table.txt -o fungi-moist-GENOMES.db --gene-caller augustus

Now the anvi-meta-pan-genome is currently doing the work with --gene-caller augustus ! Thanks a lot.