merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
413 stars 142 forks source link

[BUG] Anvi'o-7 anvi-compute-functional-enrichment do not have an option to choose gene caller #2257

Closed TomKLHui closed 2 months ago

TomKLHui commented 2 months ago

Short description of the problem

followed the error message upon running anvi-compute-functional-enrichment to include a gene caller name, but surprised to find that the script cannot specify gene caller name as it suggested.

anvi'o version

Anvi'o .......................................: hope (v7)

Profile database .............................: 35 Contigs database .............................: 20 Pan database .................................: 14 Genome data storage ..........................: 7 Auxiliary data storage .......................: 2 Structure database ...........................: 2 Metabolic modules database ...................: 2 tRNA-seq database ............................: 1

System info

Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-171-generic x86_64) conda installed anvio-7 enrvironment

Detailed description of the issue

anvi-compute-functional-enrichment -i TT_internal_genomes.txt -o TT_functional_enrichment.txt --annotation-source KOfam -F TT_FUNC_OCCURENCE.txt

CITATION
===============================================
This program will compute enrichment scores using an R script developed by Amy
Willis. You can find more information about it in the following paper: Shaiber
et al (https://doi.org/10.1186/s13059-020-02195-w). When you publish your
findings, please do not forget to properly credit this work. :)

WARNING
===============================================
Good news! Anvi'o found all these functions that are common to all of your
genomes and will use them for downstream analyses and is very proud of you:
'KEGG, COG, KOfam, COGPATH, PFAM, KEGGPATH, KEGG_Module, KEGG_Class'.

Internal genomes .............................: 78 have been initialized.                                               External genomes .............................: 0 found.                                                                

JUST FYI
===============================================
Some of your genomes had gene calls identified by gene callers other than the
anvi'o default, 'prodigal', and will not be processed. Use the `--debug` flag if
this sounds important and you would like to see more of this message.

Config Error: None of your genomes seem to have a gene call, which is a typical error you get
              if you are working with contigs databases with external gene calls. You can    
              solve it by looking at the output of the program `anvi-db-info` for a given    
              contigs database in your collection, and use one of the gene caller sources    
              listed in the output using the `--gene-caller` parameter.                      

(anvio-7) tom@poweredge:/home/tom/data/HSC$ anvi-compute-functional-enrichment -i TT_internal_genomes.txt -o TT_functional_enrichment.txt -F TT_FUNC_OCCURENCE.txt^C
(anvio-7) tom@poweredge:/home/tom/data/HSC$ anvi-db-info TT_anvio/CONTIGS.db 

DB Info (no touch)
===============================================
Database Path ................................: TT_anvio/CONTIGS.db
Description ..................................: [Not found, but it's OK]
Type .........................................: contigs
Version ......................................: 20

DB Info (no touch also)
===============================================
project_name .................................: TT
contigs_db_hash ..............................: hash119e72f2
split_length .................................: 20000
kmer_size ....................................: 4
num_contigs ..................................: 9174248
total_length .................................: 6836483035
num_splits ...................................: 9180253
gene_level_taxonomy_source ...................: None
genes_are_called .............................: 1
external_gene_calls ..........................: 1
external_gene_amino_acid_seqs ................: 0
skip_predict_frame ...........................: 0
splits_consider_gene_calls ...................: 1
trna_taxonomy_was_run ........................: 0
trna_taxonomy_database_version ...............: None
creation_date ................................: 1710938809.93128
scg_taxonomy_was_run .........................: 1
scg_taxonomy_database_version ................: v95
modules_db_hash ..............................: 45b7cc2e4fdc
gene_function_sources ........................: KEGG_Class,KOfam,KEGG,COG,PFAM,COGPATH,KEGGPATH,KEGG_Module

* Please remember that it is never a good idea to change these values. But in some
cases it may be absolutely necessary to update something here, and a programmer
may ask you to run this program and do it. But even then, you should be
extremely careful.

AVAILABLE GENE CALLERS
===============================================
* 'barrnap:0.9-dev' (6,126 gene calls)
* 'aragorn' (30,857 gene calls)
* 'Ribosomal_RNA_28S' (67 gene calls)
* 'Ribosomal_RNA_23S' (153 gene calls)
* 'Ribosomal_RNA_18S' (32 gene calls)
* 'Ribosomal_RNA_16S' (63 gene calls)
* 'Prodigal_v2.6.3' (9,132,127 gene calls)

AVAILABLE HMM SOURCES
===============================================
* 'Ribosomal_RNA_12S' (type: Ribosomal_RNA_12S; num genes: 1)
* 'Ribosomal_RNA_16S' (type: Ribosomal_RNA_16S; num genes: 3)
* 'Ribosomal_RNA_28S' (type: Ribosomal_RNA_28S; num genes: 1)
* 'Ribosomal_RNA_23S' (type: Ribosomal_RNA_23S; num genes: 2)
* 'Archaea_76' (type: singlecopy; num genes: 76)
* 'Ribosomal_RNA_18S' (type: Ribosomal_RNA_18S; num genes: 1)
* 'Protista_83' (type: singlecopy; num genes: 83)
* 'Ribosomal_RNA_5S' (type: Ribosomal_RNA_5S; num genes: 5)
* 'Bacteria_71' (type: singlecopy; num genes: 71)

but when i run the same command with an additional --gene-caller flag it just return the help message indicating unknown arguments.

I wonder if there is any way I could change the name of gene caller (to remove the _v2.6.3" after, or make the script recognize other gene callers? I know v7 is a very old version of anvio, but i imported the CONTIG.db and PROFILE.db from other pipelines with prepared annotations and prefer not to rerun the whole annotation/mapping process if i could.

If this issue is solved in v8, would it be practical/compatible to do it in the anvio-8 environment? or do i have to (sadly) run all the things all over again because the additional information provided on the gene-caller?

Thanks, Tom

ivagljiva commented 2 months ago

I don't think this script ever accepted the --gene-caller flag.

In anvi'o v8, we split this program into three parts depending on input type - the equivalent program for what you want to do in v8 would be anvi-compute-functional-enrichment-across-genomes (https://anvio.org/help/8/programs/anvi-compute-functional-enrichment-across-genomes/) - and none of those 3 programs accept the --gene-caller flag. So using your custom gene caller is not possible even in anvi'o v8. This is something that we would have to implement in anvi'o dev.

So I guess there is no point in upgrading your anvi'o since this is not yet implemented. But for future reference, we usually try to make it so that you don't lose any data when you upgrade. There is this program called anvi-migrate (https://anvio.org/help/8/programs/anvi-migrate/) which updates your databases to be compatible with newer versions of anvi'o, and most of the time it doesn't get rid of any data (there are a few exceptions, like in our most recent update to SCG taxonomy which gets rid of the old taxonomy information and requires people to re-run anvi-run-scg-taxonomy). I believe you could safely update to v8 (and if you are worried about it, keep a copy of the older databases before migrating). But like I said, there is no point to upgrading for this specific issue since we haven't implemented this flag for the enrichment programs, and I'm not sure when that will happen.

If you want, there is a silly hack you could do, which is to modify the genes_in_contigs table in your contigs databases to change your custom gene caller name of Prodigal_v2.6.3 to the default gene caller name prodigal (in the source column of that table). You'd have to write an SQL update command to do it, and run that on each database. Once you have the gene caller named with the string prodigal, this script would be able to extract those gene calls and run the enrichment test. If you try it and it still doesn't work, let us know :)

TomKLHui commented 2 months ago

Thanks for the prompt response! The silly little hack works well so i guess its not so silly afterall :D