merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

[BUG] `anvi-script-gen-function-matrix-across-genomes` does not know option `--skip-checking-genome-hashes` #2343

Open Louis-MG opened 2 days ago

Louis-MG commented 2 days ago

Short description of the problem

Help message indicates that I should use an option that is then refused as unknown.

anvi'o version

dev, installed following the documentation. Last updated Sept. 19 at 18:54 (Canadian time).

Detailed description of the issue

I want to duplicate a contigs database to compute functional enrichment across genomes, and I want anvi'o to ignore the database hashes so that it runs despite the duplicate. I believe you just forgot to implement the flag :D

(anvio-dev) me@vls142:/mnt/scratch/LM/pangenomic$ anvi-script-gen-function-matrix-across-genomes -e external_genomes.txt -G groups.txt --annotation-source Uniref90  --output-file-prefix functional_enrichment_all --skip-checking-genome-hashes

Groups found and parsed ......................: dry, moist, sebaceous, toenail

Config Error: While working on your external genomes, anvi'o realized that genome            
              GCF_000069245_1_ASM6924v1_genomic_contigs_db_db and                            
              GCF_000069245_1_ASM6924v1_genomic_contigs_db_db2 seem to have the same hash. If
              you are aware of this and/or if you would like anvi'o to not check genome      
              hashes, please use the flag `--skip-checking-genome-hashes`.  

meren commented 13 hours ago

Hey @Louis-MG,

I don't think this program actually cares about the duplicate hashes. I just tested it with copy-pasta genomes without using --skip-checking-genome-hashes, and it worked fine :)

$ anvi-script-gen-function-matrix-across-genomes -e external-genomes.txt \
                                                 -G groups.txt \
                                                 --annotation-source KOfam \
                                                 --output-file-prefix functional_enrichment_all
Groups found and parsed ......................: E_faecali, E_faecium

WARNING
===============================================
Just FYI, for any gene call with multiple functional annotations from the same
source in a given genome, anvi'o only kept the annotation with the BEST e-value.
Keep this in mind when interpreting the output of this program.

Number of KOfam functions found across 2 groups : 1,545
Number of KOfam functions associated with all groups and SKIPPED : 1,113
Number of KOfam functions in final occurrence table : 432

CITATION
===============================================
This program will compute enrichment scores using an R script developed by Amy
Willis. You can find more information about it in the following paper: Shaiber,
Willis et al (https://doi.org/10.1186/s13059-020-02195-w). When you publish your
findings, please do not forget to properly credit this work. :)

AMY's ENRICHMENT ANALYSIS 🚀
===============================================
Functional occurrence stats input file path:  : /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmp4wy4yzrb/FUNC_OCCURENCE_STATS.txt
Functional enrichment output file path:  .....: /Users/meren/Downloads/INFANT-GUT-TUTORIAL/additional-files/pangenomics/functional_enrichment_all-FUNCTIONAL-ENRICHMENT.txt
Temporary log file (use `--debug` to keep):  .: /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmpk3lz5hky

Functions across genomes (frequency) .........: /Users/meren/Downloads/INFANT-GUT-TUTORIAL/additional-files/pangenomics/functional_enrichment_all-FREQUENCY.txt
Functions across genomes (presence/absence) ..: /Users/meren/Downloads/INFANT-GUT-TUTORIAL/additional-files/pangenomics/functional_enrichment_all-PRESENCE-ABSENCE.txt

Only when I had literally identical genomes in two groups, anvi'o complained:

Groups found and parsed ......................: E_faecali, E_faecium

WARNING
===============================================
In an ideal world, each group would describe at least two layer names. It is not
the case for these groups: E_faecali, E_faecium. That is OK and anvi'o will
continue with this analysis, but if something goes wrong with your stats or
whatever, you will remember this moment and go like, "Hmm. That's why my
adjusted q-values are like one point zero 🤔".

WARNING
===============================================
Just FYI, for any gene call with multiple functional annotations from the same
source in a given genome, anvi'o only kept the annotation with the BEST e-value.
Keep this in mind when interpreting the output of this program.

Number of KOfam functions found across 2 groups : 1,246
Number of KOfam functions associated with all groups and SKIPPED : 1,246
Number of KOfam functions in final occurrence table : 0

Config Error: Something weird is happening here :( It seems every single function across your
              genomes is associated with all groups you have defined. There is nothing much
              anvi'o can work with here. If you think this is a mistake, please let us know.

I'm having a hard time reproducing this :(

Can you explain EXACTLY how you ended up here? Perhaps you can send us the external genomes file you have?
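
In case it helps narrow things down, here is a rough sketch for listing which entries in an external genomes file share a hash. It is not part of anvi'o and makes a few assumptions: that the file is TAB-delimited with `name` and `contigs_db_path` columns, that each contigs database is an SQLite file, and that its `self` table stores the hash under the key `contigs_db_hash`. The function names are made up for illustration.

```python
import csv
import sqlite3
from collections import defaultdict
from pathlib import Path


def read_contigs_db_hash(db_path):
    """Return the hash stored in one contigs database, or None if absent.

    Assumption: the `self` table holds key/value rows and one of them
    has the key `contigs_db_hash`.
    """
    # Open read-only so a wrong path raises an error instead of
    # silently creating an empty database file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        row = conn.execute(
            "SELECT value FROM self WHERE key = 'contigs_db_hash'"
        ).fetchone()
    finally:
        conn.close()
    return row[0] if row else None


def find_shared_hashes(external_genomes_path):
    """Map each hash seen more than once to the genome names that carry it.

    Assumption: the file has a header line with `name` and
    `contigs_db_path` columns. Relative database paths are resolved
    against the current working directory.
    """
    names_by_hash = defaultdict(list)
    with open(external_genomes_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            db_hash = read_contigs_db_hash(row["contigs_db_path"])
            names_by_hash[db_hash].append(row["name"])
    return {h: names for h, names in names_by_hash.items() if len(names) > 1}


# Example use (the path is whatever file was passed to `-e`):
#   for db_hash, names in find_shared_hashes("external_genomes.txt").items():
#       print(db_hash, ", ".join(names))
```

An empty result would mean no two databases in the file report the same hash under these assumptions, so pasting the output alongside the external genomes file would show where the duplicate in the error message comes from.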