merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

[BUG] EcoPhylo re-runs sanity checks for every rule on HPC #1993

Closed mschecht closed 1 year ago

mschecht commented 1 year ago

anvi'o version

$ anvi-self-test --version
Anvi'o .......................................: hope (v7.1-dev)

Profile database .............................: 38
Contigs database .............................: 20
Pan database .................................: 16
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2

Detailed description of the issue

I noticed that straightforward rules from EcoPhylo, e.g. anvi_run_hmms_hmmsearch on a single genome, were taking an unreasonably long time to run on the HPC.

Upon further examination, I saw that the log file of an individual anvi_run_hmms_hmmsearch job showed the warnings below while the rule was running, but they do not end up in the log file after the rule is complete. (I'm not sure why the log file gets copied over after the rule is complete.)

$ tail -F ECOPHYLO_WORKFLOW/00_LOGS/anvi_run_hmms_hmmsearch-genome.log
# CLUSTERIZE submitted: 2022-10-09 13:01:37.970509
# command: /project2/meren/PEOPLE/mschechter/SCG_workflow_tutorial/.snakemake/tmp.mta5w4bq/snakejob.anvi_run_hmms_hmmsearch.14.sh

HMM profiles .................................: 9 sources have been loaded:
                                                Ribosomal_RNA_16S (3 genes,
                                                domain: None), Ribosomal_RNA_28S
                                                (1 genes, domain: None),
                                                Ribosomal_RNA_18S (1 genes,
                                                domain: None), Protista_83 (83
                                                genes, domain: eukarya),
                                                Ribosomal_RNA_23S (2 genes,
                                                domain: None), Bacteria_71 (71
                                                genes, domain: bacteria),
                                                Archaea_76 (76 genes, domain:
                                                archaea), Ribosomal_RNA_5S (5
                                                genes, domain: None),
                                                Ribosomal_RNA_12S (1 genes,
                                                domain: None)

WARNING
===============================================
We are initiating parameters for the ecophylo workflow

WARNING
===============================================
Some of your genomes (1 of the 3, to be precise) seem to have no functional
annotation. Since this workflow can only use matching functional annotations
across all genomes involved, having even one genome without any functions means
that there will be no matching function across all. Things will continue to
work, but you will have no functions at the end for your gene clusters.

These warnings ^ are coming from the EcoPhylo __init__.py file. I assumed that this file would only be run ONCE at the beginning of the workflow but I no longer think this is the case.

At least while using anvi-run-workflow -A --cluster (I haven't found a way to test it locally), it appears that the Snakefile runs top to bottom for each job, and thus every time a rule is launched there is a re-initialization here.

Due to this, MetagenomeDescriptions and GenomeDescriptions are re-run for every rule. These classes are used in the EcoPhylo init file to sanity check the incoming external-genomes.txt and metagenomes.txt. The original idea was that the user should be made aware of issues with these files before the workflow begins. However, since this is being re-run for every rule, it's causing a huge bottleneck, especially when external-genomes.txt files contain thousands of genomes.
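To put a rough number on the bottleneck described above, here is a toy model in Python. It is not anvi'o or Snakemake code: `workflow_init` and `checks_performed` are hypothetical names, and "one unit of work per genome" is an assumption standing in for whatever GenomeDescriptions actually does.

```python
def workflow_init(genomes):
    """Hypothetical stand-in for the init-time sanity checks.

    Assumes the cost is one unit of work per genome listed.
    """
    return len(genomes)


def checks_performed(rules, genomes, init_per_job=True):
    """Count per-genome checks for a whole workflow run.

    init_per_job=True models what the issue describes: every submitted
    job re-runs the init code. False models a single init up front.
    """
    if init_per_job:
        return sum(workflow_init(genomes) for _ in rules)
    return workflow_init(genomes)
```

Under these assumptions, 50 jobs over 1,000 genomes perform 50,000 per-genome checks instead of 1,000, which is why the cost grows with both the number of jobs and the size of external-genomes.txt.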

For now, a simple solution is to give the user the option to skip the sanity checks for external-genomes.txt and metagenomes.txt.

I have implemented this in the branch ecophylo-skip-sanity-check. Simply switch run_genomes_sanity_check to false in the config file, and sanity checks for external-genomes.txt and metagenomes.txt will be skipped.
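A minimal sketch of how such a switch could gate the checks, for anyone curious. Only the config key run_genomes_sanity_check comes from the branch; the class `WorkflowInit`, the helper `expensive_sanity_check`, and the default of True are assumptions made for illustration and are not the actual anvi'o code.

```python
def expensive_sanity_check(genome_paths):
    """Hypothetical stand-in for the per-genome validation at init."""
    checked = []
    for path in genome_paths:
        # The real check would open and validate each database here.
        checked.append(path)
    return checked


class WorkflowInit:
    """Hypothetical workflow init that honors the config switch."""

    def __init__(self, config, genome_paths):
        self.checked = []
        # Assumed default: True, so configs without the key still run
        # the checks as before.
        if config.get("run_genomes_sanity_check", True):
            self.checked = expensive_sanity_check(genome_paths)
```

With `{"run_genomes_sanity_check": False}` the init returns immediately without touching any genome, so each job's re-initialization becomes cheap. The trade-off is that problems in the input files are no longer caught up front.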

@ivagljiva this will dramatically speed up large EcoPhylo workflows ^

It would be great if expensive sanity checks could be run outside of anvio snakemake workflow init files since they appear to be re-run for every rule on the cluster. This would require some refactoring and might be more trouble than it's worth.
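One way this could look, sketched in Python under stated assumptions: run the check once and leave a stamp file keyed on the input file's contents, so later initializations can skip it. The names `file_fingerprint`, `check_once`, and the ".sanity-ok" suffix are hypothetical. Note that this only fingerprints the listing file itself, not the databases it points to.

```python
import hashlib
from pathlib import Path


def file_fingerprint(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def check_once(input_file, stamp_dir, check):
    """Run check(input_file) unless a matching stamp already exists.

    Returns True if the check ran, False if it was skipped.
    """
    stamp = Path(stamp_dir) / (Path(input_file).name + ".sanity-ok")
    digest = file_fingerprint(input_file)
    if stamp.exists() and stamp.read_text() == digest:
        return False  # same contents were already validated
    check(input_file)  # expected to raise if the file is invalid
    stamp.parent.mkdir(parents=True, exist_ok=True)
    stamp.write_text(digest)  # written only after a successful check
    return True
```

Because the stamp is written only after `check` returns without raising, a failed check is retried on the next run, and editing external-genomes.txt changes the digest and triggers a fresh check. Jobs starting at the same moment could still race to run the check, so this is a sketch rather than a cluster-safe implementation.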

Any suggestions and comments are most welcome :)

meren commented 1 year ago

Did we not resolve this, @mschecht?

mschecht commented 1 year ago

@meren I started a PR and am working on final tests: #2004

meren commented 1 year ago

I see! Thank you very much for following up on this :)

mschecht commented 1 year ago

PR was merged, we're done here :)