I noticed that straightforward rules from EcoPhylo, e.g. anvi_run_hmms_hmmsearch on a single genome, were taking an unrealistic amount of time to run on the HPC.
Upon further examination, I saw that the log file of an individual anvi_run_hmms_hmmsearch job showed the warnings below while the rule was running, but they no longer appear in the log file once the rule is complete. (I'm not sure why the log file gets copied over after the rule is complete):
$ tail -F ECOPHYLO_WORKFLOW/00_LOGS/anvi_run_hmms_hmmsearch-genome.log
# CLUSTERIZE submitted: 2022-10-09 13:01:37.970509
# command: /project2/meren/PEOPLE/mschechter/SCG_workflow_tutorial/.snakemake/tmp.mta5w4bq/snakejob.anvi_run_hmms_hmmsearch.14.sh
HMM profiles .................................: 9 sources have been loaded:
Ribosomal_RNA_16S (3 genes,
domain: None), Ribosomal_RNA_28S
(1 genes, domain: None),
Ribosomal_RNA_18S (1 genes,
domain: None), Protista_83 (83
genes, domain: eukarya),
Ribosomal_RNA_23S (2 genes,
domain: None), Bacteria_71 (71
genes, domain: bacteria),
Archaea_76 (76 genes, domain:
archaea), Ribosomal_RNA_5S (5
genes, domain: None),
Ribosomal_RNA_12S (1 genes,
domain: None)
WARNING
===============================================
We are initiating parameters for the ecophylo workflow
WARNING
===============================================
Some of your genomes (1 of the 3, to be precise) seem to have no functional
annotation. Since this workflow can only use matching functional annotations
across all genomes involved, having even one genome without any functions means
that there will be no matching function across all. Things will continue to
work, but you will have no functions at the end for your gene clusters.
These warnings ^ are coming from the EcoPhylo __init__.py file. I assumed that this file would only be run ONCE at the beginning of the workflow but I no longer think this is the case.
At least when using anvi-run-workflow -A --cluster (I haven't found a way to test this locally), it appears that the Snakefile runs top to bottom for every job, so every time a rule is launched the workflow is re-initialized.
Because of this, MetagenomeDescriptions and GenomeDescriptions are re-run for every rule. Both are used in the EcoPhylo init file to sanity check the incoming external-genomes.txt and metagenomes.txt. The original idea was that the user should be made aware of issues with these files before the workflow begins. However, since the checks are re-run for every rule, they cause a huge bottleneck, especially when external-genomes.txt contains thousands of genomes.
For now, a simple solution is to give the user the option to skip the sanity checks for external-genomes.txt and metagenomes.txt.
I have implemented this in the branch ecophylo-skip-sanity-check. Simply switch run_genomes_sanity_check to false in the config file, and sanity checks for external-genomes.txt and metagenomes.txt will be skipped.
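To illustrate the idea, here is a minimal sketch of how a config flag can gate an expensive check. Only the flag name run_genomes_sanity_check comes from the branch; the functions expensive_sanity_check and init_workflow are hypothetical stand-ins and are not the actual anvi'o code, and the default of True is an assumption.

```python
def expensive_sanity_check(genome_paths):
    """Hypothetical stand-in for the costly validation of every genome."""
    checked = []
    for path in genome_paths:
        # A real check would open and inspect each file; here we only
        # record that the path was visited.
        checked.append(path)
    return checked


def init_workflow(config, genome_paths):
    """Hypothetical init step: run the check only when the flag allows it.

    The flag is assumed to default to True so that existing configs keep
    the sanity checks unless the user turns them off explicitly.
    """
    if config.get("run_genomes_sanity_check", True):
        return expensive_sanity_check(genome_paths)
    # Flag set to false: skip the check, nothing was validated.
    return []
```

With the flag set to false, every re-initialization triggered by a cluster job returns immediately instead of walking the full list of genomes.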
@ivagljiva this will dramatically speed up large EcoPhylo workflows ^
It would be great if expensive sanity checks could be run outside of the anvi'o snakemake workflow init files, since they appear to be re-run for every rule on the cluster. This would require some refactoring and might be more trouble than it's worth.
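One possible direction, sketched here under assumptions and not taken from anvi'o: record a marker file once a check has passed, keyed on the content of the input file, so that later re-initializations can skip the work. The names file_digest, run_check_once, and the marker directory are hypothetical, and the sketch does not handle several jobs starting at the same moment.

```python
import hashlib
import os


def file_digest(path):
    """Return a SHA-256 digest of the file content."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def run_check_once(input_path, marker_dir, check):
    """Run check(input_path) unless a marker for this exact content exists.

    Returns True if the check ran, False if it was skipped.
    """
    os.makedirs(marker_dir, exist_ok=True)
    # The marker name depends on the file content, so editing the input
    # file produces a new digest and triggers the check again.
    marker = os.path.join(marker_dir, file_digest(input_path) + ".ok")
    if os.path.exists(marker):
        return False
    check(input_path)  # expected to raise if the input is invalid
    with open(marker, "w") as handle:
        handle.write("passed\n")
    return True
```

The first initialization would pay the full cost, and each later job would only hash one small text file. Whether this fits the way the workflow classes are initialized is an open question.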
Any suggestions and comments are most welcome :)