blindner6 closed this issue 6 months ago
Hello,
If you want aviary to just hit the refine_dastool step and then exit, all you should need is to add -w refine_dastool to your command, like so:
aviary recover -1 ${r1} -2 ${r2} -l ${lr} -t 6 -o ${out} -s 5000 --skip-abundances -b 100000 --tmpdir ${tmp} -w refine_dastool
Also see https://github.com/rhysnewell/aviary/issues/185 if you wish to run binning but not singlem. Specific flags will be added in future to exclusively run binning.
Is gtdbtk stalling completely (i.e. no activity on the CPU but the program is still listed as running in top/htop), or is it using CPU but just taking a very long time? If it is the latter, this is likely because you are only providing 6 cores, and gtdbtk is a very CPU-intensive program, especially when assigning many MAGs.
Cheers, Rhys
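One way to tell a true stall from a slow run, without watching top/htop, is to sample a process's cumulative CPU time twice and compare. The following is a minimal sketch, assuming a Linux system with /proc; the function names cpu_ticks and cpu_ticks_delta are hypothetical helpers, not part of aviary or gtdbtk. It inspects a single PID only, so if the tool has spawned child processes, each child PID would need to be checked as well.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of aviary or gtdbtk). Assumes Linux /proc.

# Print the cumulative CPU time (user + system, in clock ticks) used by a PID.
cpu_ticks() {
    local stat
    stat=$(cat "/proc/$1/stat") || return 1
    # Drop the leading "pid (comm) " so a command name containing spaces
    # cannot shift the remaining fields.
    stat=${stat##*) }
    # After stripping, field 1 is the process state; utime and stime
    # (fields 14 and 15 of the full line) are now fields 12 and 13.
    set -- $stat
    echo $(( ${12} + ${13} ))
}

# Print how many CPU ticks a PID consumed over an interval (default 5 s).
# A result of 0 means the process did no work during that window.
cpu_ticks_delta() {
    local pid="$1" interval="${2:-5}" before after
    before=$(cpu_ticks "$pid") || return 1
    sleep "$interval"
    after=$(cpu_ticks "$pid") || return 1
    echo $(( after - before ))
}
```

For example, `cpu_ticks_delta 12345 10` (with 12345 standing in for the PID of interest) prints the ticks consumed over ten seconds. A value at or near zero on every related process suggests a genuine stall rather than a slow step.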
Hey Rhys,
Thanks for the quick reply!
gtdbtk is stalling, and at an early step that's not terribly computationally intensive compared to later steps like building trees and getting distances. The stall occurs when identifying TIGRFAM proteins in the genes predicted by prodigal. In my past experience (admittedly usually with 10+ threads) this step runs at a rate of about 80-100 genomes/minute, and these recent aviary runs only had 300-400 genomes in final_bins/, but the gtdbtk step ran for >1 day and the log file for gtdbtk reads:
[2023-12-21 10:19:04] INFO: gtdbtk classify_wf --skip_ani_screen --cpus 6 --pplacer_cpus 6 --extension fna --genome_dir bins/final_bins --out_dir data/gtdbtk
[2023-12-21 10:19:04] INFO: Using GTDB-Tk reference data version r214: /storage/coda1/p-ktk3/0/shared/rich_project_bio-konstantinidis/shared3/DB/aviary/logs/db/gtdb
[2023-12-21 10:19:04] INFO: Identifying markers in 113 genomes with 6 threads.
[2023-12-21 10:19:04] TASK: Running Prodigal V2.6.3 to identify genes.
[2023-12-21 10:20:45] INFO: Completed 113 genomes in 1.67 minutes (67.62 genomes/minute).
[2023-12-21 10:20:45] TASK: Identifying TIGRFAM protein families.
The last-modified timestamp of gtdbtk.log is long before the job ended (the job was killed the next day when the requested walltime was reached).
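The comparison between a log's timestamp and the current time can be scripted. This is a minimal sketch, assuming GNU coreutils `stat` and `date` (as on most Linux clusters); the function name minutes_since_modified is a hypothetical helper, and the path to gtdbtk.log depends on your own output directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of aviary or gtdbtk).
# Assumes GNU coreutils `stat` and `date`.

# Print the number of whole minutes since a file was last modified.
minutes_since_modified() {
    local file="$1" now mtime
    # %Y is the modification time in seconds since the epoch (GNU stat).
    mtime=$(stat -c %Y -- "$file") || return 1
    now=$(date +%s)
    echo $(( (now - mtime) / 60 ))
}
```

Calling `minutes_since_modified path/to/gtdbtk.log` while the job is still running shows how long the log has been silent. A large and growing value points to a stalled step, although a step that logs rarely could look the same.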
But this isn't really a big deal, i.e., if other users aren't reporting this, I suspect it's just an issue on my end and not worth your time. Mostly I'm interested in using aviary recover up to the point of producing refined bins, meaning what I think I want is just to run all steps up to and including refine_dastool. I think what you've suggested above just runs that step alone? I'll play with it again today, but it seems like I just need to pass all of the steps I want run to -w? I'll also take a look at the issue you linked in the event that it also addresses my question.
Thanks, Blake
I was wrong, -w refine_dastool is exactly the behavior I wanted. Thanks for your help!
Hey all,
Sorry for the funky title! A quick question on having aviary recover stop short of the workflow's last 3 steps:
Three of my most recent runs of aviary recover are hanging indefinitely at the gtdbtk step (40/43), which is unfortunately churning through an awful amount of CPU time. In all cases, identifying TIGRFAM proteins is where gtdbtk stalls (but I don't mind this, see further below).
Here is how I am calling aviary recover:
aviary recover -1 ${r1} -2 ${r2} -l ${lr} -t 6 -o ${out} -s 5000 --skip-abundances -b 100000 --tmpdir ${tmp}
Yet, I don't need the final steps from aviary recover (i.e., taxonomic assignments, dereplicated bin set). Is there an easy way I can call aviary recover (or some other instance of the wrapper) such that it stops the usual aviary recover workflow before the gtdbtk and subsequent steps? That is, I just want to run through refine_dastool (step 39) and exit. I could manually check my logs and kill the process once it concludes with the steps I'm interested in, but I'm quite sure there are much better ways to go about this. I've looked at options for -w and --snakemake-cmds but I'm unfortunately still a bit mystified by snakemake. Any insights/guidance you're able to provide would be appreciated!