rhysnewell / aviary

A hybrid assembly and MAG recovery pipeline (and more!)
GNU General Public License v3.0

aviary recover stop after bin refinement / pipeline stalling indefinitely at GTDBtk step #190

Closed: blindner6 closed this issue 6 months ago

blindner6 commented 6 months ago

Hey all,

Sorry for the funky title! A quick question on having aviary recover stop short of the workflow's last 3 steps:

Three of my most recent runs of aviary recover are hanging indefinitely at the gtdbtk step (40/43), which is unfortunately churning through an awful lot of CPU time. In all cases, gtdbtk stalls while identifying TIGRFAM proteins (but I don't mind this, see further below).

Here is how I am calling aviary recover:

aviary recover -1 ${r1} -2 ${r2} -l ${lr} -t 6 -o ${out} -s 5000 --skip-abundances -b 100000 --tmpdir ${tmp}

However, I don't need the final steps from aviary recover (i.e., taxonomic assignments, dereplicated bin set). Is there an easy way I can call aviary recover (or some other instance of the wrapper) such that it stops the usual aviary recover workflow before the gtdbtk and subsequent steps? That is, I just want to run through refine_dastool (step 39) and exit. I could manually check my logs and kill the process once it finishes the steps I'm interested in, but I'm quite sure there are much better ways to go about this. I've looked at the options for -w and --snakemake-cmds, but I'm unfortunately still a bit mystified by snakemake. Any insights/guidance you're able to provide would be appreciated!

rhysnewell commented 6 months ago

Hello,

If you want aviary to just hit the refine_dastool step and then exit, all you should need to do is add -w refine_dastool to your command, like so:

aviary recover -1 ${r1} -2 ${r2} -l ${lr} -t 6 -o ${out} -s 5000 --skip-abundances -b 100000 --tmpdir ${tmp} -w refine_dastool

Also see https://github.com/rhysnewell/aviary/issues/185 if you wish to run binning but not singlem. Specific flags will be added in future to exclusively run binning.

Is gtdbtk stalling completely (i.e. no activity on the CPU but the program is still listed as running in top/htop), or is it using CPU but just taking a very long time? If it is the latter, this is likely because you are only providing 6 cores and gtdbtk is a very CPU-intensive program, especially when assigning many MAGs.

Cheers, Rhys
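
One way to answer the question above (stalled with no CPU activity versus slow but busy) from the command line is to sample a process's cumulative CPU time twice and compare the two readings. The following is only a rough sketch, not part of aviary or gtdbtk: the function name cpu_time_changed is hypothetical, and it assumes a POSIX shell with a ps that supports -o time= and -p. Note that gtdbtk may do its work in child processes, so the PID worth checking is an assumption you would need to confirm with top/htop.

```shell
# Sketch: decide whether a process consumed CPU during a sampling interval.
# cpu_time_changed is a hypothetical helper, not an aviary or gtdbtk command.
# Returns 0 if the cumulative CPU time changed, 1 if it did not,
# and 2 if the process could not be queried.
cpu_time_changed() {
    pid=$1
    interval=${2:-10}   # seconds between the two samples (default 10)

    # Cumulative CPU time as reported by ps; fails if the PID does not exist.
    before=$(ps -o time= -p "$pid" 2>/dev/null) || return 2
    sleep "$interval"
    after=$(ps -o time= -p "$pid" 2>/dev/null) || return 2

    # Identical readings mean (almost) no CPU was used in the interval.
    [ "$before" != "$after" ]
}

# Example usage (replace 12345 with a PID taken from top/htop):
#   if cpu_time_changed 12345 30; then echo "busy"; else echo "idle"; fi
```

Because ps typically reports CPU time at a resolution of about a second, a sampling interval of 10 seconds or more gives a more reliable answer than a very short one.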

blindner6 commented 6 months ago

Hey Rhys,

Thanks for the quick reply!

gtdbtk is stalling, and at an early step that's not terribly computationally intensive compared to later steps like building trees and getting distances. The stall occurs when identifying TIGRFAM proteins in the genes predicted by prodigal. In my past experience (admittedly usually with 10+ threads), this step runs at a rate of about 80-100 genomes/minute. These recent aviary runs only had 300-400 genomes in final_bins/, but the gtdbtk step ran for >1 day, and the log file for gtdbtk reads:

[2023-12-21 10:19:04] INFO: gtdbtk classify_wf --skip_ani_screen --cpus 6 --pplacer_cpus 6 --extension fna --genome_dir bins/final_bins --out_dir data/gtdbtk
[2023-12-21 10:19:04] INFO: Using GTDB-Tk reference data version r214: /storage/coda1/p-ktk3/0/shared/rich_project_bio-konstantinidis/shared3/DB/aviary/logs/db/gtdb
[2023-12-21 10:19:04] INFO: Identifying markers in 113 genomes with 6 threads.
[2023-12-21 10:19:04] TASK: Running Prodigal V2.6.3 to identify genes.
[2023-12-21 10:20:45] INFO: Completed 113 genomes in 1.67 minutes (67.62 genomes/minute).
[2023-12-21 10:20:45] TASK: Identifying TIGRFAM protein families.

The last-modified timestamp on gtdbtk.log is from long before the job ended (the job was killed the next day when the requested walltime was reached).

But this isn't really a big deal, i.e., if other users aren't reporting this, I suspect it's just an issue on my end and not worth your time. Mostly I'm interested in using aviary recover up to the point of producing refined bins, meaning what I think I want is just to run all steps up to and including refine_dastool. I think what you've suggested above just runs that step alone? I'll play with it again today, but it seems like I just need to pass all of the steps I want run to -w? I'll also take a look at the issue you linked in case it also addresses my question.

Thanks, Blake
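
The observation above, that gtdbtk.log stopped being updated long before the job was killed, can be turned into a simple staleness check so a stalled step is noticed before the walltime runs out. This is a minimal sketch only: log_age_seconds and log_is_stale are hypothetical helper names, the one-hour threshold is arbitrary, and the code assumes GNU coreutils stat (as found on most Linux clusters). The log path in the usage comment is an assumption based on the --out_dir shown in the log excerpt.

```shell
# Sketch: detect a log file that has not been written to for a while.
# Both functions are hypothetical helpers, not part of aviary or gtdbtk.
# Assumes GNU coreutils `stat`; BSD/macOS stat uses different flags.

# Print how many seconds ago the file was last modified.
log_age_seconds() {
    now=$(date +%s)
    mtime=$(stat -c %Y "$1" 2>/dev/null) || return 2
    echo $((now - mtime))
}

# Succeed (return 0) when the file is older than the given number of seconds.
log_is_stale() {
    age=$(log_age_seconds "$1") || return 2
    [ "$age" -gt "$2" ]
}

# Example usage (the path is an assumption, adjust to your output directory):
#   if log_is_stale data/gtdbtk/gtdbtk.log 3600; then
#       echo "gtdbtk.log has not changed in over an hour"
#   fi
```

A quiet log alone does not prove a stall, since a long-running step may simply not log anything for a while, so it is worth combining this with a check of CPU activity.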

blindner6 commented 6 months ago

I was wrong, -w refine_dastool is exactly the behavior I wanted. Thanks for your help!