steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0
814 stars, 100 forks

Force reuse for large jobs #311

Open dpretorius opened 3 months ago

dpretorius commented 3 months ago

Expected Behavior

I am running a large all-against-all alignment (40580 vs 40580 proteins). This is taking a long time, and I would like to reuse information from a previous run once the wall-time limit is reached on my HPC.

Current Behavior

Even with the flags --remove-tmp-files 0 and --force-reuse 1, it seems that the information from the alignment is not used in the next run, with the aln files in the tmp directories being wiped to 0.

Steps to Reproduce (for bugs)

#!/bin/bash
#PBS -l select=1:ncpus=200:mem=50gb
#PBS -l walltime=72:00:00

module load anaconda3/personal
source activate foldseek

cd $PBS_O_WORKDIR

input_path="/my_database"
output_file="aln"
temp_dir="tmp"

foldseek easy-search $input_path $input_path $output_file $temp_dir \
    --alignment-type 1 \
    --tmscore-threshold 0.0 \
    --format-output "query,target,alntmscore,u,t" \
    --exhaustive-search 1 \
    -e inf \
    --threads 200 \
    --remove-tmp-files 0 \
    --force-reuse 1

What can I do to reuse alignment information during the tmalign stage in successive runs?

milot-mirdita commented 3 months ago

What step is it currently at? Could you paste the terminal output up to this point?

milot-mirdita commented 3 months ago

Ah sorry, I misread the post. Foldseek will not reuse any results from within a module invocation, only between two module invocations. So if it ran and completed a long prefilter step over a few days, and you cancel it during the alignment stage, it will start from the beginning of the alignment stage again.
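To illustrate the difference between reuse between modules and reuse within a module, here is a minimal shell sketch of step-level checkpointing. It is not Foldseek's actual implementation: the `run_step` helper, the `.done` marker files and the step names are hypothetical, and `true`/`false` stand in for a module that finishes or gets interrupted.

```shell
#!/bin/bash
# Hypothetical illustration of module-level checkpointing; not Foldseek's code.
set -u

workdir="$(mktemp -d)"

# Run a step only if its completion marker does not exist yet.
run_step() {
    local name="$1"; shift
    local marker="$workdir/$name.done"
    if [ -e "$marker" ]; then
        echo "skip $name (already finished)"
        return 0
    fi
    echo "run $name"
    # The marker is written only after the command succeeds, so an
    # interrupted step leaves no marker and is repeated from its start.
    "$@" && touch "$marker"
}

# First run: prefilter completes, alignment is interrupted (simulated with false).
run_step prefilter true
run_step align false || echo "align interrupted"

# Second run: prefilter is skipped, alignment starts again from the beginning.
run_step prefilter true
run_step align true
```

In this sketch the second run skips the finished prefilter step but repeats the alignment step in full, because nothing records how far the interrupted step got.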

dpretorius commented 3 months ago

Hi Milot,

Thanks for such a prompt response! This tool & the team are so important to my work!

I see, so each module has to run to completion, and only then can its results be reused afterwards.