question about optimizing foldseek for metagenomics gene catalogue

szimmerman92 commented 4 months ago

Hi,

I have a little over 1 million proteins from a metagenome I would like to perform an easy-search on Alphafold/UniProt50. I am running foldseek on Google Cloud so I am fairly flexible on CPUs and RAM, but ideally would like to keep costs down. I have tried running foldseek with over 600 GBs of memory but it still dies during prefiltering.

What are suggestions for optimizing foldseek on such a large query? I tried the command --sort-by-structure-bits 0 . I could try --prefilter-mode 1 but that only seems to be effective on smaller queries. In this dataset, I also have some proteins larger than 1000 amino acids long. Would you recommend I remove those?

I also tried to run createindex --split-memory-limit 60G database_afuniprot50 temp to make the database use a little less memory but I got the error "Database database_afuniprot50 needs header information". Do you know why this occured?

Is this too much data to use foldseek on? Any help would be appreciated and if you need more information please let me know.

Below please find the command I run and the output with a machine of 624 GBs of memory and 49 CPU cores.

Thank you very much for this incredible software.

Best, Sam

foldseek easy-search --sort-by-structure-bits 0 /home/zimmerma/fhs_prostt5_foldseek_db/fhs_db /mnt/disks/mydisk/database_afuniprot50/afuniprot50 foldseek_results_fhs_aa temp

easy-search --sort-by-structure-bits 0 /home/zimmerma/fhs_prostt5_foldseek_db/fhs_db /mnt/disks/mydisk/database_afuniprot50/afuniprot50 foldseek_results_fhs_aa temp

MMseqs Version: 3bf3cdf59810dc9127601ea343d4396e46036942 Seq. id. threshold 0 Coverage threshold 0 Coverage mode 0 Max reject 2147483647 Max accept 2147483647 Add backtrace false TMscore threshold 0 TMalign hit order 0 TMalign fast 1 Preload mode 0 Threads 96 Verbosity 3 LDDT threshold 0 Sort by structure bit score 0 Alignment type 2 Substitution matrix aa:3di.out,nucl:3di.out Alignment mode 3 Alignment mode 0 E-value threshold 10 Min alignment length 0 Seq. id. mode 0 Alternative alignments 0 Max sequence length 65535 Compositional bias 1 Compositional bias 1 Gap open cost aa:10,nucl:10 Gap extension cost aa:1,nucl:1 Compressed 0 Seed substitution matrix aa:3di.out,nucl:3di.out Sensitivity 9.5 k-mer length 6 Target search mode 0 k-score seq:2147483647,prof:2147483647 Max results per query 1000 Split database 0 Split mode 2 Split memory limit 0 Diagonal scoring true Exact k-mer matching 0 Mask residues 0 Mask residues probability 0.99995 Mask lower case residues 1 Minimum diagonal score 30 Selected taxa
Spaced k-mers 1 Spaced k-mer pattern
Local temporary path
Exhaustive search mode false Prefilter mode 0 Search iterations 1 Remove temporary files true MPI runner
Force restart with latest tmp false Cluster search 0 Chain name mode 0 Write mapping file 0 Mask b-factor threshold 0 Coord store mode 2 Write lookup file 1 Tar Inclusion Regex . Tar Exclusion Regex ^$ Input format 0 File Inclusion Regex . File Exclusion Regex ^$ Alignment format 0 Format alignment output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits Database output false Greedy best hits false

Create directory temp/5295610052143856966/search_tmp search /home/zimmerma/fhs_prostt5_foldseek_db/fhs_db /mnt/disks/mydisk/database_afuniprot50/afuniprot50 temp/5295610052143856966/result temp/5295610052143856966/search_tmp --sort-by-structure-bits 0 --alignment-mode 3 --comp-bias-corr 1 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 -s 9.5 -k 6 --mask 0 --mask-prob 0.99995 --remove-tmp-files 1

prefilter /home/zimmerma/fhs_prostt5_foldseek_db/fhs_db_ss /mnt/disks/mydisk/database_afuniprot50/afuniprot50_ss temp/5295610052143856966/search_tmp/13789935573118591569/pref --sub-mat 'aa:3di.out,nucl:3di.out' --seed-sub-mat 'aa:3di.out,nucl:3di.out' -s 9.5 -k 6 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 1000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 0.15 --diag-score 1 --exact-kmer-matching 0 --mask 0 --mask-prob 0.99995 --mask-lower-case 1 --min-ungapped-score 30 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 96 --compressed 0 -v 3

Query database size: 1073247 type: Aminoacid Estimated memory consumption: 337G Target database size: 53665860 type: Aminoacid Index table k-mer threshold: 78 at k-mer size 6 Index table: counting k-mers [=================================================================] 100.00% 53.67M 36s 22ms
Index table: Masked residues: 0 Index table: fill [=================================================================] 100.00% 53.67M 1m 4s 179ms
Index statistics Entries: 11018018466 DB size: 63533 MB Avg k-mer size: 172.156539 Top 10 k-mers DDDDDD 25475954 DDDDDP 22448571 DDDDPP 19428803 VVVVPD 17790099 DPVVVV 16769600 SVSVVV 14134938 SVVVVV 13707739 DDVVVV 12394056 LVVVVV 11880759 VSVVVV 10654575 Time for index table init: 0h 4m 32s 714ms Process prefiltering step 1 of 1

k-mer similarity threshold: 78 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 1073247 Target db start 1 to 53665860 Killed ] 0.00% 1 eta - Error: Kmer matching step died Error: Search died

milot-mirdita commented 4 months ago

We had an issue where we could lose good hits and implemented a fix for correctness. However, that resulted in potentially huge memory allocations for some queries matching many target sequences.

We are aware of the issue and are thinking how to fix this. In the meantime, try forcing the use of 7-mers instead of 6-mers with -k 7 and using less threads --threads 32 (currently it seems to be using 96 threads). The issue is the per thread memory consumption, not the index.

I would recommend to avoid creating a precomputed index with createindex, especially when you have many queries. That comes with very different trade-offs.

szimmerman92 commented 4 months ago

Hi Milot,

Thank you very much for the quick response. I added the arguments you suggested but it still crashed. Here is the command I ran foldseek easy-search -k 7 --threads 32 /home/zimmerma/fhs_prostt5_foldseek_db/fhs_db /mnt/disks/mydisk/database_afuniprot50/afuniprot50 foldseek_results_fhs_aa temp

And here is the output. Any more suggestions that you think might work? Do you think it is worth trying to use the previous version before the update?

Thanks again!

Best, Sam

easy-search -k 7 --threads 32 /home/zimmerma/fhs_prostt5_foldseek_db/fhs_db /mnt/disks/mydisk/database_afuniprot50/afuniprot50 foldseek_results_fhs_aa temp

MMseqs Version: 250319672bedf19eb0c80804ee92a3c7087a5f0f Seq. id. threshold 0 Coverage threshold 0 Coverage mode 0 Max reject 2147483647 Max accept 2147483647 Add backtrace false TMscore threshold 0 TMalign hit order 0 TMalign fast 1 Preload mode 0 Threads 32 Verbosity 3 LDDT threshold 0 Sort by structure bit score 1 Alignment type 2 Substitution matrix aa:3di.out,nucl:3di.out Alignment mode 3 Alignment mode 0 E-value threshold 10 Min alignment length 0 Seq. id. mode 0 Alternative alignments 0 Max sequence length 65535 Compositional bias 1 Compositional bias 1 Gap open cost aa:10,nucl:10 Gap extension cost aa:1,nucl:1 Compressed 0 Seed substitution matrix aa:3di.out,nucl:3di.out Sensitivity 9.5 k-mer length 7 Target search mode 0 k-score seq:2147483647,prof:2147483647 Max results per query 1000 Split database 0 Split mode 2 Split memory limit 0 Diagonal scoring true Exact k-mer matching 0 Mask residues 0 Mask residues probability 0.99995 Mask lower case residues 1 Minimum diagonal score 30 Selected taxa
Spaced k-mers 1 Spaced k-mer pattern
Local temporary path
Exhaustive search mode false Prefilter mode 0 Search iterations 1 Remove temporary files true MPI runner
Force restart with latest tmp false Cluster search 0 Chain name mode 0 Write mapping file 0 Mask b-factor threshold 0 Coord store mode 2 Write lookup file 1 Tar Inclusion Regex . Tar Exclusion Regex ^$ Input format 0 File Inclusion Regex . File Exclusion Regex ^$ Alignment format 0 Format alignment output query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits Database output false Greedy best hits false

search /home/zimmerma/fhs_prostt5_foldseek_db/fhs_db /mnt/disks/mydisk/database_afuniprot50/afuniprot50 temp/12968833095025548755/result temp/12968833095025548755/search_tmp --threads 32 --alignment-mode 3 --comp-bias-corr 1 --gap-open aa:10,nucl:10 --gap-extend aa:1,nucl:1 -s 9.5 -k 7 --mask 0 --mask-prob 0.99995 --remove-tmp-files 1

prefilter /home/zimmerma/fhs_prostt5_foldseek_db/fhs_db_ss /mnt/disks/mydisk/database_afuniprot50/afuniprot50_ss temp/12968833095025548755/search_tmp/12349765448822118560/pref --sub-mat 'aa:3di.out,nucl:3di.out' --seed-sub-mat 'aa:3di.out,nucl:3di.out' -s 9.5 -k 7 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 1000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 0.15 --diag-score 1 --exact-kmer-matching 0 --mask 0 --mask-prob 0.99995 --mask-lower-case 1 --min-ungapped-score 30 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 32 --compressed 0 -v 3

Query database size: 1073247 type: Aminoacid Estimated memory consumption: 185G Target database size: 53665860 type: Aminoacid Index table k-mer threshold: 90 at k-mer size 7 Index table: counting k-mers [=================================================================] 100.00% 53.67M 1m 15s 39ms
Index table: Masked residues: 0 Index table: fill [=================================================================] 100.00% 53.67M 3m 15s 8ms
Index statistics Entries: 11335407711 DB size: 74627 MB Avg k-mer size: 8.855787 Top 10 k-mers DDDDDDD 22900829 DDDDDDP 20036462 DDDDDPP 16763068 DPVVVVV 15787301 VVVVVPP 13227255 DDVVVVV 13114181 SVVVVVV 12197158 LVVVVVV 9253579 DDPVVVV 8343123 VLVVVVV 7762948 Time for index table init: 0h 7m 57s 22ms Process prefiltering step 1 of 1

k-mer similarity threshold: 90 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 1073247 Target db start 1 to 53665860 Killed ] 0.00% 1 eta - Error: Kmer matching step died Error: Search died

martin-steinegger commented 4 months ago

@szimmerman92 could you please try the newest version (commit f629bbe108595ebf0886dd6974e9959a4d962af3)? I implemented a different strategy that avoids reallocations in the prefilter.

szimmerman92 commented 3 months ago

Hi Martin,

Sorry for the late response. I am going to try the commit you made above. How do you install foldseek from source or using that commit?

Sorry for this very basic question.

Thanks again.

Best, Sam

milot-mirdita commented 3 months ago

You can download precompiled binaries here: https://mmseqs.com/foldseek

Otherwise follow the instructions in the wiki to compile from source.

szimmerman92 commented 3 months ago

Thank you very much for the instructions. Foldseek no longer dies when running but it seems like no progress is being made. It hangs during the step "Starting prefiltering scores calculation (Step 1 of 2).

This is what the output looks like so far.

prefilter /home/zimmerma/fhs_prostt5_foldseek_db/fhs_db_ss /mnt/disks/mydisk/database_afuniprot50/afuniprot50_ss temp/13593466129177909235/search_tmp/4380869008468293709/pref --sub-mat 'aa:3di.out,nucl:3di.out' --seed-sub-mat 'aa:3di.out,nucl:3di.out' -s 9.5 -k 6 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 1000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 0.15 --diag-score 1 --exact-kmer-matching 0 --mask 0 --mask-prob 0.99995 --mask-lower-case 1 --min-ungapped-score 30 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 16 --compressed 0 -v 3

Query database size: 1073247 type: Aminoacid Target split mode. Searching through 2 splits Estimated memory consumption: 69G Target database size: 53665860 type: Aminoacid Process prefiltering step 1 of 2

Index table k-mer threshold: 78 at k-mer size 6 Index table: counting k-mers [=================================================================] 100.00% 26.82M 2m 56s 668ms Index table: Masked residues: 0 Index table: fill [=================================================================] 100.00% 26.82M 4m 41s 335ms

Index statistics Entries: 5505266600 DB size: 31989 MB Avg k-mer size: 86.019791 Top 10 k-mers DDDDDD 12729457 DDDDDP 11218285 DDDDPP 9710346 VVVVPD 8890212 DPVVVV 8379919 SVSVVV 7066127 SVVVVV 6849741 DDVVVV 6193142 LVVVVV 5936668 VSVVVV 5325698 Time for index table init: 0h 11m 42s 309ms k-mer similarity threshold: 78 Starting prefiltering scores calculation (step 1 of 2) Query db start 1 to 1073247 Target db start 1 to 26821505 ] 0.00% 1 eta -

Its hard to tell if its just slowly making progress or completely hanging. Is there a way to tell?

Thank you for all your help!

Sam

milot-mirdita commented 3 months ago

It's making very slow progress. This search is quite large. I would recommend to run on more CPU cores. For comparison 340K vs 50M that I did recently took a couple days on 128 cores.

steineggerlab / foldseek

question about optimizing foldseek for metagenomics gene catalogue #240