Open EvanKomp opened 2 weeks ago
25 million MSAs is quite considerable. Following is a brain dump of some things to consider. Sorry it’s not super organized.
You can scale down sensitivity. -s 6
might be sufficient and much faster than the default.
You can consider searching only against the uniref and omitting the colabfold db.
I think a batch size of 50k-100k queries is probably going to work to be sufficient.
I have also recently tweaked the setup_databases.sh script to not store the databases in a compressed form, this will increase RAM use but avoid constant decompression. However, I’ve not commited this change yet.
Storage space is likely going to be one of the bigger issues, since 25M MSA will be quite large.
You might want to extract the mmseqs shell commands from the colabfold_search python script (https://github.com/sokrypton/ColabFold/blob/main/colabfold/mmseqs/search.py) so you have an easier time tweaking them.
One option might be to run all searches in batches, but not run the final result2msa and unpackdb steps yet, as they are comparatively cheap and the filesize increases a lot from the internal result format to the MSA a3m format. You can also run another mmseqs module here (filterdb --extract-lines) to return only a maximum number hits to be converted to a MSA.
Currently, colabfold_search materializes each MSA as an individual file to the file system (through the unpackdb) call. We have been meaning to make colabfold_batch work directly with the internal database format. I assume you don’t actually want to ran 25M structure predictions and instead want to run something else that is faster. You might want to implement something that reads from the MMseqs2 databases directly, this will save a lot of filesystem headache.
@milot-mirdita Thanks for your response! This is good stuff. I hadn't considered the point about data size, though I do have a number of TB to work with.
Re. batching, is there a built in module to chunk up the query db or should I manually split my fasta and make seperate DBs? I assume splitdb
?
Here is my current list of commands in hard coded format:
# UniRef search
mmseqs search queryDB /datasets/UniprotKB/colabfold_data/uniref30_2302_db outdir/res outdir/tmp --num-iterations 3 --db-load-mode 1 -a -e 0.0001 --max-seqs 25 --prefilter-mode 0 -s 1 --threads 4
# Move profile
mmseqs mvdb outdir/tmp/latest/profile_1 outdir/prof_res
mmseqs lndb queryDB_h outdir/prof_res_h
# Expand and align UniRef results
mmseqs expandaln queryDB /datasets/UniprotKB/colabfold_data/uniref30_2302_db outdir/res /datasets/UniprotKB/colabfold_data/uniref30_2302_db outdir/res_exp --expansion-mode 0 -e inf --expand-filter-clusters 1 --max-seq-id 0.95 --db-load-mode 1 --threads 4
mmseqs align outdir/prof_res /datasets/UniprotKB/colabfold_data/uniref30_2302_db outdir/res_exp outdir/res_exp_realign --db-load-mode 1 -e 10 --max-accept 1000 --threads 4 --alt-ali 10 -a
# Filter UniRef results
mmseqs filterresult queryDB /datasets/UniprotKB/colabfold_data/uniref30_2302_db outdir/res_exp_realign outdir/res_exp_realign_filter --db-load-mode 1 --qid 0 --qsc 0.8 --diff 0 --threads 4 --max-seq-id 1.0 --filter-min-enable 100
# Convert UniRef results to MSA
mmseqs result2msa queryDB /datasets/UniprotKB/colabfold_data/uniref30_2302_db outdir/res_exp_realign_filter outdir/uniref.a3m --msa-format-mode 6 --db-load-mode 1 --threads 4 --filter-msa 1 --filter-min-enable 1000 --diff 256 --qid 0.0,0.2,0.4,0.6,0.8,1.0 --qsc 0 --max-seq-id 0.95
# Environmental search
mmseqs search outdir/prof_res /datasets/UniprotKB/colabfold_data/colabfold_envdb_202108_db outdir/res_env outdir/tmp/env --num-iterations 3 --db-load-mode 1 -a -e 0.0001 --max-seqs 25 --prefilter-mode 0 -s 1 --threads 4
# Expand and align environmental results
mmseqs expandaln outdir/prof_res /datasets/UniprotKB/colabfold_data/colabfold_envdb_202108_db outdir/res_env /datasets/UniprotKB/colabfold_data/colabfold_envdb_202108_db outdir/res_env_exp -e inf --expansion-mode 0 --db-load-mode 1 --threads 4
mmseqs align outdir/tmp/env/latest/profile_1 /datasets/UniprotKB/colabfold_data/colabfold_envdb_202108_db outdir/res_env_exp outdir/res_env_exp_realign --db-load-mode 1 -e 10 --max-accept 1000 --threads 4 --alt-ali 10 -a
# Filter environmental results
mmseqs filterresult queryDB /datasets/UniprotKB/colabfold_data/colabfold_envdb_202108_db outdir/res_env_exp_realign outdir/res_env_exp_realign_filter --db-load-mode 1 --qid 0 --qsc 0.8 --diff 0 --max-seq-id 1.0 --threads 4 --filter-min-enable 100
# Convert environmental results to MSA
mmseqs result2msa queryDB /datasets/UniprotKB/colabfold_data/colabfold_envdb_202108_db outdir/res_env_exp_realign_filter outdir/env.a3m --msa-format-mode 6 --db-load-mode 1 --threads 4 --filter-msa 1 --filter-min-enable 1000 --diff 256 --qid 0.0,0.2,0.4,0.6,0.8,1.0 --qsc 0 --max-seq-id 0.95
# Merge results
mmseqs mergedbs queryDB outdir/final.a3m outdir/uniref.a3m outdir/env.a3m
# Unpack results
mmseqs unpackdb outdir/final.a3m outdir --unpack-name-mode 0 --unpack-suffix .a3m
As it stands, I am not converting from DB to a3m until after the filter (with diff =256) so maybe MSA sizes will be okay? Any thoughts?
What I changed from colabfold default, but plan to "search" over for MSA quality and run time: e_value (for search) -> 1e-4 max_seqs (for search) -> 25 max_accept (for alignment) diff -> (filter) 256
Thanks again.
EDIT: Since I am doing an iterative search, would we expect an indexed and in memory target DB to speed up search? I am gathering no based on other conversations given that my use case is large batches and not single queries?
EDIT2: Re. the purpose - correct, I have no need to run structure prediction. The MSAs are going into MSATransformer, so there is no point for them to be extremely large or contain remote fragments.
Context
I need to create MSAs for a very large set of protein sequences: about 25 million.
I was planning to use the ColabFold workflow. I figured that this would be achievable given the nonlinear scaling to large query sets. That being said, extremely remote alignments are not necessary for my use case in the same way they they are helpful for structure prediction. I am looking for relative small MSAs (no more than 256 sequences) of diverse sets that do not have small fragments eg. high coverage.
I had intended to run some scaling tests over query size as well as parameters (first thoughts being sensitivity, max_seqs, align eval, max_except), as well as not using the metagenomic database.
I figured I would first chat with experts and save some compute carbon before doing this. Is there any params I am missing? Any that would be good to change from default and forget? Am I totally off in thinking my job is accessible with a 104 thread compute note and a week of runtime?