sokrypton / ColabFold

Making Protein folding accessible to all!
MIT License

colabfold_batch connects to MMseqs2 server despite MSA input #563

Open Nuta0 opened 7 months ago

Nuta0 commented 7 months ago

Expected Behavior

I want to generate MSAs for a batch of heterodimers using colabfold_search and then predict structures from those MSAs using colabfold_batch, without using the MMseqs2 server for MSA generation.

Current Behavior

colabfold_search is generating MSAs as expected. However, colabfold_batch appears to use the server to generate MSAs again, even though I point it to the already generated a3m files. I am not sure whether it is actually connecting to the server, but it does seem to waste time waiting for something, since it outputs PENDING.

Steps to Reproduce (for bugs)

Here is the input I use for colabfold_search

module load Miniconda3/22.11.1-1
eval "$(conda shell.bash hook)"
conda activate /data/gpfs/projects/punim1869/shared_bin/localcolabfold/colabfold-conda
module load MMseqs2/15-6f452
colabfold_search input.fasta /data/gpfs/datasets/mmseqs/uniref30_2302 msas

and then colabfold_batch

module load Miniconda3/22.11.1-1
module load CUDA/12.2.0
eval "$(conda shell.bash hook)"
conda activate /data/gpfs/projects/punim1869/shared_bin/localcolabfold/colabfold-conda
colabfold_batch --amber --templates --use-gpu-relax msas predictions

ColabFold Output (for bugs)

Here are the first few lines of the output and of log.txt when I run colabfold_batch.

2024-01-26 12:25:04.257622: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-01-26 12:25:05.750801: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1956] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
COMPLETE: 100%|██████████| 300/300 [elapsed: 00:14 remaining: 00:00]
PENDING:   0%|          | 0/300 [elapsed: 00:01 remaining: ?]
2024-01-26 12:25:00,517 Running colabfold 1.5.5 (941feece178db14c9af1580eefbf4a8fe4e5b5af)
2024-01-26 12:25:05,724 Running on GPU
2024-01-26 12:25:13,220 Found 9 citations for tools or databases
2024-01-26 12:25:13,220 Query 1/100: Heterodimer (length 261)
2024-01-26 12:25:17,176 Sleeping for 8s. Reason: PENDING
2024-01-26 12:26:05,156 Sequence 0 found templates: ['Xxx8']
2024-01-26 12:26:05,156 Sequence 1 found no templates


YoshitakaMo commented 7 months ago

If you turn on the --templates flag, colabfold_batch will try to connect to the MSA server to obtain template PDB files. Since the effect of template structures is small in most cases (except when only limited MSAs are available), you can turn off --templates.

Nuta0 commented 7 months ago

Is there a way of obtaining template PDB files locally? And if not, is it feasible to obtain the templates for many predictions (>10000) from the MSA server, given the server's resource limitations?

YoshitakaMo commented 7 months ago

Currently, colabfold_search can generate a list of template PDB files together with MSA files. For example,

MMSEQS_PATH="/path/to/your/mmseqs2/for_colabfold"
DATABASE_PATH="/mnt/databases"
INPUTFILE="ras_raf.fasta"
OUTPUTDIR="ras_raf"

colabfold_search \
  --use-env 1 \
  --use-templates 1 \
  --db-load-mode 2 \
  --db2 pdb100_230517 \
  --mmseqs ${MMSEQS_PATH}/bin/mmseqs \
  --threads 4 \
  ${INPUTFILE} \
  ${DATABASE_PATH} \
  ${OUTPUTDIR}

Then, use colabfold_batch with --pdb-hit-file PDBHITFILE, where PDBHITFILE is generated by colabfold_search. Note that an mmCIF file database (/path/to/pdb_mmcif/mmcif_files) is required on your machine, just as for the original AlphaFold2.

INPUTFILE="RAS_RAF.a3m"
PDBHITFILE="RAS_RAF_pdb100_230517.m8"
LOCALPDBPATH="/path/to/pdb_mmcif/mmcif_files"
RANDOMSEED=0

colabfold_batch \
  --amber \
  --templates \
  --use-gpu-relax \
  --pdb-hit-file ${PDBHITFILE} \
  --local-pdb-path ${LOCALPDBPATH} \
  --random-seed ${RANDOMSEED} \
  ${INPUTFILE} \
  ras_raf

Nuta0 commented 7 months ago

That's great. Is there a way of doing this for a batch of proteins with their respective .a3m and .m8 files?

Nuta0 commented 7 months ago

I am trying to use a fasta file with multiple complexes as an input for colabfold_search. This is my code:

module load Miniconda3/22.11.1-1
eval "$(conda shell.bash hook)"
conda activate /data/gpfs/projects/punim1869/shared_bin/localcolabfold/colabfold-conda
module load MMseqs2/15-6f452

DATABASE_PATH="/data/gpfs/datasets/mmseqs/uniref30_2302"

colabfold_search \
  --use-env 1 \
  --use-templates 1 \
  --db2 pdb100_230517 \
  --threads 16 \
  input/${input_file} \
  ${DATABASE_PATH} \
  msas

I get this error:

Traceback (most recent call last):
  File "/data/gpfs/projects/punim1869/shared_bin/localcolabfold/colabfold-conda/bin/colabfold_search", line 8, in <module>
    sys.exit(main())
  File "/data/gpfs/projects/punim1869/shared_bin/localcolabfold/colabfold-conda/lib/python3.10/site-packages/colabfold/mmseqs/search.py", line 385, in main
    os.rename(
FileNotFoundError: [Errno 2] No such file or directory: 'msas/pdb100_230517.m8' -> 'msas/Complex_2_pdb100_230517.m8'

Does colabfold_search support multiple sequences as input when generating both .a3m and .m8 files?

YoshitakaMo commented 7 months ago

Does colabfold_search support multiple sequences as input when generating both .a3m and .m8 files?

Yes, you can obtain a3m files for multiple inputs.

Here is my example. I'm using ColabFold 1.5.5 (a00ce1bcc477491d7693e3816d21ea3fc2cf40fd).

-rw-r--r-- 1 moriwaki staffs   5797891705 May 22  2023 uniref30_2302_db_mapping
-rw-r--r-- 1 moriwaki staffs    667957493 May 22  2023 uniref30_2302_db_taxonomy
-rw-r--r-- 1 moriwaki staffs  64064274015 Jun 13  2023 pdb100_a3m.ffdata
-rw-r--r-- 1 moriwaki staffs      6389810 Jun 13  2023 pdb100_a3m.ffindex
-rw-r--r-- 1 moriwaki staffs  43200163261 Oct  9 17:34 uniref30_2302_db_h
-rw-r--r-- 1 moriwaki staffs   8910693488 Oct  9 17:35 uniref30_2302_db_h.index
-rw-r--r-- 1 moriwaki staffs            4 Oct  9 17:35 uniref30_2302_db_h.dbtype
-rw-r--r-- 1 moriwaki staffs   5787495369 Oct  9 17:36 uniref30_2302_db
-rw-r--r-- 1 moriwaki staffs    879290728 Oct  9 17:36 uniref30_2302_db.index
-rw-r--r-- 1 moriwaki staffs            4 Oct  9 17:36 uniref30_2302_db.dbtype
-rw-r--r-- 1 moriwaki staffs  83036144795 Oct  9 17:57 uniref30_2302_db_seq
-rw-r--r-- 1 moriwaki staffs   8957791292 Oct  9 17:58 uniref30_2302_db_seq.index
-rw-r--r-- 1 moriwaki staffs            4 Oct  9 17:58 uniref30_2302_db_seq.dbtype
-rw-r--r-- 1 moriwaki staffs   8709887243 Oct  9 18:07 uniref30_2302_db_aln
-rw-r--r-- 1 moriwaki staffs    867494002 Oct  9 18:07 uniref30_2302_db_aln.index
-rw-r--r-- 1 moriwaki staffs            4 Oct  9 18:07 uniref30_2302_db_aln.dbtype
lrwxrwxrwx 1 moriwaki staffs           24 Oct  9 18:07 uniref30_2302_db_seq_h.index -> uniref30_2302_db_h.index
lrwxrwxrwx 1 moriwaki staffs           25 Oct  9 18:07 uniref30_2302_db_seq_h.dbtype -> uniref30_2302_db_h.dbtype
lrwxrwxrwx 1 moriwaki staffs           18 Oct  9 18:07 uniref30_2302_db_seq_h -> uniref30_2302_db_h
-rw-r--r-- 1 moriwaki staffs 228709249024 Oct  9 18:20 uniref30_2302_db.idx
-rw-r--r-- 1 moriwaki staffs          506 Oct  9 18:20 uniref30_2302_db.idx.index
-rw-r--r-- 1 moriwaki staffs            4 Oct  9 18:20 uniref30_2302_db.idx.dbtype
lrwxrwxrwx 1 moriwaki staffs           24 Oct  9 18:21 uniref30_2302_db.idx_mapping -> uniref30_2302_db_mapping
lrwxrwxrwx 1 moriwaki staffs           25 Oct  9 18:21 uniref30_2302_db.idx_taxonomy -> uniref30_2302_db_taxonomy
-rw-r--r-- 1 moriwaki staffs            0 Oct  9 18:21 UNIREF30_READY
-rw-r--r-- 1 moriwaki staffs  25108896515 Oct 10 09:07 colabfold_envdb_202108_db_h
-rw-r--r-- 1 moriwaki staffs  18036930897 Oct 10 09:09 colabfold_envdb_202108_db_h.index
-rw-r--r-- 1 moriwaki staffs            4 Oct 10 09:09 colabfold_envdb_202108_db_h.dbtype
-rw-r--r-- 1 moriwaki staffs  26732224605 Oct 10 09:14 colabfold_envdb_202108_db
-rw-r--r-- 1 moriwaki staffs   5260769931 Oct 10 09:15 colabfold_envdb_202108_db.index
-rw-r--r-- 1 moriwaki staffs            4 Oct 10 09:15 colabfold_envdb_202108_db.dbtype
-rw-r--r-- 1 moriwaki staffs  92749953996 Oct 10 09:46 colabfold_envdb_202108_db_seq
-rw-r--r-- 1 moriwaki staffs  18917335740 Oct 10 09:49 colabfold_envdb_202108_db_seq.index
-rw-r--r-- 1 moriwaki staffs            4 Oct 10 09:49 colabfold_envdb_202108_db_seq.dbtype
-rw-r--r-- 1 moriwaki staffs  27929446713 Oct 10 09:57 colabfold_envdb_202108_db_aln
-rw-r--r-- 1 moriwaki staffs   5214433987 Oct 10 09:58 colabfold_envdb_202108_db_aln.index
-rw-r--r-- 1 moriwaki staffs            4 Oct 10 09:58 colabfold_envdb_202108_db_aln.dbtype
-rw-r--r-- 1 moriwaki staffs         1907 Oct 10 11:23 colabfold_envdb_202108_db.idx.index
-rw-r--r-- 1 moriwaki staffs            4 Oct 10 11:23 colabfold_envdb_202108_db.idx.dbtype
lrwxrwxrwx 1 moriwaki staffs           33 Oct 10 13:38 colabfold_envdb_202108_db_seq_h.index -> colabfold_envdb_202108_db_h.index
lrwxrwxrwx 1 moriwaki staffs           34 Oct 10 13:39 colabfold_envdb_202108_db_seq_h.dbtype -> colabfold_envdb_202108_db_h.dbtype
lrwxrwxrwx 1 moriwaki staffs           27 Oct 10 13:39 colabfold_envdb_202108_db_seq_h -> colabfold_envdb_202108_db_h
-rw-r--r-- 1 moriwaki staffs 562358472704 Oct 16 01:28 colabfold_envdb_202108_db.idx
-rw-r--r-- 1 moriwaki staffs            0 Oct 10 13:40 COLABDB_READY
-rw-r--r-- 1 moriwaki staffs           25 Oct 10 13:47 pdb100_230517.source
-rw-r--r-- 1 moriwaki staffs     27989933 Oct 10 13:47 pdb100_230517_h
-rw-r--r-- 1 moriwaki staffs            4 Oct 10 13:47 pdb100_230517_h.dbtype
-rw-r--r-- 1 moriwaki staffs     65092975 Oct 10 13:47 pdb100_230517
-rw-r--r-- 1 moriwaki staffs            4 Oct 10 13:47 pdb100_230517.dbtype
-rw-r--r-- 1 moriwaki staffs      6279753 Oct 10 13:47 pdb100_230517.index
-rw-r--r-- 1 moriwaki staffs      6116273 Oct 10 13:47 pdb100_230517_h.index
-rw-r--r-- 1 moriwaki staffs      5178372 Oct 10 13:47 pdb100_230517.lookup
-rw-r--r-- 1 moriwaki staffs   1443213312 Oct 10 13:47 pdb100_230517.idx
-rw-r--r-- 1 moriwaki staffs          383 Oct 10 13:47 pdb100_230517.idx.index
-rw-r--r-- 1 moriwaki staffs            4 Oct 10 13:47 pdb100_230517.idx.dbtype
-rw-r--r-- 1 moriwaki staffs            0 Oct 10 13:47 PDB_READY
-rw-r--r-- 1 moriwaki staffs            0 Oct 10 13:54 PDB100_READY


Then I obtained the a3m and m8 files in the `manual_ras_raf` directory. `RAS_RAF_pdb100_230517.m8` contains:

101 7kyz_A 0.856 188 26 1 1 188 1 187 3.275E-63 215 167M1I20M
101 2mse_B 0.848 185 28 0 1 185 1 185 1.583E-62 213 185M
101 7tlk_B 0.934 167 11 0 1 167 1 167 4.075E-62 212 167M
101 7t1f_A 0.923 169 13 0 1 169 1 169 4.075E-62 212 169M
...
101 6pgo_B 0.804 169 16 2 1 169 1 152 1.592E-46 167 31M7I20M10I101M
101 4m1s_C 0.796 167 15 2 1 167 1 148 2.987E-46 166 28M10I23M9I97M
101 6o62_A 0.299 167 109 3 5 170 8 167 7.677E-46 165 27M6I68M1I15M1D49M
101 4m21_C 0.789 166 15 2 2 167 1 146 6.945E-45 162 27M11I22M9I97M
102 4g3x_B 1.000 77 0 0 5 81 1 77 1.003E-28 110 77M
102 3kud_B 0.986 76 1 0 6 81 1 76 4.892E-28 108 76M
102 3kuc_B 0.973 76 2 0 6 81 1 76 9.221E-28 107 76M
102 1rrb_A 0.986 76 1 0 6 81 1 76 1.266E-27 107 76M
...
102 2mse_D 0.578 76 29 1 6 81 1 73 5.603E-22 91 47M3I26M
102 2mse_D 0.578 76 29 1 6 81 1 73 5.603E-22 91 47M3I26M
102 5yxi_A 0.500 74 37 0 5 78 3 76 1.561E-18 81 74M
102 6ntd_B 0.733 75 9 1 5 79 1 64 9.674E-17 76 47M11I17M
102 6ntc_B 0.706 75 10 2 6 80 1 63 9.682E-13 64 17M3I24M9I22M



Note that `101` and `102` represent the first and second sequence in the input fasta file, respectively.
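For downstream bookkeeping, the hit file can be grouped by that query index. A minimal sketch, assuming the standard m8 (BLAST-tab) layout shown above, where the first two columns are the query index and the target chain (`m8_text` is a stand-in for the real file contents):

```python
from collections import defaultdict

# Two rows in the m8 (BLAST-tab) layout shown above; real files are
# tab-separated, but split() handles either whitespace style.
m8_text = """\
101 7kyz_A 0.856 188 26 1 1 188 1 187 3.275E-63 215 167M1I20M
102 4g3x_B 1.000 77 0 0 5 81 1 77 1.003E-28 110 77M"""

hits_by_query = defaultdict(list)
for line in m8_text.splitlines():
    fields = line.split()
    query_index, target_chain = fields[0], fields[1]
    hits_by_query[query_index].append(target_chain)

print(hits_by_query["101"])  # ['7kyz_A']
print(hits_by_query["102"])  # ['4g3x_B']
```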

Nuta0 commented 7 months ago

Thank you for this example. But how would I use an input file like this?

>complex_a
MFAWVSVSQSYGVIEILKDIMNKVMGIKKKGTNTGITVEDFEQMGEEEVRQHLHDFLRDKKYLVVMDDVWTVDVWRQIHQIFPNVNNGSRILLTTRNMEVARHAEPWIPPHEPHLLNDTHSLELFCRKAFPANQDVPTELEPLSQKLAKR:
MCGGLPLALVVLGGLMSRKDPSYDTWLRVAQSMNWESSGEGQECLGILGLSYNDLPYQLKPCFLYITAFPEDSIIPVSKLARLWIAEGFILEEQRQTMEDTARDWLDELVQRCMIQVVKRSVTRGRVKSIRIHDMLRDFGLLEARKDGFLHVCSTDA
>complex_b
MVVSSHRVAFHDRINEEVAVSSPHLRTLLGSNLILTNAGRFLNGLNLLRVLDLEGARDLKKLPKQMGNMIHLRYLGLRRTGLKRLPSSIGHLLNLQTLDARGTYISWLPKSFWKIRTLRYVYINILAFLSAPIIG:
MDHKNLQALKITWINVDVMDMIRLGGIRFIKNWVTTSDSAEMAYERIFSESFGKSLEKMDSLVSLNMYVKELPKDIFFAHARPLPKLRSLYLGG
>complex_c
MSFQQQQLPDITQFPPNLTKLILISFHLEQDPMPVLEKLPNLRLLELCGAYHGKSMSC:
MSAGGFPRLQHLILEDLYDLEAWRVEVGAMPRLTNLTIRWCGMLKMLPEGLQHVTTVRELKLIDMPREFSDKVRSEDGYKVTHPLHYY
YoshitakaMo commented 7 months ago

I've pushed a fix for this issue, @Nuta0. See https://github.com/sokrypton/ColabFold/issues/567. Please update your (local)ColabFold and try using the CSV format for input.

Nuta0 commented 7 months ago

@YoshitakaMo Thank you for fixing this.

I have noticed another related issue: the templates picked at the beginning of the prediction differ when I use colabfold_batch directly compared to when I run colabfold_search followed by colabfold_batch.

I use this input:

id,sequence
3kud,MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQH:PSKTSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKKLKVRGLQPECCAVFRLLHEHKGKKARLDWNTDAASLIGEELQVDFL
ras,MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMSCKCVLS
1BJP_2,PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASKVRR:PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASKVRR
1BJP_ras,PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASKVRR:PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASKVRR:MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQHKLRKLNPPDESGPGCMSCKCVLS

When using `colabfold_batch --templates --amber --use-gpu-relax input.csv prediction`, these are the contents of the `1BJP_2_template_domain_names.json` file:

{"A": ["3mb2_C", "2fm7_B", "4fdx_A", "3ry0_B", "1bjp_A", "6fps_P", "4faz_C", "7m59_B", "6bgn_C", "1otf_D", "3abf_B", "5clo_C", "6fps_R", "7xuy_A", "5cln_I", "6blm_A", "7puo_F", "2op8_A", "4x1c_F", "6blm_A"], "B": ["3mb2_C", "2fm7_B", "4fdx_A", "3ry0_B", "1bjp_A", "6fps_P", "4faz_C", "7m59_B", "6bgn_C", "1otf_D", "3abf_B", "5clo_C", "6fps_R", "7xuy_A", "5cln_I", "6blm_A", "7puo_F", "2op8_A", "4x1c_F", "6blm_A"]}

However, when doing:

MMSEQS_PATH="/apps/easybuild-2022/easybuild/software/MPI/GCC/11.3.0/OpenMPI/4.1.4/MMseqs2/15-6f452/bin/mmseqs"
DATABASE_PATH="/data/gpfs/datasets/mmseqs/uniref30_2302"
INPUTFILE="input.csv"

colabfold_search \
  --use-env 1 \
  --use-templates 1 \
  --db-load-mode 2 \
  --mmseqs ${MMSEQS_PATH} \
  --db2 pdb100_230517 \
  --threads 4 \
  ${INPUTFILE} \
  ${DATABASE_PATH} \
  msas
INPUTFILE="1BJP_2.a3m"
PDBHITFILE="1BJP_2_pdb100_230517.m8"
LOCALPDBPATH="/data/scratch/datasets/alphafold/v2.3.2/pdb_mmcif/mmcif_files"
RANDOMSEED=0

colabfold_batch \
  --amber \
  --templates \
  --use-gpu-relax \
  --pdb-hit-file ${PDBHITFILE} \
  --local-pdb-path ${LOCALPDBPATH} \
  --random-seed ${RANDOMSEED} \
  ${INPUTFILE} \
  prediction

then this is the 1BJP_2_template_domain_names.json file:

{"A": ["4x1c_H", "1bjp_B", "1bjp_A", "1bjp_E", "6fps_N", "6fps_Q", "6fps_P", "7xuy_A", "3ry0_B", "3ry0_A", "2op8_B", "2op8_A", "7puo_F", "7puo_C", "4x1c_G", "7puo_B", "7puo_D", "7puo_E", "7puo_A", "4faz_C"], "B": ["4x1c_H", "1bjp_B", "1bjp_A", "1bjp_E", "6fps_N", "6fps_Q", "6fps_P", "7xuy_A", "3ry0_B", "3ry0_A", "2op8_B", "2op8_A", "7puo_F", "7puo_C", "4x1c_G", "7puo_B", "7puo_D", "7puo_E", "7puo_A", "4faz_C"]}

Clearly the templates are similar between the two methods, and the resulting predictions are also similar in this instance, but I have had cases where the predicted structures were significantly different. Is this intended behaviour?
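One way to gauge the difference is to compare the two template sets directly. A minimal sketch using the two lists copied from the JSON files above (`server_templates` and `local_templates` are just my labels for the colabfold_batch-only run and the colabfold_search + colabfold_batch run):

```python
# Template chains reported by the two workflows (copied from the JSON above).
server_templates = ["3mb2_C", "2fm7_B", "4fdx_A", "3ry0_B", "1bjp_A", "6fps_P",
                    "4faz_C", "7m59_B", "6bgn_C", "1otf_D", "3abf_B", "5clo_C",
                    "6fps_R", "7xuy_A", "5cln_I", "6blm_A", "7puo_F", "2op8_A",
                    "4x1c_F", "6blm_A"]
local_templates = ["4x1c_H", "1bjp_B", "1bjp_A", "1bjp_E", "6fps_N", "6fps_Q",
                   "6fps_P", "7xuy_A", "3ry0_B", "3ry0_A", "2op8_B", "2op8_A",
                   "7puo_F", "7puo_C", "4x1c_G", "7puo_B", "7puo_D", "7puo_E",
                   "7puo_A", "4faz_C"]

# Overlap at the chain level and at the PDB-entry level (first 4 characters).
shared_chains = set(server_templates) & set(local_templates)
shared_pdbs = {t[:4] for t in server_templates} & {t[:4] for t in local_templates}
print(len(shared_chains))   # 7 identical chains
print(sorted(shared_pdbs))  # 8 shared PDB entries
```

Most of the disagreement here is at the chain level; the underlying PDB entries overlap much more heavily.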

Cryptheon commented 6 months ago

Hi @YoshitakaMo,

How long does the search usually take? I followed your instructions (https://qiita.com/Ag_smith/items/bfcf94e701f1e6a2aa90) and installed everything on our HPC system, though without loading the database fully into RAM. I tested it with a few proteins, but the search takes quite a long time (1 h+); I assume this is abnormally long.

Do you know of anything obvious that could cause these search times? In my case I ran:

colabfold_search \
  --use-env 1 \
  --use-templates 0 \
  --db-load-mode 2 \
  --mmseqs /projects/0/prjs0859/ml/algorithms/colabfold/mmseqs/bin/mmseqs \
  --threads 8 \
  /projects/0/prjs0859/ml/inputs/alphafold/fastas/7XTB_5.fasta \
  /projects/2/managed_datasets/AlphaFold_mmseqs2/ \
  /projects/0/prjs0859/ml/outputs/msa/

Any input would be greatly appreciated, thanks!

milot-mirdita commented 6 months ago

Long search times are expected for single queries. colabfold_search is intended for larger scale runs with hundreds or thousands of queries.

It still works for single queries but doesn’t scale down well.
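In practice that means putting all queries into one multi-entry FASTA and calling colabfold_search once, rather than once per query. A minimal sketch of the merging step, assuming one query per `.fasta` file in a directory (the function and file names are hypothetical, not part of ColabFold):

```python
from pathlib import Path

def merge_fastas(fasta_dir: str, merged_path: str) -> int:
    """Concatenate all .fasta files in fasta_dir into one multi-entry file.

    Returns the number of files merged.
    """
    paths = sorted(Path(fasta_dir).glob("*.fasta"))
    with open(merged_path, "w") as out:
        for p in paths:
            # Ensure each entry ends with exactly one newline before the next header.
            out.write(p.read_text().rstrip() + "\n")
    return len(paths)
```

colabfold_search can then be run once on the merged file, e.g. `colabfold_search merged.fasta $DATABASE_PATH msas`.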

crisdarbellay commented 5 months ago

Hello, I followed the instructions at https://qiita.com/Ag_smith/items/bfcf94e701f1e6a2aa90, but I still see long search times (~1 h). I have a large amount of RAM (~750 GB), so I should be able to reproduce roughly the same speed as the ColabFold server, right? I have around 5,000 predictions to make. How could I optimize the run and search time? Thank you for your work!

Nuta0 commented 5 months ago

@crisdarbellay For 5,000 predictions, colabfold_search takes around 6 h with 16 CPU cores in my tests.

milot-mirdita commented 5 months ago

This sounds about right; in the paper we show that we ran a proteome with 1.7k proteins in 2 h on a 24-core CPU.

The server is optimized for low latency for single queries, not for the highest possible throughput. colabfold_search is intended for that.

Cryptheon commented 5 months ago

So how is it possible that obtaining an MSA via the server takes mere seconds? Is it a matter of just using colabfold_search on a much larger batch? I have around 6 million proteins for which I need to compute MSAs. What am I missing?

milot-mirdita commented 5 months ago

The server takes about one minute per MSA (this can become much longer for long sequences). The MSAs stay cached for a while, so if you request the same sequence again it will not recompute the MSA but return it from the cache (almost instantly).

colabfold_search's raw throughput is still much better than the server's: it should be much faster than the 6 million × 1 minute (divided by the number of workers) the server would take, and much, much faster than running colabfold_search 6 million times. But that still means you will need to throw quite a bit of CPU at the problem of computing 6 million MSAs.

You can reduce the sensitivity slightly if you really want to speed up the MSA computation part.
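At that scale the input usually also has to be split into shards so that many colabfold_search jobs can run in parallel on separate nodes. A minimal sketch of the sharding step (the function name and shard size are arbitrary assumptions, not part of ColabFold):

```python
from pathlib import Path

def shard_fasta(fasta_path: str, out_dir: str, seqs_per_shard: int = 10000) -> int:
    """Split a multi-entry FASTA into shards of seqs_per_shard entries each.

    Returns the number of shard files written.
    """
    text = Path(fasta_path).read_text()
    # Re-attach the '>' stripped by split(); skip the empty leading chunk.
    entries = [">" + e for e in text.split(">") if e.strip()]
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    n_shards = 0
    for i in range(0, len(entries), seqs_per_shard):
        shard = out / f"shard_{n_shards:05d}.fasta"
        shard.write_text("".join(entries[i:i + seqs_per_shard]))
        n_shards += 1
    return n_shards
```

Each shard can then be submitted as its own colabfold_search job, one per CPU node.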

ahof1704 commented 4 months ago

Hi!

I am in a similar position, where I have to predict thousands of structures, and am considering running colabfold_search first to speed up the process. If I understand correctly, colabfold_search doesn't need GPUs, right? Could I compute the MSAs on a CPU node with many cores and large RAM and then move them to a GPU node for structure prediction?

Thanks!

crisdarbellay commented 4 months ago

@Nuta0 Could I see an example where you run predictions on that many queries? Are you using multiple fasta files or one fasta file with all sequences? I think I'm doing something wrong...