sokrypton / ColabFold

Making Protein folding accessible to all!
MIT License
2.01k stars 504 forks

colabfold_search and colabfold_batch take much longer than expected #272

Open lzhangUT opened 2 years ago

lzhangUT commented 2 years ago

Hi there, thanks for sharing ColabFold; it is greatly appreciated. We followed the exact steps under 'Running locally' on GitHub. We have a virtual machine with 6cCPUs, 112 GB RAM, and 1 GPU, plus a data disk of 2 TB.

  1. After the pip install, we can successfully run colabfold_batch.
  2. Since we are going to have many FASTA files, we decided to run the MSA step ourselves. We downloaded the data to the data disk (success!) and ran colabfold_search on 50 FASTA files. Since it is a batch, we used colabfold_search input_sequences_folder /path/to/db_folder msas --db-load-mode 0, as suggested on GitHub. It ran successfully, BUT it took ~5 hours for a batch of 50 sequences. That does not sound right, since the tutorial mentions less than 4 minutes for 20 sequences with one core. Why is that? How many CPUs does the one core correspond to? Are our computing resources the reason?
  3. Anyway, after that we ran colabfold_batch msas predictions. I would expect this step to take much, much less time, as it is inference and we have a GPU. BUT for the 50 a3m files it is taking forever: so far it has finished 22 sequences and has already used > 6 hours (screenshot attached). It seems inference still runs on each sequence sequentially, which does not intuitively match my idea of a batch, and each sequence takes about 13-15 minutes!!! Please let me know where it went wrong, or whether this is expected.

lzhangUT commented 2 years ago

@lucidrains @EnzoAndree @milot-mirdita any help would be greatly appreciated!

parasvcb commented 2 years ago

Hey, I am not a ColabFold developer, just a user. May I ask how many threads the 6cCPU you mentioned provides? A quick look at htop should tell you (the count may differ if hyperthreading is supported and enabled). I have noticed that colabfold_search has a hardcoded thread count of 64 in several places, especially in the mmseqs search options. If your system has fewer threads than that, it may behave erratically because of load imbalance and therefore take much longer. The following file has 3-4 such instances (the path may differ on your system): ~/miniconda3/lib/python3.9/site-packages/colabfold/mmseqs/search.py

Change those instances to os.cpu_count()-2. Do let me know if that works; I can also attach my edited file.
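
A minimal sketch of the idea, for illustration only: usable_threads and reserve are hypothetical names, not part of colabfold, and where the value gets substituted in search.py depends on the installed version (the file also needs import os if it does not already have it). The helper guards two edge cases of a bare os.cpu_count()-2: cpu_count() can return None, and on machines with 2 or fewer logical CPUs the subtraction would give zero or a negative number.

```python
import os


def usable_threads(reserve: int = 2) -> int:
    """Return a thread count that leaves `reserve` logical CPUs free, never below 1."""
    try:
        # CPUs this process is allowed to run on (Linux only); respects affinity masks.
        logical = len(os.sched_getaffinity(0))
    except AttributeError:
        # Platforms without sched_getaffinity: fall back to cpu_count(), which may be None.
        logical = os.cpu_count() or 1
    return max(1, logical - reserve)


if __name__ == "__main__":
    # The logical CPU count is the same number htop shows as separate CPU bars.
    print(f"logical CPUs reported by os.cpu_count(): {os.cpu_count()}")
    print(f"suggested thread count: {usable_threads()}")
```

Running this on the machine first shows the logical CPU count, which also answers the threads question above without opening htop.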