google-deepmind / alphafold

Open source code for AlphaFold 2.
Apache License 2.0

Faster MSA computations with chunked DBs on Colab #437

Open Meghpal opened 2 years ago

Meghpal commented 2 years ago

There is a stark difference in the time taken to compute the MSA on Colab versus AlphaFold's actual implementation run through Docker. I was trying to figure out the differences, and the most obvious one seems to be that on Colab jackhmmer runs on chunks of the original databases.

I already have the whole dataset downloaded, and for the proteins I tested, almost the same number of sequences are found by both approaches. However, in the Colab version the MSA takes merely 20-30 minutes, while through Docker it takes 7-9 hours (mostly jackhmmer on uniref90 and mgnify), even when I get a total of <200 sequence matches for a protein with only 77 amino acids.

I am running on an AWS notebook instance (ml.g4dn.4xlarge), with all the data on the SSD.

I have some questions:

1) Why is the 7-9 hour approach suggested?
2) Is chunking the only reason MSA computation is so fast on Colab?
3) Might there be something wrong with my setup if jackhmmer on Colab takes less time than my local run?
4) jackhmmer chunking is currently only supported over the internet. Will it be as fast or faster if I implement it locally myself?
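To make question 4 concrete, here is a minimal sketch of what local chunking could look like: split a FASTA database into fixed-size chunks and build one jackhmmer command per chunk. The function names, chunk size, and option values are illustrative assumptions, not AlphaFold's code or settings. Merging the per-chunk Stockholm outputs is not shown, and whether this is actually faster on local disk is untested.

```python
from pathlib import Path
from typing import Iterator, List


def iter_fasta_records(fasta_path: Path) -> Iterator[str]:
    """Yield one FASTA record (header line plus sequence lines) at a time."""
    record: List[str] = []
    with open(fasta_path) as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            if line.startswith(">") and record:
                yield "".join(record)
                record = []
            record.append(line)
    if record:
        yield "".join(record)


def _write_chunk(out_dir: Path, index: int, records: List[str]) -> Path:
    """Write one chunk file and return its path."""
    path = out_dir / f"chunk_{index:04d}.fasta"
    path.write_text("".join(records))
    return path


def split_fasta(fasta_path: Path, out_dir: Path, records_per_chunk: int) -> List[Path]:
    """Split a FASTA file into chunks of at most records_per_chunk records."""
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths: List[Path] = []
    buffer: List[str] = []
    for record in iter_fasta_records(fasta_path):
        buffer.append(record)
        if len(buffer) == records_per_chunk:
            chunk_paths.append(_write_chunk(out_dir, len(chunk_paths), buffer))
            buffer = []
    if buffer:
        chunk_paths.append(_write_chunk(out_dir, len(chunk_paths), buffer))
    return chunk_paths


def jackhmmer_command(query: Path, chunk: Path, sto_out: Path,
                      total_db_sequences: int, n_cpu: int = 8) -> List[str]:
    """Build (but do not run) a jackhmmer command for a single chunk.

    -Z sets the database size used for E-value calculation, so that
    per-chunk E-values are computed as if the full database were searched.
    The option values here are assumptions chosen for illustration.
    """
    return [
        "jackhmmer",
        "-o", "/dev/null",      # discard the human-readable report
        "-A", str(sto_out),     # save the alignment in Stockholm format
        "--noali",
        "--cpu", str(n_cpu),
        "-N", "1",              # number of iterations (assumed)
        "-Z", str(total_db_sequences),
        str(query),
        str(chunk),
    ]
```

Each command could then be passed to `subprocess.run(cmd, check=True)`, one chunk at a time or several in parallel, with `total_db_sequences` set to the sequence count of the unsplit database.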

Here are the relevant timing logs for comparison. I believe jackhmmer is the slowest step, and I have run 3-4 proteins through and got similar timings.

I0419 09:01:46.545762 140559696049984 run_docker.py:255] I0419 09:01:46.545098 140281249204032 utils.py:36] Started Jackhmmer (uniref90.fasta) query
I0419 13:09:40.188118 140559696049984 run_docker.py:255] I0419 13:09:40.186011 140281249204032 utils.py:40] Finished Jackhmmer (uniref90.fasta) query in 14873.641 seconds
I0419 13:09:40.283886 140559696049984 run_docker.py:255] I0419 13:09:40.283068 140281249204032 utils.py:36] Started Jackhmmer (mgy_clusters_2018_12.fa) query
I0419 17:24:10.802730 140559696049984 run_docker.py:255] I0419 17:24:10.802150 140281249204032 utils.py:40] Finished Jackhmmer (mgy_clusters_2018_12.fa) query in 15270.519 seconds
I0419 17:24:11.070103 140559696049984 run_docker.py:255] I0419 17:24:11.069362 140281249204032 utils.py:36] Started HHsearch query
I0419 17:31:10.780151 140559696049984 run_docker.py:255] I0419 17:31:10.779571 140281249204032 utils.py:40] Finished HHsearch query in 419.710 seconds
I0419 17:31:11.022377 140559696049984 run_docker.py:255] I0419 17:31:11.021669 140281249204032 utils.py:36] Started HHblits query
I0419 18:09:46.100788 140559696049984 run_docker.py:255] I0419 18:09:46.100168 140281249204032 utils.py:40] Finished HHblits query in 2315.078 seconds
I0419 18:10:14.906161 140559696049984 run_docker.py:255] I0419 18:10:14.905608 140281249204032 pipeline.py:234] Uniref90 MSA size: 6040 sequences.
I0419 18:10:14.906352 140559696049984 run_docker.py:255] I0419 18:10:14.905785 140281249204032 pipeline.py:235] BFD MSA size: 1328 sequences.
I0419 18:10:14.906456 140559696049984 run_docker.py:255] I0419 18:10:14.905830 140281249204032 pipeline.py:236] MGnify MSA size: 248 sequences.
I0419 18:10:14.906560 140559696049984 run_docker.py:255] I0419 18:10:14.905873 140281249204032 pipeline.py:238] Final (deduplicated) MSA size: 6010 sequences.
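
As a rough cross-check of the logs above, the stage durations can be totalled with a short script. This is only a sketch: the regular expression assumes the "Finished ... query in ... seconds" wording seen in these log lines, and the sample text is trimmed to the message part of each line. For this run it works out to about 9.1 hours in total, of which the two jackhmmer searches account for roughly 92%.

```python
import re
from typing import Dict

# Message parts of the "Finished" lines from the logs above.
LOG_TEXT = """\
Finished Jackhmmer (uniref90.fasta) query in 14873.641 seconds
Finished Jackhmmer (mgy_clusters_2018_12.fa) query in 15270.519 seconds
Finished HHsearch query in 419.710 seconds
Finished HHblits query in 2315.078 seconds
"""

FINISHED_RE = re.compile(r"Finished (.+?) query in ([\d.]+) seconds")


def stage_durations(log_text: str) -> Dict[str, float]:
    """Map each stage name to its reported duration in seconds."""
    return {name: float(secs) for name, secs in FINISHED_RE.findall(log_text)}


durations = stage_durations(LOG_TEXT)
total_seconds = sum(durations.values())
jackhmmer_seconds = sum(
    secs for name, secs in durations.items() if name.startswith("Jackhmmer")
)

for name, secs in durations.items():
    print(f"{name}: {secs / 3600:.2f} h")
print(f"Total: {total_seconds / 3600:.2f} h")
print(f"Jackhmmer share: {jackhmmer_seconds / total_seconds:.0%}")
```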
andrejberg commented 1 year ago

As far as I can tell, Colab is not using jackhmmer but MMseqs2.