There is a stark difference in the time taken to compute the MSA on Colab versus AlphaFold's actual implementation run through docker. I was trying to figure out the differences, and the most obvious one seems to be that on Colab jackhmmer runs on chunks of the original dbs.
I already have the whole dataset downloaded, and for the proteins I tested, both approaches find almost the same number of sequences. However, in the Colab version the MSA takes merely 20-30 min, while through docker it takes 7-9 hours (mostly jackhmmer on uniref90 and mgnify, even for a protein with only 77 amino acids where I get a total of <200 sequence matches).
I am running on an AWS notebook instance (ml.g4dn.4xlarge), with all the data on the SSD.
I have some questions:
1) Why is the 7-9 hour approach suggested?
2) Is this the only reason MSA computation is so fast on Colab?
3) Might there be something wrong with my setup if jackhmmer on Colab takes less time than the local implementation?
4) jackhmmer chunking is currently only supported over the internet. Would it be as fast or faster if I implemented it locally myself?
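To make question 4 concrete, here is a minimal sketch of what I mean by local chunking. It is only an illustration under my own assumptions, not AlphaFold's code: the helper names (iter_fasta_records, split_fasta, jackhmmer_command), the chunk size, and the file names are all hypothetical. It splits a local FASTA db into chunk files and builds one jackhmmer command per chunk, passing -Z (jackhmmer's option for the database size used in E-value calculations) so that E-values refer to the full db rather than a single chunk. Whether this would actually be faster than one pass over the full db is exactly what I am unsure about.

```python
import os
import tempfile
from typing import Iterator, List


def iter_fasta_records(path: str) -> Iterator[str]:
    """Yield each FASTA record (header plus its sequence lines) as one string."""
    record: List[str] = []
    with open(path) as handle:
        for line in handle:
            # A new header closes the record collected so far.
            if line.startswith(">") and record:
                yield "".join(record)
                record = []
            record.append(line)
    if record:
        yield "".join(record)


def split_fasta(path: str, out_dir: str, records_per_chunk: int) -> List[str]:
    """Write the db at `path` into chunk files of at most `records_per_chunk` records."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths: List[str] = []
    out = None
    for i, record in enumerate(iter_fasta_records(path)):
        if i % records_per_chunk == 0:
            # Start a new chunk file every `records_per_chunk` records.
            if out is not None:
                out.close()
            chunk_path = os.path.join(out_dir, f"chunk_{len(chunk_paths)}.fasta")
            chunk_paths.append(chunk_path)
            out = open(chunk_path, "w")
        out.write(record)
    if out is not None:
        out.close()
    return chunk_paths


def jackhmmer_command(query: str, chunk: str, sto_out: str,
                      total_db_seqs: int, n_cpu: int = 8) -> List[str]:
    """Build (but do not run) a jackhmmer call for one chunk.

    The flag values here are assumptions for illustration only.
    -Z sets the db size used for E-values to the size of the full db.
    """
    return [
        "jackhmmer",
        "-o", "/dev/null",          # discard the human-readable report
        "-A", sto_out,              # save the alignment in Stockholm format
        "--noali",
        "--cpu", str(n_cpu),
        "-N", "1",                  # a single search iteration
        "-Z", str(total_db_seqs),
        query,
        chunk,
    ]
```

Each command could then be run with subprocess.run(cmd, check=True), one chunk at a time, and the resulting Stockholm files merged afterwards.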
Here are the relevant timing logs for comparison. I believe jackhmmer is the slowest step, and I have run 3-4 proteins through the pipeline and got similar timings.