soedinglab / CCMpred

Protein Residue-Residue Contacts from Correlated Mutations predicted quickly and accurately.
http://www.ncbi.nlm.nih.gov/pubmed/25064567
GNU Affero General Public License v3.0

Export of raw parameters as numpy array, plus some minor fixes #28

Open kWeissenow opened 4 years ago

kWeissenow commented 4 years ago

A common use of plmDCA nowadays is to feed the raw Potts model parameters into machine learning methods, especially deep learning systems, to infer contact or distance maps. The most prominent recent example is DeepMind's AlphaFold, the winner of CASP13. CCMpred is widely used because of its GPU acceleration, but it has the drawback of writing the raw parameters to a text file, which can be huge (>10 GB) for longer proteins. Machine learning systems almost always expect numpy arrays as input; these are binary, more compact than text, and therefore faster to load.

I've implemented an option to write the raw parameters directly to numpy arrays with the command line switch '-y'. This removes the additional step of parsing the text output to generate a binary representation. For long proteins, this makes a huge difference: on a Tesla V100, an MSA with 50k sequences of a protein with 820 residues took 26m13s to process the traditional way (CCMpred -> raw text file -> parsing the file to generate a numpy array), whereas running CCMpred and writing a numpy array directly with my implementation took only 16m20s. The speedups are less dramatic for smaller proteins around the average length of 200-300 residues, but still amount to 1-2 minutes saved per sample. For my current dataset, which contains ~80k MSAs, I expect to save multiple weeks of computation time.
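
For context, the parsing step that '-y' makes unnecessary looks roughly like the sketch below. It is only an illustration and not the code in this PR: the function name `parse_raw_text` is hypothetical, and the text layout is an assumption (one line of 20 single-site values per residue, then for each pair a header line `# i j` followed by 21 rows of 21 coupling values). The array layout written by '-y' is not described here, so the sketch uses its own `(L, 20)` and `(L, L, 21, 21)` arrays.

```python
import numpy as np


def parse_raw_text(path):
    """Parse a raw Potts parameter text file into numpy arrays.

    ASSUMED layout (not taken from this PR): L lines with 20 single-site
    values each, then for every residue pair a header "# i j" followed by
    21 lines of 21 coupling values.
    """
    with open(path) as fh:
        lines = [ln.strip() for ln in fh if ln.strip()]

    # Single-site block: everything before the first "# i j" header.
    idx = 0
    single = []
    while idx < len(lines) and not lines[idx].startswith("#"):
        single.append([float(x) for x in lines[idx].split()])
        idx += 1
    h = np.array(single, dtype=np.float32)
    n = h.shape[0]

    # Pairwise blocks: one header line plus 21 rows per residue pair.
    J = np.zeros((n, n, 21, 21), dtype=np.float32)
    while idx < len(lines):
        _, i, j = lines[idx].split()
        i, j = int(i), int(j)
        rows = lines[idx + 1: idx + 22]
        block = np.array([[float(x) for x in r.split()] for r in rows],
                         dtype=np.float32)
        J[i, j] = block
        J[j, i] = block.T  # fill the symmetric counterpart
        idx += 22
    return h, J
```

Once the arrays have been stored with `np.save`, later runs can load them with `np.load` and skip the text parsing entirely, which is the step that dominates for long proteins.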

Since I assume that CCMpred is used for exactly this kind of workflow in many structure prediction research projects, I kindly invite you to integrate this addition into the main repository.