hypnopump / MiniFold

MiniFold: Deep Learning for Protein Structure Prediction inspired by DeepMind AlphaFold algorithm
MIT License
200 stars 34 forks

Distance prediction threshold #4

Open simonMoisselin opened 5 years ago

simonMoisselin commented 5 years ago

Hello,

Nice work!

In the predicting_distances notebook, why did you choose to predict distance classes instead of predicting distance values directly?

hypnopump commented 5 years ago

Hi, First of all, thanks for the kind words.

Answering your question: since we were training on 200x200 frames, I couldn't find a better way for the model to "ignore" the padding than converting the task into a classification problem and giving the padding class a very small weight. AlphaFold also predicted classes rather than direct distances.

The reason is that we don't want the model to output the exact distance between each pair of AAs, since that's pretty impractical; instead we use the outputs as constraints for a folding algorithm such as Rosetta's. (I'm still not exactly sure how to pass the outputs as constraints to that kind of system, but the method is reported in a paper that got SOTA results, which I believe was referenced in the AlphaFold blog post.) I'll try to train a model for direct distance prediction with MSE (Mean Squared Error) as the loss function once I have the 64x64 crops system working.
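The padding-as-a-class idea above might look roughly like this. A minimal sketch: the bin edges, the function name `distances_to_classes`, and the exact weight values are hypothetical, not the notebook's actual code.

```python
import numpy as np

# Hypothetical bin edges in Angstroms; class 0 is reserved for padding.
BIN_EDGES = [4.0, 6.0, 8.0, 10.0, 12.0, 16.0]  # 6 edges -> 7 real classes

def distances_to_classes(dist_map, pad_mask):
    """Convert an LxL distance map into integer class labels.

    Class 0 marks padded positions; classes 1..len(BIN_EDGES)+1 are
    distance bins, the last one a catch-all for long distances.
    """
    classes = np.digitize(dist_map, BIN_EDGES) + 1  # values in 1..7
    classes[pad_mask] = 0                           # overwrite padding
    return classes

# Near-zero weight on the padding class lets the loss ignore it.
class_weights = np.array([1e-3] + [1.0] * (len(BIN_EDGES) + 1))
```

With this setup the network never has to produce a real-valued distance for padded cells; it just learns to emit class 0 there, and the tiny weight keeps those cells from dominating the loss.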

jgreener64 commented 5 years ago

To add my opinion to the mix. When you predict a distance you need a degree of uncertainty associated with the prediction to use it effectively as a constraint. Predicting distances in bins is a useful way to do this. It is unclear how you would train a system that predicted distance and an uncertainty value together.

Also, distance predictions above a certain threshold (perhaps 20 Angstrom) are not accurate when using covariation data, as they just tell you the residues are not close in the protein. You wouldn't want a strong constraint on that. Predicting into distance bins lets you have a catch-all last bin that takes account of this.
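The catch-all-bin point can be sketched in code: only emit a constraint when the predicted probability mass sits in the informative bins. The bin midpoints, threshold, and the function `to_constraint` are all assumptions for illustration, not part of MiniFold.

```python
import numpy as np

# Hypothetical midpoints of the real bins; the final entry of a
# prediction is the catch-all (> ~20 A) bin.
BIN_MIDS = np.array([5.0, 7.0, 9.0, 11.0, 14.0, 18.0])

def to_constraint(probs, max_catchall=0.5):
    """Turn a predicted bin distribution into an optional (mean, std).

    If too much mass falls in the catch-all bin, the pair is merely
    "not close" and no constraint is returned.
    """
    if probs[-1] > max_catchall:
        return None                              # uninformative pair
    binned = probs[:-1] / probs[:-1].sum()       # renormalise real bins
    mean = float(np.dot(binned, BIN_MIDS))       # expected distance
    std = float(np.sqrt(np.dot(binned, (BIN_MIDS - mean) ** 2)))
    return mean, std
```

The spread of the distribution doubles as the uncertainty estimate, which is exactly what a single regressed distance value would not give you.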

simonMoisselin commented 5 years ago

OK, thank you! It makes sense to me now. How did you choose the threshold values? I am guessing they are derived from existing literature.

hypnopump commented 5 years ago

The threshold values are an arbitrary decision (although some constraints may apply), so they could be replaced with different ones. In general, predictions of distances beyond roughly 20 Angstrom (A) may be inaccurate, and some papers use bins of 0.5-1 A between approximately 4 A and 20 A.

My problem was that the classes are not equally represented in the data, so in order for the model to output a "visually pleasant" image I had to set weights for the classes, and as you can imagine, an optimization problem with 7 variables is much easier than one with 20. On top of that, I couldn't store such big tensors (with 20 classes) in memory while keeping a decent number of proteins (at least 100) to train on, so I had to reduce the number of classes. Since my network was not very deep, I also wanted the problem to stay "easy". Right now I'm working on a method to load the dataset from disk instead of RAM, so I expect to free some memory and perhaps increase the number of classes or the depth of the model.
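A common starting point for the class weights mentioned above is inverse class frequency, tuned from there. This is a hedged sketch: the helper `balanced_class_weights` and its defaults are mine, not MiniFold's.

```python
import numpy as np

def balanced_class_weights(class_maps, n_classes=8, pad_class=0,
                           pad_weight=1e-3):
    """Inverse-frequency class weights as a starting point for tuning.

    class_maps: integer class labels of any shape.  The padding class
    gets a fixed near-zero weight so padded positions are ignored.
    """
    counts = np.bincount(class_maps.ravel(),
                         minlength=n_classes).astype(float)
    counts[counts == 0] = 1.0                    # avoid division by zero
    weights = counts.sum() / (n_classes * counts)  # rare class -> big weight
    weights[pad_class] = pad_weight
    return weights
```

With 7 real classes this leaves a 7-dimensional tuning problem instead of 20, which is the trade-off described above.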

Guocanyong commented 4 years ago

How do you choose the value of the weighted_categorical_crossentropy for the loss function used in the distance prediction model?

hypnopump commented 4 years ago

Trial and error with different values. You're encouraged to share your weights if you find a combination that produces better results!
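For reference, the computation behind a weighted categorical crossentropy can be sketched in NumPy as below. This is an illustrative reimplementation; the actual `weighted_categorical_crossentropy` in the repo is a Keras loss and its signature may differ.

```python
import numpy as np

def weighted_categorical_crossentropy(y_true, y_pred, weights, eps=1e-7):
    """Crossentropy where each class contributes according to its weight.

    y_true: one-hot labels of shape (..., n_classes).
    y_pred: predicted probabilities, same shape.
    weights: per-class weight vector of length n_classes.
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)     # numerical stability
    # Each position is scaled by the weight of its true class.
    per_pos = -np.sum(y_true * np.log(y_pred) * weights, axis=-1)
    return per_pos.mean()
```

Scaling the weight of a class directly scales its contribution to the loss, so trial and error over the weight vector is effectively trial and error over how much the model cares about each distance bin.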