hypnopump / MiniFold

MiniFold: Deep Learning for Protein Structure Prediction inspired by DeepMind AlphaFold algorithm
MIT License

Regarding input and distance matrix padding. #27

n4ndoz closed this issue 4 years ago

n4ndoz commented 4 years ago

Hi! Wonderful work here, and wonderful code as well. I have a few questions regarding your model and some of your input preparation steps. 1- Why did you implement padding as a new class rather than as a mask, multiplying the output of every Add layer by a binary mask so that nothing is backpropagated from the padded regions? 2- Why did you create a separate embedding for the distances instead of using only the threshold function?

hypnopump commented 4 years ago

Interesting comments:

  1. Do you mean adding the padding as a Keras layer at the beginning of the network? I wasn't sure how to do that, so I just did the padding in NumPy beforehand.
  2. I'm not sure what you're referring to. In the data preparation functions here: https://github.com/EricAlcaide/MiniFold/blob/master/models/distance_pipeline/distance_generator_data.py I use the same function to pad both the distance matrix and the PSSM.
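For readers following along, padding a distance matrix and a PSSM with one shared NumPy helper could look roughly like the sketch below. This is not the function from the linked file: `pad_to_length`, the 21-column PSSM, and the toy sizes are all assumptions made for illustration.

```python
import numpy as np

def pad_to_length(arr, max_len, axes, value=0.0):
    """Pad `arr` at the end of each axis in `axes` up to `max_len` (hypothetical helper)."""
    pad_width = [(0, 0)] * arr.ndim
    for ax in axes:
        if arr.shape[ax] > max_len:
            raise ValueError("sequence is longer than max_len; crop it first")
        pad_width[ax] = (0, max_len - arr.shape[ax])
    return np.pad(arr, pad_width, mode="constant", constant_values=value)

# Toy sizes; the real maximum length would be much larger.
L, max_len = 5, 8
dist = np.random.rand(L, L)    # L x L distance matrix
pssm = np.random.rand(L, 21)   # L x 21 PSSM (column count is an assumption)

dist_padded = pad_to_length(dist, max_len, axes=(0, 1))  # pad rows and columns
pssm_padded = pad_to_length(pssm, max_len, axes=(0,))    # pad only the length axis
print(dist_padded.shape, pssm_padded.shape)  # (8, 8) (8, 21)
```

The original values stay in the top-left corner, so cropping back to `[:L, :L]` recovers the unpadded matrix.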

The codebase is from a year and a half ago, so I don't remember all the details right now. If you could clarify what you're referring to, I think I would be able to explain more.

Thanks for the interest in the project!

n4ndoz commented 4 years ago

Hi!! Thanks for the quick reply!

  1. I am reusing some parts of your model and mainly modifying the res blocks. The main trick is to build a binary mask matrix (MaxL x MaxL; I'm using 256 so that I cover most of the protein length distribution in ProteinNet) where the L x L sub-block for each sequence is 1 and the rest is 0. This way, when you backprop, the gradients are 0 where there is no protein info and the error is not propagated. Does it work? Well, questionable, hahaha. But it is what RaptorX-Contact implemented.

  2. I just took a look at the embbeding_matrix function and understood. It pads the dist matrix, right?
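The masking trick from point 1 can be sketched in plain NumPy. This is only an illustration under stated assumptions, not code from MiniFold or RaptorX-Contact: `length_mask` and `mse_grad_wrt_features` are hypothetical names, the mean squared error stands in for whatever loss the real model uses, and the sizes are toy values instead of MaxL = 256.

```python
import numpy as np

def length_mask(length, max_len):
    """Binary (max_len, max_len) mask: 1 in the top-left length x length block, 0 elsewhere."""
    valid = (np.arange(max_len) < length).astype(np.float64)
    return np.outer(valid, valid)

def mse_grad_wrt_features(features, target, mask):
    """Gradient of a mean squared error w.r.t. `features` when the layer output is features * mask."""
    out = features * mask                      # masked layer output
    upstream = 2.0 * (out - target) / out.size # dLoss/d_out for the mean squared error
    return upstream * mask                     # chain rule through the elementwise multiply

max_len, length = 8, 5                         # toy sizes
rng = np.random.default_rng(0)
features = rng.random((max_len, max_len))
target = rng.random((max_len, max_len))

mask = length_mask(length, max_len)
grad = mse_grad_wrt_features(features, target, mask)

# No error signal reaches the padded rows and columns.
print(np.all(grad[length:, :] == 0), np.all(grad[:, length:] == 0))
```

Because the mask multiplies the output elementwise, the chain rule multiplies the incoming gradient by the same mask, which is why the padded positions contribute nothing to the update.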

Another question: you used alpha carbons as distance targets, right? You wrote that you applied the model to ProteinNet, but it doesn't store beta carbon coordinates, only N, Ca, C (CBeta being the "root" of the side chain). I'm asking because I've been trying to fetch the beta carbon coordinates from ProteinNet IDs and have been running into several issues with sequence/structure matching between PDB and ProteinNet.

Thanks a lot again for the answer. Have you been doing any other work in protein structure prediction? And nice paper on E-Swish.

hypnopump commented 4 years ago

Cool!

  1. Good luck! I would like to see the results!
  2. Yup. I took distances between C-alpha atoms as the prediction targets. I don't know whether there are differences with respect to PDB, I'm sorry.
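As a rough illustration of C-alpha distance targets, a pairwise distance matrix can be computed from an (L, 3) array of C-alpha coordinates as below. The function name `ca_distance_matrix` and the toy coordinates are assumptions; how the C-alpha atoms are extracted from a ProteinNet record is not shown here.

```python
import numpy as np

def ca_distance_matrix(ca_coords):
    """Pairwise Euclidean distances between C-alpha atoms; input (L, 3), output (L, L)."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]  # (L, L, 3) coordinate differences
    return np.linalg.norm(diff, axis=-1)

# Toy example: three residues on a straight line, 3.8 units apart.
ca = np.array([[0.0, 0.0, 0.0],
               [3.8, 0.0, 0.0],
               [7.6, 0.0, 0.0]])
dist = ca_distance_matrix(ca)
print(dist.shape)  # (3, 3)
```

The result is symmetric with a zero diagonal, and it can be padded or binned afterwards like any other L x L target.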

Thanks for the E-Swish comment, I did it during my last year of high school! Also, what do you think about my comment in the other thread?