patrickbryant1 / Umol

Protein-ligand structure prediction
[question]: Around training the model. #29

Open alexanderbonnet opened 1 week ago

alexanderbonnet commented 1 week ago

Hi! First of all, thanks for making your work so readily available.

I am looking to get a PyTorch reproduction of the repository going. I have not run into problems for inference (adapting from OpenFold and converting weights), but am running into a couple of challenges at train time, and wondered if you could help me understand some implementation details.


I see in the make_uniform function of predict.py that a comment says the amino acid type is set to glycine, but the zero index that remains actually sets the amino acid to alanine. Wouldn't this matter for pseudo_beta_fn and the inclusion of the ligand in the distogram loss? https://github.com/patrickbryant1/Umol/blob/f7cd2b4de09b4e7cc1b68606791dd1cc81deeebc/src/predict.py#L108
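For reference, a minimal sketch of why index 0 maps to alanine rather than glycine. The residue ordering below is the one used in AlphaFold's residue_constants; that Umol inherits it unchanged is an assumption, and `restype_of` is a hypothetical helper, not a function from the repository.

```python
# Residue-type ordering as in AlphaFold's residue_constants.
# Assumption: Umol uses the same ordering without modification.
RESTYPES = ["A", "R", "N", "D", "C", "Q", "E", "G", "H", "I",
            "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"]
RESTYPE_ORDER = {aa: i for i, aa in enumerate(RESTYPES)}


def restype_of(index: int) -> str:
    """Hypothetical helper: one-letter code for an aatype index."""
    return RESTYPES[index]


print(restype_of(0))       # A (alanine)
print(RESTYPE_ORDER["G"])  # 7 (glycine)
```

Under this ordering, an aatype array filled with zeros describes alanine tokens, and glycine would require index 7.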


In folding.py, for the backbone_loss, an "atom14_gt_exists_protein" feature is built. I presume this contains atom masks for the protein only, as opposed to "atom14_gt_exists", which must contain atoms for both the protein and the ligand?

https://github.com/patrickbryant1/Umol/blob/f7cd2b4de09b4e7cc1b68606791dd1cc81deeebc/src/net/model/folding.py#L648

What about in the sidechain_loss?

https://github.com/patrickbryant1/Umol/blob/f7cd2b4de09b4e7cc1b68606791dd1cc81deeebc/src/net/model/folding.py#L696


Thanks for your help!

patrickbryant1 commented 1 week ago

Hi,

I would like to explain this in detail, but don't have time right now. I will provide a better answer in the first weeks of July.

Right now what I can say is that the comment is probably wrong. What matters is that each 'ligand atom token' can be mapped to a CA and that the masks use only CA for the losses. Any amino acid selection will take care of this.

Hope this helps (somewhat).

Best,

Patrick

alexanderbonnet commented 1 week ago

Thanks for the quick answer!

You may disregard the second portion of the question; my issues were due to poor indexing on my part in one of the frame aligned point error losses. All looks good now and I am getting the expected behavior during training.

For the distogram loss, it still looks to me like using any amino acid other than glycine would essentially remove the ligand from the distogram loss, as the distances considered are between CBs (except for glycine, which uses CAs and would be compatible with setting the ligand heavy atoms to CAs).

https://github.com/patrickbryant1/Umol/blob/f7cd2b4de09b4e7cc1b68606791dd1cc81deeebc/src/net/model/tf/data_transforms.py#L308-L324

https://github.com/patrickbryant1/Umol/blob/f7cd2b4de09b4e7cc1b68606791dd1cc81deeebc/src/net/model/modules.py#L1244-L1273
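The concern can be made concrete with a small NumPy sketch of the pseudo-beta selection (CA for glycine, CB otherwise). This is a simplified re-implementation for illustration, not the repository's code. The atom37 slot indices follow AlphaFold's ordering, and the single "ligand token" with only its CA slot populated is a hypothetical input; if the ground-truth features also fill the CB slot for ligand tokens, the mask would be 1 for any residue type.

```python
import numpy as np

# atom37 slots in AlphaFold's ordering (N, CA, C, CB, O, ...).
CA_INDEX = 1
CB_INDEX = 3
# aatype indices in AlphaFold's residue ordering (assumed shared by Umol).
ALA_INDEX = 0
GLY_INDEX = 7


def pseudo_beta(aatype, all_atom_positions, all_atom_mask):
    """Pick CA for glycine and CB otherwise; return (positions, mask)."""
    is_gly = aatype == GLY_INDEX
    positions = np.where(is_gly[..., None],
                         all_atom_positions[..., CA_INDEX, :],
                         all_atom_positions[..., CB_INDEX, :])
    mask = np.where(is_gly,
                    all_atom_mask[..., CA_INDEX],
                    all_atom_mask[..., CB_INDEX])
    return positions, mask


# Hypothetical ligand atom token: coordinates stored in the CA slot only.
positions = np.zeros((1, 37, 3))
positions[0, CA_INDEX] = [1.0, 2.0, 3.0]
atom_mask = np.zeros((1, 37))
atom_mask[0, CA_INDEX] = 1.0

_, mask_as_ala = pseudo_beta(np.array([ALA_INDEX]), positions, atom_mask)
_, mask_as_gly = pseudo_beta(np.array([GLY_INDEX]), positions, atom_mask)
print(mask_as_ala, mask_as_gly)
```

With these inputs the alanine-typed token gets a pseudo-beta mask of 0 (its CB slot is empty), while the glycine-typed token gets a mask of 1.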

I think I should be fine for the most part, but would love to have detailed explanations regardless if you find the time.

Thanks again, Alexander

patrickbryant1 commented 3 days ago

Hi,

Great 👍

The distogram is predicted in bins mapped from the pair representation. Therefore, the amino acid type doesn't matter as long as the ground truth coordinates (CB for protein) are provided for that loss.
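A rough sketch of this idea, assuming a standard AlphaFold-style distogram loss: the function below takes only logits, ground-truth reference coordinates, and a per-token mask, so the residue type never enters the loss directly. The bin settings are AlphaFold's defaults and are assumed here; the function and variable names are illustrative, not taken from the repository.

```python
import numpy as np


def distogram_loss(logits, positions, mask,
                   first_break=2.3125, last_break=21.6875, num_bins=64):
    """Masked cross-entropy between predicted and true distance bins.

    logits:    [N, N, num_bins], projected from the pair representation
    positions: [N, 3] ground-truth reference coordinates
    mask:      [N], 1.0 where a reference coordinate exists
    """
    # Squared bin edges, compared against squared pairwise distances.
    sq_breaks = np.linspace(first_break, last_break, num_bins - 1) ** 2
    diff = positions[:, None, :] - positions[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1, keepdims=True)        # [N, N, 1]
    true_bins = np.sum(dist2 > sq_breaks, axis=-1)           # [N, N]

    # Numerically stable log-softmax over the bin axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -np.take_along_axis(log_probs, true_bins[..., None], axis=-1)[..., 0]

    pair_mask = mask[:, None] * mask[None, :]
    return float((nll * pair_mask).sum() / (1e-6 + pair_mask.sum()))


rng = np.random.default_rng(0)
n = 4
positions = rng.normal(size=(n, 3)) * 5.0
logits = np.zeros((n, n, 64))           # uniform prediction over bins
loss_all = distogram_loss(logits, positions, np.ones(n))
loss_none = distogram_loss(logits, positions, np.zeros(n))
print(loss_all, loss_none)
```

In this sketch a token contributes to the loss whenever its mask entry is 1, whichever atom slot its reference coordinate was read from.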

Hope this helps.

Best,

Patrick