jonathanking / sidechainnet

An all-atom protein structure dataset for machine learning.
BSD 3-Clause "New" or "Revised" License
322 stars 36 forks

dRMSD Calculations & Bond Angles. #19

Closed OsamaGhandour closed 3 years ago

OsamaGhandour commented 3 years ago

In your code, in the dRMSD loss calculation:

    for pc, tc, s in zip(pred_coordinates, true_coordinates, seq):
        # Remove batch_padding from true coords
        batch_padding = _tile((s != 20), 0, NUM_COORDS_PER_RES)
        tc = tc[batch_padding]
        missing_atoms = (tc == 0).all(axis=-1)
        tc = tc[~missing_atoms]
        pc = pc[~missing_atoms]

I see that you remove the padding from true_crds only, and that leads to this error: IndexError: The shape of the mask [1988] at index 0 does not match the shape of the indexed tensor [2058, 3] at index 0. It happens because true_crds and pred_crds no longer have the same shape, so I think we should remove the padding from pred_crds as well?

Also, in a different place, in the bond angles: in your nerf function you have t = self.ang[3] # thetas["n-ca-c"]

You take the n-ca-c angle from index 3, but in the SidechainNet paper the order is different: (B) Backbone bond angles (C-N-CA, N-CA-C, and CA-C-N; <-- in Figure (1)

Does that mean n-ca-c should be at index 4?

Also, could you please explain this line more clearly? missing_atoms = (tc == 0).all(axis=-1)

jonathanking commented 3 years ago

Hi, thanks a lot for your feedback!

  1. Hmm, I see where this could potentially be an issue. Do you have an example where this fails? Honestly, I have been using it without any problems, so thanks for bringing this to my attention. I can look into it more.

  2. The paper figure may not represent what's actually happening in the implementation for building residues. I don't think this should cause any problems.

  3. Given a tensor of (N x 3), where N is the number of atoms in a protein structure, there are many missing atoms that are marked with [0,0,0] in the true coordinates. Before comparing the true and predicted coordinates, we must first identify the location of these missing atoms in the true coordinate tensor, and then remove these entries from both the true and predicted tensors.
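
To make point 3 concrete, here is a minimal plain-Python sketch of the masking step. It uses nested lists instead of PyTorch tensors, and the helper name find_missing_atoms and the toy coordinates are made up for illustration. An atom counts as missing only when all three of its coordinates are zero, which is what (tc == 0).all(axis=-1) expresses.

```python
# Toy stand-ins for the (N x 3) coordinate tensors; all values are made up.
true_coords = [
    [1.0, 2.0, 3.0],
    [0.0, 0.0, 0.0],  # missing atom, marked with [0, 0, 0]
    [0.0, 5.0, 0.0],  # real atom: only some of its coordinates are zero
    [0.0, 0.0, 0.0],  # missing atom
]
pred_coords = [
    [1.1, 2.1, 2.9],
    [4.0, 4.0, 4.0],
    [0.2, 5.1, 0.1],
    [7.0, 7.0, 7.0],
]


def find_missing_atoms(coords):
    """Return one bool per atom: True when x, y and z are all exactly zero.

    Hypothetical helper that mirrors (tc == 0).all(axis=-1) for nested lists.
    """
    return [all(value == 0 for value in atom) for atom in coords]


missing_atoms = find_missing_atoms(true_coords)

# Apply the same mask to both structures so that their rows stay aligned.
true_kept = [atom for atom, gone in zip(true_coords, missing_atoms) if not gone]
pred_kept = [atom for atom, gone in zip(pred_coords, missing_atoms) if not gone]

print(missing_atoms)
print(len(true_kept), len(pred_kept))
```

The mask is computed from the true coordinates only, because a predicted structure has no reason to contain [0,0,0] rows; applying that one mask to both lists is what keeps atom i in the prediction paired with atom i in the target.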

OsamaGhandour commented 3 years ago

Thanks a lot for your fast reply.

I'm trying to adapt your code to my own work. While testing, I checked the batch dRMSD with this line: compute_batch_drmsd(batch.crds, batch.crds, batch.int_seqs). I know this is wrong and that I should pass the predicted coordinates instead of the true ones, but this was just for testing. I removed the padding from both tc and pc, and so far it works for me. I think this is correct, because we don't need the predicted values for padding residues when calculating the loss?
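
For a self-test like this one, the expected result is easy to state: a structure compared with itself should give a dRMSD of zero. Below is a minimal plain-Python sketch of the usual dRMSD definition (the root-mean-square difference between the two sets of pairwise atom distances). It is not sidechainnet's compute_batch_drmsd, and the names pairwise_distances and drmsd, as well as the toy coordinates, are illustrative assumptions.

```python
import math
from itertools import combinations


def pairwise_distances(coords):
    """Distances between every unordered pair of atoms, in a fixed order."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def drmsd(pred, true):
    """Root-mean-square difference between two sets of pairwise distances."""
    if len(pred) != len(true):
        raise ValueError("both structures must contain the same number of atoms")
    if len(pred) < 2:
        raise ValueError("at least two atoms are needed to form a distance")
    d_pred = pairwise_distances(pred)
    d_true = pairwise_distances(true)
    squared = [(p - t) ** 2 for p, t in zip(d_pred, d_true)]
    return math.sqrt(sum(squared) / len(squared))


# Toy coordinates, made up for illustration.
coords = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shifted = [[x + 5.0, y, z] for x, y, z in coords]            # rigid translation
stretched = [[2.0 * x, 2.0 * y, 2.0 * z] for x, y, z in coords]  # uniform scaling

print(drmsd(coords, coords))     # identical structures
print(drmsd(shifted, coords))    # translation keeps internal distances
print(drmsd(stretched, coords))  # scaling changes every distance
```

Because the measure only looks at internal distances, it needs no superposition of the two structures, but it does require both inputs to list the same atoms in the same order. That is why a length mismatch between the two coordinate sets has to be resolved before the comparison.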

jonathanking commented 3 years ago

So, I looked into this a little more, and I think it depends on how you are using the function. Have you seen the Colab example notebook?

This is the training loop from section 4.3 of the notebook.

    for epoch in range(20):
        print(f"Epoch {epoch}")
        progress_bar = tqdm(total=len(d['train']), smoothing=0)
        for batch in d['train']:
            # Prepare variables
            model_input = batch.seq_evo_sec.to(device)

            # Predict the angles in sin/cos format before transforming to radians
            predicted_angles_sincos = pssm_coord_model(model_input)
            predicted_angles = inverse_trig_transform(predicted_angles_sincos)

            # BatchedStructureBuilder can be used to generate atomic structures for
            # a given batch of proteins represented as angles (batch x L x NUM_ANGLES)
            sb = scn.BatchedStructureBuilder(batch.int_seqs, predicted_angles.cpu())
            predicted_coords = sb.build()
            loss = losses.compute_batch_drmsd(batch.crds, predicted_coords,
                                              batch.int_seqs)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(pssm_coord_model.parameters(), 2)
            optimizer.step()

Please note that the BatchedStructureBuilder returns a list of coordinate tensors that have been unpadded.

If you are using the code in the manner above (as I have), then the loss function is correct in not removing padding. Are you using the code in this way? Please let me know if you have further questions.
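
The shape bookkeeping behind this can be sketched in plain Python. The sketch assumes that _tile repeats each per-residue flag once per atom slot and that NUM_COORDS_PER_RES is 14; the latter is consistent with the shapes in the error message above (2058 = 147 × 14 and 1988 = 142 × 14). The helper tile_mask and the toy sequence are hypothetical.

```python
NUM_COORDS_PER_RES = 14  # assumption; consistent with 2058 = 147 * 14 in the error
PAD_TOKEN = 20           # padding value tested by `s != 20` in the loss


def tile_mask(residue_mask, n):
    """Repeat each per-residue flag n times to obtain a per-atom mask."""
    return [flag for flag in residue_mask for _ in range(n)]


# Toy integer sequence: 3 real residues padded to a batch length of 5.
int_seq = [4, 17, 9, PAD_TOKEN, PAD_TOKEN]
residue_mask = [token != PAD_TOKEN for token in int_seq]
atom_mask = tile_mask(residue_mask, NUM_COORDS_PER_RES)

# Rows in the padded true coordinates, as stored in the batch.
padded_true_len = len(int_seq) * NUM_COORDS_PER_RES
# Rows in a prediction that was built without padding.
unpadded_pred_len = sum(residue_mask) * NUM_COORDS_PER_RES

print(padded_true_len, sum(atom_mask), unpadded_pred_len)
```

After the mask is applied to the padded true coordinates, their length matches that of an unpadded prediction, so the later missing-atom mask fits both. If the prediction passed in is itself still padded, as when batch.crds is used for both arguments, the lengths differ and the index error appears.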

OsamaGhandour commented 3 years ago

I got it now and it works for me. Thanks a lot.

I have just one more question, though it is not related to this issue. What is the simplest code I can use to build model_input from only the one-hot sequence and the PSSM + IC?

jonathanking commented 3 years ago

You're welcome! Happy to help.

This should work for you:

    model_input = torch.cat([batch.seqs.float(), batch.evos.float()], dim=-1)