wrong structure for 1GJJ_1_A?

wwang2 commented 2 years ago

Hi @jonathanking

First of all, thanks for open-sourcing this cool dataset. I have been using it to benchmark my model.

I just want to raise an issue about a seemingly problematic structure in the CASP12 data in sidechainnet:

seq: MPEFLEDPSVLTKDKLKSELVANNVTLPAGEQRKDVYVQLYLQHLTARNRPPLPAGTNSKGPPDFSSDEEREPTPVLGSGAAAAGRSRAAVGRKATKKTDKPRQEDKDDLDVTELTNEDLLDQLVKYGVNPGPIVGTTRKLYEKKLLKLREQGTESRSSTPLPTISSS
id:  1GJJ_1_A

When I try to visualize this structure, it looks like this:

[Image: Screen Shot 2021-11-03 at 11 34 18 AM]

So it seems to have proteins overlaid on top of each other.

Additionally, this protein leads to a very large loss and large (sometimes NaN) gradients for my model.
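As a rough illustration of how such an entry can be spotted, per-protein losses can be screened for NaN, infinite, or unusually large values. This is a minimal sketch in plain Python; the function name `split_finite`, the `max_loss` threshold, and the sample values are all hypothetical and not part of sidechainnet.

```python
import math

def split_finite(losses_by_id, max_loss=1e4):
    """Separate usable losses from those that are NaN, infinite, or above max_loss."""
    kept, flagged = {}, {}
    for pid, loss in losses_by_id.items():
        if math.isfinite(loss) and loss <= max_loss:
            kept[pid] = loss
        else:
            flagged[pid] = loss
    return kept, flagged

# Made-up losses for illustration only; "HYPOTHETICAL_ID" is not a real entry.
example = {"1GJJ_1_A": float("nan"), "HYPOTHETICAL_ID": 2.5}
kept, flagged = split_finite(example)
print(sorted(flagged))
```

Logging the flagged IDs, rather than silently dropping them, makes it easier to trace a bad gradient back to a specific entry.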

Thanks

wwang2 commented 2 years ago

I looked at the PDB entry for this protein (https://www.rcsb.org/structure/1GJJ), and I think your parsing is correct. I suspect the person who generated this structure submitted an ensemble of conformers instead of a single conformer. The entry information indicates it is supposed to have just one conformer. I am not sure how the sequence information matches the structure in this case.

jonathanking commented 2 years ago

Thank you, @wwang2, for reaching out to share this with me! Yes, I also see the RCSB structure file, and it is a tad annoying that I was not able to programmatically avoid this.

The authors' comments in the PDB file itself (replicated below) essentially say that there are two resolved domains of the structure connected by a disordered linker. The authors made the decision to align the two domains since they are structurally similar. Interesting choice, and definitely incompatible with any attempt to use it as-is for protein structure prediction!

Going forward, I may try to split up this entry into two pieces, but let me come up with a solution for you (and myself!) in the meantime.

Thanks again!

REMARK   3  IN THIS ENTRY THE LAST COLUMN REPRESENTS THE AVERAGE                
REMARK   3  RMS DIFFERENCE BETWEEN THE INDIVIDUAL SIMULATED                     
REMARK   3  ANNEALING STRUCTURES AND THE MEAN COORDINATE                        
REMARK   3  POSITIONS. ONLY COORDINATES FOR RESIDUES 1-50 (LAP2-N)              
REMARK   3  AND 111-153 (LAP2-C) ARE PROVIDED. THE LINKER CONNECTING            
REMARK   3  THESE TWO DOMAINS IS COMPLETELY DISORDERED.                         
REMARK   3  LIKEWISE THE C-TERMINAL RESIDUES (154-168) ARE DISORDERED.          
REMARK   3  SINCE THE TWO DOMAINS, LAP2-N AND LAP2-C, REORIENT                  
REMARK   3  ESSENTIALLY INDEPENDENTLY IN SOLUTION, THE COORDINATES              
REMARK   3  OF THE TWO DOMAINS HAVE BEEN BEST-FITTED TO EACH OTHER              
REMARK   3  SINCE THEY ARE STRUCTURALY VERY SIMILAR. THE LAP2-N                 
REMARK   3  DOMAIN BINDS DNA. THE LAP2-C DOMAIN BINDS THE                       
REMARK   3  BARRIER-TO-AUTOINTEGRATION FACTOR BAF.

wwang2 commented 2 years ago

Thanks for confirming this. That is indeed a bizarre choice made by the uploader. For now, I will simply skip this structure for training my model.
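Skipping the entry could look something like the sketch below. It assumes the data is available as parallel lists of IDs and sequences, which is a hypothetical layout chosen for illustration; `drop_entries` and `SKIP_IDS` are made-up names, not sidechainnet functions.

```python
# IDs to exclude from training; only 1GJJ_1_A is known to be affected from this thread.
SKIP_IDS = {"1GJJ_1_A"}

def drop_entries(ids, seqs, skip_ids=SKIP_IDS):
    """Return parallel lists of IDs and sequences with the flagged IDs removed."""
    kept = [(i, s) for i, s in zip(ids, seqs) if i not in skip_ids]
    kept_ids = [i for i, _ in kept]
    kept_seqs = [s for _, s in kept]
    return kept_ids, kept_seqs

# Placeholder data; "HYPOTHETICAL_ID" and both sequences are made up.
ids = ["HYPOTHETICAL_ID", "1GJJ_1_A"]
seqs = ["ACDE", "MPEF"]
kept_ids, kept_seqs = drop_entries(ids, seqs)
print(kept_ids)
```

Filtering once before training keeps the training loop itself free of special cases.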

jonathanking commented 2 years ago

Hi, @wwang2. I've made a small fix for this in the code itself which splits the offending entry into its two domains before loading the data. You may find this easier to handle than skipping it in your training loop.
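For readers who want to see the idea, here is a minimal sketch of splitting one entry into two domain entries using the residue ranges quoted in the REMARK text (1-50 and 111-153). It is not the actual code from the fix. The function `split_entry`, the `_N`/`_C` ID suffixes, and the placeholder data are hypothetical, and the sketch assumes that position in the sequence matches the residue numbering used in the REMARK.

```python
# Residue ranges (1-indexed, inclusive) as stated in the REMARK 3 text of the PDB file.
DOMAINS = {"N": (1, 50), "C": (111, 153)}

def split_entry(entry_id, seq, per_residue, domains=DOMAINS):
    """Split one entry into one piece per domain.

    per_residue is any sequence with one item per residue (e.g. coordinates).
    The new IDs (entry_id + "_" + label) are a made-up naming scheme.
    """
    if len(seq) != len(per_residue):
        raise ValueError("seq and per_residue must have the same length")
    pieces = {}
    for label, (start, end) in domains.items():
        lo, hi = start - 1, end  # 1-indexed inclusive -> 0-indexed half-open slice
        pieces[f"{entry_id}_{label}"] = (seq[lo:hi], per_residue[lo:hi])
    return pieces

# Placeholder inputs: 168 matches the last residue number in the REMARK,
# but neither the sequence nor the per-residue data is real.
seq = "A" * 168
coords = list(range(168))
pieces = split_entry("1GJJ_1_A", seq, coords)
print({name: len(s) for name, (s, _) in pieces.items()})
```

The linker (51-110) and the disordered C-terminal residues (154-168) are simply left out, since the REMARK says no coordinates are provided for them.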

For now, feel free to update the package via pip or from source. The next time I generate the data, I will make this fix more permanent.

Thank you again for pointing this out. I am excited to hear that you find the package useful. Take care!

jonathanking commented 2 years ago

You can see below how the two new entries replace the old one.

[Image: Screenshot]

wwang2 commented 2 years ago

Wow, that was quick. Thanks!

jonathanking commented 2 years ago

You're welcome!

By the way, someone pointed out (#40) that I introduced an issue for loading custom datasets. I've now fixed that issue in v0.7.3.

jonathanking / sidechainnet

wrong structure for 1GJJ_1_A? #38