Closed by wwang2 2 years ago
I looked at the PDB entry for this protein (https://www.rcsb.org/structure/1GJJ), and I think your parsing is correct. I suspect the person who generated this structure submitted an ensemble of conformers instead of a single conformer, even though the entry information indicates it is supposed to have just one. I am not sure how the sequence information matches the structure in this case.
Thank you, @wwang2, for reaching out to share this with me! Yes, I have also looked at the RCSB structure file, and it is a tad annoying that I was not able to catch this programmatically.
The authors' comments in the PDB file itself (replicated below) essentially say that the structure has two resolved domains connected by a disordered linker. Because the two domains are structurally similar, the authors chose to superimpose them on each other. Interesting choice, and definitely incompatible with any attempt to use the entry as-is for protein structure prediction!
Going forward, I may try to split up this entry into two pieces, but let me come up with a solution for you (and myself!) in the meantime.
Thanks again!
REMARK 3 IN THIS ENTRY THE LAST COLUMN REPRESENTS THE AVERAGE
REMARK 3 RMS DIFFERENCE BETWEEN THE INDIVIDUAL SIMULATED
REMARK 3 ANNEALING STRUCTURES AND THE MEAN COORDINATE
REMARK 3 POSITIONS. ONLY COORDINATES FOR RESIDUES 1-50 (LAP2-N)
REMARK 3 AND 111-153 (LAP2-C) ARE PROVIDED. THE LINKER CONNECTING
REMARK 3 THESE TWO DOMAINS IS COMPLETELY DISORDERED.
REMARK 3 LIKEWISE THE C-TERMINAL RESIDUES (154-168) ARE DISORDERED.
REMARK 3 SINCE THE TWO DOMAINS, LAP2-N AND LAP2-C, REORIENT
REMARK 3 ESSENTIALLY INDEPENDENTLY IN SOLUTION, THE COORDINATES
REMARK 3 OF THE TWO DOMAINS HAVE BEEN BEST-FITTED TO EACH OTHER
REMARK 3 SINCE THEY ARE STRUCTURALY VERY SIMILAR. THE LAP2-N
REMARK 3 DOMAIN BINDS DNA. THE LAP2-C DOMAIN BINDS THE
REMARK 3 BARRIER-TO-AUTOINTEGRATION FACTOR BAF.
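One way superimposed domains like these could be flagged programmatically is to look for residues that are far apart in sequence but whose C-alpha atoms nearly coincide. The sketch below is only an illustration of that idea and is not part of the package: the function name, the 2.0 Å cutoff, and the minimum sequence separation are all assumptions, and the input is assumed to be a plain list of (x, y, z) C-alpha coordinates with `None` for unresolved residues.

```python
import math

def count_overlaps(ca_coords, min_seq_sep=3, cutoff=2.0):
    """Count residue pairs at least `min_seq_sep` apart in sequence
    whose C-alpha atoms are closer than `cutoff` angstroms.

    Hypothetical helper: the cutoff and separation are illustrative
    assumptions, not values taken from the package.
    """
    overlaps = 0
    for i in range(len(ca_coords)):
        if ca_coords[i] is None:  # unresolved residue, nothing to compare
            continue
        for j in range(i + min_seq_sep, len(ca_coords)):
            if ca_coords[j] is None:
                continue
            if math.dist(ca_coords[i], ca_coords[j]) < cutoff:
                overlaps += 1
    return overlaps
```

In a well-formed single conformer, consecutive C-alpha atoms sit roughly 3.8 Å apart and non-adjacent ones are rarely closer than that, so any nonzero count would be a reason to inspect the entry by hand.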
Thanks for confirming this. That is indeed a bizarre choice by the depositors. For now, I will simply skip this structure when training my model.
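A minimal sketch of what skipping the structure could look like, assuming the training examples are available as a list of dictionaries with an `"id"` string that contains the PDB code. That layout, the helper names, and the example IDs in the usage note are hypothetical and are not taken from the package.

```python
# PDB codes to exclude; 1GJJ is the entry discussed in this thread.
SKIP_PDB_CODES = {"1GJJ"}

def keep_entry(entry_id, skip_codes=SKIP_PDB_CODES):
    """Return False if the ID mentions any excluded PDB code (case-insensitive)."""
    lowered = entry_id.lower()
    return not any(code.lower() in lowered for code in skip_codes)

def filter_entries(entries):
    """Drop every entry whose (assumed) "id" field matches an excluded code."""
    return [entry for entry in entries if keep_entry(entry["id"])]
```

For example, `filter_entries([{"id": "1GJJ_1_A"}, {"id": "2ABC_1_A"}])` would keep only the second (made-up) entry.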
Hi, @wwang2. I've made a small fix for this in the code itself which splits the offending entry into its two domains before loading the data. You may find this easier to handle than skipping it in your training loop.
For now, feel free to update the package via pip or from source. The next time I generate the data, I will make this fix more permanent.
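For readers who want a picture of what such a split involves, here is a rough sketch that is not the package's actual fix. It cuts one record into the two domains named in the PDB remarks (residues 1-50 for LAP2-N and 111-153 for LAP2-C). The record layout (a dictionary with `"id"`, `"seq"`, and per-residue `"coords"` indexed over the full sequence) and the naming of the new IDs are assumptions made for illustration.

```python
# Residue ranges (1-indexed, inclusive) taken from the PDB REMARK lines.
DOMAINS = {"LAP2-N": (1, 50), "LAP2-C": (111, 153)}

def split_entry(entry, domains=DOMAINS):
    """Split one record into one record per domain.

    Assumes entry["seq"] and entry["coords"] both have one item per
    residue of the full sequence, so residue k lives at index k - 1.
    """
    pieces = {}
    for name, (first, last) in domains.items():
        window = slice(first - 1, last)  # convert to 0-indexed, end-exclusive
        pieces[name] = {
            "id": f"{entry['id']}_{name}",  # hypothetical naming scheme
            "seq": entry["seq"][window],
            "coords": entry["coords"][window],
        }
    return pieces
```

Each resulting piece can then be treated as an ordinary single-domain example, and the disordered linker and C-terminal tail are simply left out.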
Thank you again for pointing this out. I am excited to hear that you find the package useful. Take care!
You can see below how the two new entries replace the old one.
Wow, that was quick. Thanks!
You're welcome!
By the way, someone pointed out (#40) that I introduced an issue for loading custom datasets. I've now fixed that issue in v0.7.3.
Hi @jonathanking
First of all, thanks for open-sourcing this cool dataset. I have been using it to benchmark my model.
I just want to raise an issue about a seemingly problematic structure in the CASP12 data in sidechainnet:
When I try to visualize this structure, it looks like this:
So it seems to have proteins overlaid on top of each other.
Additionally, this protein leads to a pretty large loss and large (sometimes NaN) gradients for my model.
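As an aside, one generic way to find which entries cause this kind of blow-up is to track the loss per entry and set aside any that are not finite. The sketch below is framework-agnostic plain Python; the function name and the dictionary of per-entry losses are assumptions for illustration and are not tied to any particular model or to the package.

```python
import math

def finite_mean(losses_by_id):
    """Average the finite per-entry losses and list the entries skipped.

    `losses_by_id` is an assumed mapping from entry ID to a float loss.
    """
    kept, skipped = [], []
    for entry_id, loss in losses_by_id.items():
        if math.isfinite(loss):
            kept.append(loss)
        else:
            skipped.append(entry_id)  # NaN or infinite loss
    mean = sum(kept) / len(kept) if kept else float("nan")
    return mean, skipped
```

The list of skipped IDs gives a short set of structures worth visualizing by hand.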
Thanks