moltimate / moltimate-backend

A protein active site alignment tool
GNU General Public License v2.0

Errors loading some motifs from CSA #57

Closed by jmiller656 5 years ago

jmiller656 commented 5 years ago

While loading data from the CSA, we get some error cases. A lot of them look like this: image

Here's another error that was found: image

steplica commented 5 years ago

@jmiller656 I noticed this when we were collecting the list of proteins with no CB atom. The two that error out are 4v40 and 4v4e. I'll look into why this happens.

steplica commented 5 years ago

At this point, we only fail for the two PDB IDs I mentioned above. I did some digging and learned that these two entries do exist in the PDB, but they cannot be downloaded as PDB files (hence the failure when we try to fetch their PDB files based on a URL pattern). Both can be downloaded in mmCIF format, though.

4v40: https://www.rcsb.org/structure/4V40
4v4e: https://www.rcsb.org/structure/4V4e

@jmiller656 @blackpan2 Do either of you think it's worth the time to add fallback logic that tries to grab the protein structures in a different file format? Downloading is handled by BioJava, but I'm not sure how much control it gives us over which file type we get from the RCSB PDB.
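
For illustration only, here is a minimal sketch of the kind of fallback being discussed: try the PDB file first, then the mmCIF file. It is written in Python rather than against BioJava, and it does not reflect how the backend actually downloads structures. The `files.rcsb.org/download` URL pattern and every function name below are assumptions made for this sketch.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

# Assumed URL pattern for the RCSB file download service.
RCSB_DOWNLOAD = "https://files.rcsb.org/download/{pdb_id}.{ext}"


def candidate_urls(pdb_id, formats=("pdb", "cif")):
    """Return download URLs for pdb_id, in order of preference."""
    return [RCSB_DOWNLOAD.format(pdb_id=pdb_id.upper(), ext=ext) for ext in formats]


def http_fetch(url):
    """Download url and return its body as bytes, or None on an HTTP error."""
    try:
        with urlopen(url, timeout=30) as response:
            return response.read()
    except HTTPError:
        return None


def fetch_structure(pdb_id, fetch=http_fetch):
    """Try each format in turn; return (url, data) for the first that succeeds."""
    for url in candidate_urls(pdb_id):
        data = fetch(url)
        if data is not None:
            return url, data
    raise LookupError(f"no downloadable structure file for {pdb_id}")
```

The `fetch` parameter is injectable so the fallback order can be exercised without network access, for example with a stub that only "serves" `.cif` URLs.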

jmiller656 commented 5 years ago

If it's only for those two structures, then it might be something we could mark as a known bug and suggest for fixing later.

steplica commented 5 years ago

I figured out the reason they aren't available as PDB files. These two structures are too large to be stored in the PDB file format.
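
As a rough illustration of "too large": the legacy PDB format uses fixed-width columns, with a five-column atom serial number and a single-character chain ID. The limits below are the commonly cited consequences of those column widths. The function name and the counts in the usage example are hypothetical and are not the real figures for 4v40 or 4v4e.

```python
# Commonly cited limits of the fixed-width legacy PDB format.
MAX_ATOM_SERIAL = 99_999  # atom serial number field is 5 columns wide
MAX_CHAINS = 62           # one-character chain ID: A-Z, a-z, 0-9


def fits_pdb_format(atom_count, chain_count):
    """Return True if a structure of this size can be written as one PDB file."""
    return atom_count <= MAX_ATOM_SERIAL and chain_count <= MAX_CHAINS


# Hypothetical sizes, for illustration only.
print(fits_pdb_format(atom_count=4_500, chain_count=2))
print(fits_pdb_format(atom_count=150_000, chain_count=80))
```

A structure that fails either check has to be distributed in another format, such as mmCIF.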

@jmiller656 @blackpan2 Should we ask our sponsors what to do in this case? It's only 2 of the 1200 motifs we try to create, but those two might be important. The structures would also be insanely huge to store and compare against.

blackpan2 commented 5 years ago

@steplica can you confirm that the motif itself is also large? We aren't storing the entire protein, so even if the protein is large the motif could be smaller and a manageable size.

It does raise the question of how our planned augmented PDB file would work for larger proteins.

I'm OK with marking these as a known bug and documenting the concerns associated with them. We can let the sponsors know, but I don't think this should be a focus, since it would only add two motifs (or the small % they represent) to the system.

@jmiller656 can you update the issue to represent this specific issue? Or we can close this and open a new one and mark it as a P2 known bug.

steplica commented 5 years ago

The solution to this problem, which also satisfies our sponsors, is to start pulling mmCIF files instead of PDB files.
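
To show why mmCIF avoids the size problem, here is a small sketch that reads the `_atom_site` table from mmCIF text. Values are whitespace-separated tokens rather than fixed-width columns, so atom serial numbers above 99,999 and multi-character chain IDs are representable. This is a simplified reader that assumes no quoted values containing spaces. The sample records are made up, and in practice a library such as BioJava would do the parsing.

```python
def parse_atom_site(cif_text):
    """Return one dict per atom from the _atom_site loop of an mmCIF document."""
    columns, atoms, in_loop = [], [], False
    for line in cif_text.splitlines():
        line = line.strip()
        if line.startswith("_atom_site."):
            # Column declaration, e.g. "_atom_site.label_atom_id"
            columns.append(line.split(".", 1)[1])
            in_loop = True
        elif in_loop:
            if not line or line.startswith(("#", "_", "loop_")):
                break  # end of the atom_site table
            atoms.append(dict(zip(columns, line.split())))
    return atoms


# Hypothetical records: a serial above 99,999 and a two-letter chain ID.
SAMPLE = """\
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.auth_asym_id
_atom_site.auth_seq_id
ATOM 100001 CA HIS AA 57
ATOM 100002 CB HIS AA 57
#
"""

atoms = parse_atom_site(SAMPLE)
```

Each entry in `atoms` maps column names to string values, so a motif's residues could be selected by chain and sequence number without loading the rest of the structure into a fixed-width representation.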

blackpan2 commented 5 years ago

@jmiller656 do you have an update on this? I know you wanted to wrap up this work today, for better or worse. Can you give a list of the PDB IDs that still aren't working?

jmiller656 commented 5 years ago

This is fixed, thanks @steplica