jonathanking / sidechainnet

An all-atom protein structure dataset for machine learning.

BSD 3-Clause "New" or "Revised" License

Dataset issues #18

Closed hypnopump closed 3 years ago

hypnopump commented 3 years ago

Hi there! I'm opening my second issue here, but first of all I want to congratulate you on this amazing work, which I am using for some projects. It has helped me a lot. Kudos for the clean interface and simple+intuitive package functions.

I have detected what appear to be substantial anomalies in the data, mainly consisting of:

Trimming of the original protein sequences: The provided sequence is just a fraction of the existing one in the PDB
- example: for sequence 1J6U_d1j6ua3 from the CASP7 data, the sidechainnet dataloader returns a sequence of 207 AAs (RLHYFRDTLKREKKEEFAVTGTDGKTTTTAMVAHVLKHLRKSPTVFLGGIMDSLEHGNYEKGNGPVVYELDESEEFFSEFSPNYLIITNARGDHLENYGNSLTRYRSAFEKISRNTDLVVTFAEDELTSHLGDVTFGVKKGTYTLEMRSASRAEQKAMVEKNGKRYLELKLKVPGFHNVLNALAVIALFDSLGYDLAPVLEALEEFR____________) while the original sequence is 446 AAs long ( https://www.rcsb.org/3d-view/1J6U ).
Missing coordinates for particular atoms: some atoms in the sidechain do not have assigned coordinates (set to 0); however, the rest of the sidechain is correctly positioned. The coordinates are present in the .pdb file from the RCSB PDB.
- example: for the same sequence (1J6U_d1j6ua3 from the CASP7 data), the M in position 30 of the provided chain has the following coordinates (which match the ones present in the RCSB PDB for residue number 119, except for the sulfur atom):
```
[239.3010,  15.4910, 197.5680],
[239.1060,  16.5650, 196.6000],
[240.0840,  16.3990, 195.4110],
[239.6540,  16.4340, 194.2590],
[239.2900,  17.9100, 197.2700],
[238.1530,  18.2990, 198.1030],
[  0.0000,   0.0000,   0.0000],
[236.6680,  20.4360, 199.4060],
```
  In the same protein, 1J6U_d1j6ua3 there are 4 more missing atoms: the ones in positions 406, 1177 and 1261 from the batch.crds[i] for that protein.

This is an example of a protein with missing atoms whose coordinates can be found in the PDB, but there are a few more in the same situation. Here are a few of them:

1VM0_d1vm0b-
- seq provided by sidechainnet: KNRIQVSNTKKPLFFYVNLAKRYMQQYNDVELSALGMAIATVVTVTEILKNNGFAVEKKIMTSIVDIKDDARGRPVQKAKIEITLVKSEKFDELMAAANEEKE and length is 103
- real seq and real length: SEEITDGVNNMNLATDSQKKNRIQVSNTKKPLFFYVNLAKRYMQQYNDVELSALGMAIATVVTVTEILKNNGFAVEKKIMTSIVDIKDDARGRPVQKAKIEITLVKSEKFDELMAAANEEKEDAETQVQN and length is 130
- Missing positions in batch.crds[i]: 206, 307, 487, ...
2G39_d2g39a1

I have found some of the missing atoms to correspond to selenocysteines, if that's of any help.

Is this issue familiar? Do you have some thoughts on how to solve it? Is it going to take a lot of time? If so, I could help you with it.

hypnopump commented 3 years ago

Here's the general process I followed to check for outliers:

Load a protein
Create a distance matrix of all the points that would correspond to that protein, without the extra 0s (just masking the batch.crds[i] with the output of this function: https://github.com/lucidrains/alphafold2/blob/main/alphafold2_pytorch/utils.py#L219 )

Check the 15 closest neighbours for every point and the distances to those neighbours:

import torch
dist_mat = torch.cdist(points, points, p=2)
top_15 = torch.topk(-dist_mat, k=15).values  # negated distances to the 15 nearest points (self included)
furthest = torch.amax(-1 * top_15, dim=-1)  # per point: distance to its 15th nearest point
print("Indexes with suspicious coords", (furthest > 25).nonzero())

Also, plotting the distance matrix (plt.imshow(dist_mat.cpu().numpy())) helps to see what's going on.
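
The same check can be sketched without PyTorch. This is a minimal, dependency-free illustration rather than the code used above: the function name `suspicious_indices` and the toy coordinates are made up for the example, and the default `k` and `threshold` simply mirror the 15 neighbours and the cutoff of 25 from the snippet.

```python
import math

def suspicious_indices(points, k=15, threshold=25.0):
    """Indices of points whose k-th nearest point (self included) is farther than threshold."""
    flagged = []
    for i, p in enumerate(points):
        # Sorted distances from p to every point; the first entry is 0 (p itself).
        dists = sorted(math.dist(p, q) for q in points)
        kth = dists[min(k, len(dists)) - 1]
        if kth > threshold:
            flagged.append(i)
    return flagged

# Toy data (hypothetical): a compact cluster plus one unfilled (0, 0, 0) entry.
cluster = [(239.0 + 0.5 * i, 16.0, 197.0) for i in range(6)]
points = cluster + [(0.0, 0.0, 0.0)]
print(suspicious_indices(points, k=3))
```

Because every real atom has several neighbours within a few units, an atom left at the origin stands out as the only point whose nearest neighbours are all far away.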

jonathanking commented 3 years ago

Thanks for the kind words! I'm also really excited to hear that you and others are working on projects using SidechainNet. Please never hesitate to reach out if there's anything I can do to help, or if there's anything I can add to/change about SidechainNet to make it more useful.

Thank you, also, for reaching out with your concerns. I'm happy to discuss them. The issues you describe are known to me and they come from two sources.

Trimming Sequences
- 1J6U_d1j6ua3 and similarly formatted ProteinNet IDs (pnids) are ASTRAL entries.
- See astral_data.txt for a summary of all ASTRAL data that I've downloaded and included with SidechainNet.
- If you search for the ASTRAL identifier portion of the pnid (d1j6ua3), you'll see on line 179465 of astral_data.txt that the entry you are referencing only refers to residues 89-295 of chain A of PDB 1j6u. That should explain the differences you are observing.
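
The residue range quoted here is consistent with the numbers in the original report. The arithmetic can be sketched as follows; the helper names are hypothetical, and the sketch assumes the PDB residue numbers in the domain are consecutive (no gaps or insertion codes), which does not hold for every entry.

```python
def domain_length(start_res, end_res):
    """Number of residues in a domain covering start_res..end_res inclusive."""
    return end_res - start_res + 1

def to_pdb_residue(start_res, index):
    """Map a 0-based index within the domain to a PDB residue number."""
    return start_res + index

print(domain_length(89, 295))   # 207, the sequence length reported for 1J6U_d1j6ua3
print(to_pdb_residue(89, 30))   # 119, the residue matched to position 30 in the report
```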
Missing coordinates
- I hope you and others would agree that handling non-standard residues is a tricky issue. They may have non-standard 3-letter codes, modified number of atoms, and modified atom names.
- The issue you describe comes from the fact that the residue is indeed modified (methionine -> selenomethionine). The way I measure atomic coordinates is to look up the expected atom names for each residue (methionine has atom names ['CB', 'CG', 'SD', 'CE']). If a particular atom isn't present, then its value is not recorded.
- In this case, there is no SD atom, so I do not record it. There is, however, an SE atom, which I think is the one you are finding missing.
- By only using the default atom select names, I made the choice to not include atoms for the seleniums (or other odd substitutions). Whether this is the best choice, I'm not sure.
- You may also find missing chunks of atoms when I observed non-standard residues that I did not explicitly allow (see this list for examples of residues that seemed "close enough" to the real residue that I tried to include them, even if they have some issues like the one you describe in part 2). You'll notice that there are many modified residues which are probably modified to the point that including them/renaming them as their standard versions is risky at best.
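
To make the name-based lookup concrete, here is a minimal sketch of the behaviour described above. It is not SidechainNet's actual code: `SIDECHAIN_ATOMS`, `extract_sidechain`, and the coordinates are assumptions for illustration, and only the methionine atom names come from this thread.

```python
# Expected side-chain atom names; the methionine entry is the one quoted above.
SIDECHAIN_ATOMS = {"MET": ["CB", "CG", "SD", "CE"]}
ZERO = (0.0, 0.0, 0.0)

def extract_sidechain(resname, atoms):
    """Return coordinates in the expected atom order, zero-filling absent names."""
    return [atoms.get(name, ZERO) for name in SIDECHAIN_ATOMS[resname]]

# A selenomethionine read as MET: it carries SE where methionine has SD.
# Coordinates are placeholders, not values from the PDB.
selenomet = {
    "CB": (1.0, 0.0, 0.0),
    "CG": (2.0, 0.0, 0.0),
    "SE": (3.0, 0.0, 0.0),
    "CE": (4.0, 0.0, 0.0),
}
coords = extract_sidechain("MET", selenomet)
print(coords[2])  # the SD slot stays at (0.0, 0.0, 0.0) because only SE is present
```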

Happy to continue chatting about this, but I hope this helps! :) (PS I'll be out of the office for a few days, so I may be slow to respond to your next message).

Take care! Jonathan

hypnopump commented 3 years ago

Oh yes! I've been using sidechainnet for different projects and it's likely that some things will start to come out... There's also a project I've been working on with sidechainnet as a data source that could be an improvement for sidechainnet as well (I'm trying to write up a preprint, then I'll come back here and share the details!).

Now, to the points:

Is there a way to load only complete proteins instead of protein domains? If so, that would be of tremendous help. If there isn't, do you think there might be a straightforward way to do it? (by looking at the ASTRAL codes and seeing if there's no code for a sequence, for example?)
Hmm, my opinion is that both selenocysteine and selenomethionine could be parsed by including the selenium atom in the sulfur position, since both elements belong to the same group and are similar atoms (I think that would be closer to reality than not including them at all). However, I'm not a biochemist, so it's just my personal intuition. For the other modified residues... it seems too complex to handle them separately, so I agree with the current substitution (although I think there could be a way of including that information in the dataset, like an annotation or something similar (maybe include the 3-letter code as well for every sequence?)).
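
The selenium-for-sulfur suggestion could be sketched as an atom renaming step applied before the standard atom names are looked up. Everything here is an assumption for illustration: the `ATOM_ALIASES` table and `rename_atoms` helper are hypothetical, and the sketch assumes the PDB residue codes MSE and SEC with a selenium atom named SE.

```python
# Hypothetical alias table: selenium atom name -> sulfur atom name of the parent residue.
ATOM_ALIASES = {
    "MSE": {"SE": "SD"},  # selenomethionine -> methionine sulfur slot
    "SEC": {"SE": "SG"},  # selenocysteine -> cysteine sulfur slot
}

def rename_atoms(resname, atoms):
    """Return a copy of atoms with selenium renamed to the matching sulfur name."""
    aliases = ATOM_ALIASES.get(resname, {})
    return {aliases.get(name, name): xyz for name, xyz in atoms.items()}

# Placeholder coordinates, not values from the PDB.
renamed = rename_atoms("MSE", {"CG": (2.0, 0.0, 0.0), "SE": (3.0, 0.0, 0.0)})
print(sorted(renamed))  # ['CG', 'SD']
```

Residues without an entry in the table pass through unchanged, so standard residues are unaffected.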

Overall, thanks for the quick response and valuable comments! I will adapt the corresponding functions in order to deal with these exceptions!

Sincerely, Eric

jonathanking commented 3 years ago

Hey Eric,

I hope you've had a nice week. Here are some followups:

1.

Unfortunately, I'm not so sure this is reasonable to implement. I tried to make SidechainNet a faithful extension of ProteinNet. ProteinNet considers these ASTRAL entries as separate entries and is designed to specifically include only the referenced sequence positions. Mohammad AlQuraishi was very cautious and methodical in his attempt to cluster the entries in ProteinNet, and I would be wary of making any sequence-based modifications to SidechainNet. This is how I understand it, of course, and I could be misinterpreting AlQuraishi's original work.

I don't really see how I can add the data for these entries without maintaining a separate set of files for SidechainNet. I could do this as a one-off modification though and send you the files, if you're interested? Or, I could open a branch that shows the necessary changes and you could generate the files yourself?

An alternative (which I think may be what you were thinking of) is to make the default version of SidechainNet include the complete sequences, and when loading have the ASTRAL domains parsed before the data is returned to the user. This, I'm afraid, isn't a great option either because then the data itself (the pickled dictionary files) will have "dirty" data that cannot simply be loaded and inspected as-is. In any case, the more I think about it, the more I don't see this extension as a faithful extension of ProteinNet.

Finally, I'm not sure that it would be wise to utilize the ProteinNet dataset clustering/splits that were computed using the ASTRAL domain sequences with the intention of modifying the sequences used for the original clustering (I assume they were included in the clustering). We're talking about modifying up to 25% of the dataset. I think you are at risk of losing one of the primary benefits of ProteinNet. If it's really important, and if you think you are sure that making such a large change to the dataset is worth it, I'd definitely be interested in talking about it and making something that helps you out!

What are your thoughts?

2.

I think you're right. Thanks for bringing this to my attention! I agree that it's probably better to include the selenium atoms when I make selenomethionine/selenocysteine conversions. I'll update this issue when that gets sorted!

I do wonder, though: In all of my work, I have simply used the 0-vectored entries as a coordinate tensor mask. Does that not suffice for your workflow?
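
As a rough illustration of using zero vectors as a mask (a sketch only; `coordinate_mask` is a hypothetical helper, and the coordinates are taken from the listing earlier in the thread):

```python
def coordinate_mask(crds):
    """True where an atom has coordinates, False for all-zero placeholder rows."""
    return [any(c != 0.0 for c in xyz) for xyz in crds]

crds = [
    [238.1530, 18.2990, 198.1030],
    [0.0, 0.0, 0.0],
    [236.6680, 20.4360, 199.4060],
]
mask = coordinate_mask(crds)
print(mask)  # [True, False, True]
```

One caveat: a mask built this way cannot distinguish an atom that is missing from the structure from a zero row that is only padding, which is why the outlier check earlier in the thread first removes "the extra 0s" with a separate function.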

Finally, to your last point, I have added 3-letter residue codes to the dataset in another branch, and I'll certainly let you know when I regenerate the data to include it! :smiley:

Best, Jonathan

hypnopump commented 3 years ago

Hi there! Thanks for your insightful comments. Here are some workarounds / custom solutions I have implemented so far:

Point 1: to get only the entries for whole proteins, here's the custom function (it's just an if statement based on the format of the ids): https://github.com/hypnopump/alphafold2/blob/main/alphafold2_pytorch/utils.py#L220
Point 2: in order to get the mask for the atoms whose coordinates are present, I have used the following line: https://github.com/hypnopump/geometric-vector-perceptron/blob/main/examples/data_handler.py#L607 . Also, thanks for adding the 3-letter code!
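
For Point 1, an ID-format check might look like the sketch below. This is not the linked function: `is_astral_domain` is a hypothetical helper, and it assumes that ASTRAL-derived IDs have the shape seen in this thread (PDB ID, underscore, then an ASTRAL identifier beginning with "d" plus the lower-cased PDB ID) and that any `#`-separated prefix can be ignored. The whole-chain ID in the example is made up to show the contrast.

```python
def is_astral_domain(pnid):
    """Heuristic: True if the ID looks like '<PDB>_d<pdb>...' (an ASTRAL domain entry)."""
    core = pnid.split("#")[-1]            # drop any 'prefix#' part, if present
    pdb_id, _, rest = core.partition("_")
    return rest.lower().startswith("d" + pdb_id.lower())

ids = ["1J6U_d1j6ua3", "1VM0_d1vm0b-", "1ABC_1_A"]  # last ID is a made-up whole-chain example
whole_chains = [pnid for pnid in ids if not is_astral_domain(pnid)]
print(whole_chains)  # ['1ABC_1_A']
```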

I think these solutions can work for now. Let me know if you find those useful.

Sincerely, Eric

jonathanking commented 3 years ago

I’m glad you found some solutions that work for you! I’d still be concerned about modifying the sequences by including whole proteins in lieu of their ASTRAL domains as specified by ProteinNet.

Let me know if there is anything else I can do to help.

hypnopump commented 3 years ago

Oh I think the issue has been sorted out, thanks! I'm closing this for now... will ping back once I have some results from the projects I mentioned earlier regarding the use of sidechainnet!

Eric

hypnopump commented 3 years ago

Hi there! I'm back! I would like to share with you a project that has been built on top of / using Sidechainnet: MP-NeRF: A Massively Parallel Method for Accelerating Protein Structure Reconstruction from Internal Coordinates: basically a faster version (1000x) of the NeRF algorithm for 3D reconstruction. The project wouldn't have been possible without sidechainnet, so I would like to thank you for building such an awesome platform from which we've been able to run our research!
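
For readers unfamiliar with NeRF (Natural Extension Reference Frame), the sketch below shows the textbook step that places one atom from three previous atoms plus a bond length, bond angle, and torsion. It is the plain serial step, not MP-NeRF's implementation; the function names are hypothetical, and angles are assumed to be in radians.

```python
import math

def _sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def _unit(u):
    norm = math.sqrt(sum(a * a for a in u))
    return tuple(a / norm for a in u)

def place_atom(a, b, c, bond_length, bond_angle, torsion):
    """Place atom d given atoms a, b, c, |c-d|, angle b-c-d, and torsion a-b-c-d."""
    bc = _unit(_sub(c, b))
    n = _unit(_cross(_sub(b, a), bc))   # normal of the a-b-c plane
    m = _cross(n, bc)                   # completes the local frame (bc, m, n)
    local = (-bond_length * math.cos(bond_angle),
             bond_length * math.sin(bond_angle) * math.cos(torsion),
             bond_length * math.sin(bond_angle) * math.sin(torsion))
    return tuple(c[i] + local[0] * bc[i] + local[1] * m[i] + local[2] * n[i]
                 for i in range(3))

# Toy geometry: unit bond, 90 degree bond angle, torsion 0 (new atom cis to a).
d = place_atom((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
               1.0, math.pi / 2, 0.0)
print([round(x, 6) for x in d])
```

In this toy geometry the new atom lands at approximately (1, 1, 0), on the same side of the b-c bond as atom a. Each placement depends on the three atoms before it, which is what makes the serial version slow for long chains.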

jonathanking commented 3 years ago

Hey Eric, wow, those runtimes look really incredible! NeRF has been a huge bottleneck in my own research projects, so I look forward to learning about your approach and potentially utilizing it! Thanks for sharing.

I'm curious about a couple of things if you have a moment to share. What are your thoughts about protein-level parallelism (parallelizing standard NeRF over multiple proteins) as opposed to MP-NeRF? That had been my approach in the past. I know it's fundamentally different, but I am curious how MP-NeRF would compare in a deep learning training context. Also, I think I noticed that there were some other hard-coded values you built into your code. What kind of values did you need on top of the ones from SidechainNet to complete MP-NeRF?

Thanks again, and congrats on getting the preprint and code out!

hypnopump commented 3 years ago

Hmm, in deep learning... I'm currently using it for this application. The idea is that it's so fast that it doesn't matter if it's done in a for loop (an extra batch dimension could be added to MP-NeRF to accomplish this level of parallelism as well, but I just thought it wasn't worth the time since I have other projects). Also, the networks used to analyze proteins are usually so big that the batch size is limited, and people normally use gradient accumulation, so it's not a big deal really.

WRT the hard-coded values, I only needed the sidechainnet ones to get the basic version of MP-NeRF; the other hard-coded values in the library are for extended applications such as getting a chain conformation only from backbone internal coordinates, etc.

jonathanking commented 3 years ago

I see, thanks for the follow-up explanation!

If I was interested in utilizing mp-nerf and I had a vector of angles that I would like to convert to coordinates, what would be the entry point in your code? Do you have a higher-level reconstruction function?