Closed hypnopump closed 3 years ago
Here's the general process I used to check for outliers:
Take the points from batch.crds[i] (with the output of this function: https://github.com/lucidrains/alphafold2/blob/main/alphafold2_pytorch/utils.py#L219 )
dist_mat = torch.cdist(points, points, p=2)
top_15 = torch.topk(-dist_mat, k=15).values
furthest = torch.amax(-1*top_15, dim=-1)
print("Indexes with suspicious coords", (furthest>25).nonzero() )
Plotting the distance matrix with plt.imshow(dist_mat.cpu().numpy()) is of help to see what's going on.

Thanks for the kind words! I'm also really excited to hear that you and others are working on projects using SidechainNet. Please never hesitate to reach out if there's anything I can do to help, or if there's anything I can add to/change about SidechainNet to make it more useful.
Thank you, also, for reaching out with your concerns. I'm happy to discuss them. The issues you describe are known to me and they come from two sources.
Trimming Sequences
1J6U_d1j6ua3 and similarly formatted ProteinNet IDs (pnids) are ASTRAL entries. For this ASTRAL ID (d1j6ua3), you'll see on line 179465 of astral_data.txt that the entry you are referencing only refers to residues 89-295 of chain A of PDB 1j6u. That range spans 207 residues (295 - 89 + 1), which matches the sequence length you are seeing and should explain the differences you are observing.

Missing coordinates
For methionine, the side-chain atoms are ['CB', 'CG', 'SD', 'CE']. If a particular atom isn't present, then its value is not recorded. The residue you mention has no SD atom, so I do not record it. There is, however, an SE atom, which I think is the one you are finding missing.

Happy to continue chatting about this, but I hope this helps! :) (PS: I'll be out of the office for a few days, so I may be slow to respond to your next message.)
Take care! Jonathan
Oh yes! I've been using sidechainnet for different projects and it's likely that some things will start to come out... There's also a project I've been working on with sidechainnet as a data source that could be an improvement for sidechainnet as well (I'm trying to write up a preprint, then I'll come back here and share the details!).
Now, to the points:
Is there a way to load only complete proteins instead of protein domains? If so, that would be of tremendous help. If there isn't, do you think there might be a straightforward way to do it? (by looking at the ASTRAL codes and seeing if there's no code for a sequence, for example?)
Hmm, my opinion is that both selenocysteine and selenomethionine could be parsed by including the selenium atom in the sulfur position, since sulfur and selenium are in the same group and are similar atoms (I think that would be closer to reality than not including them at all). However, I'm not a biochemist, so it's just my personal intuition. For the other modified residues... it seems too complex to handle them separately, so I agree with the current substitution (although I think there could be a way of including that information in the dataset, like an annotation or something like that (maybe include the 3-letter code as well for every sequence?)).
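To make the suggestion concrete, here is a minimal sketch of storing the selenium atom under the sulfur atom's name. It is not SidechainNet's code: the `SELENO_RESIDUES` table, the `substitute_selenium` helper and the coordinates are hypothetical, and it assumes each residue is available as a dict of atom name to (x, y, z). The residue codes MSE and SEC and the atom names SE, SD and SG follow the usual PDB naming.

```python
# Hypothetical mapping: selenium residue -> (standard residue,
# selenium atom name, sulfur atom name it stands in for).
# MSE = selenomethionine, SEC = selenocysteine (PDB residue codes).
SELENO_RESIDUES = {
    "MSE": ("MET", "SE", "SD"),
    "SEC": ("CYS", "SE", "SG"),
}


def substitute_selenium(resname, atoms):
    """Return (standard residue name, atoms) with Se stored under the S name.

    `atoms` maps atom names to (x, y, z) tuples for a single residue.
    Residues not listed in SELENO_RESIDUES are returned unchanged.
    """
    if resname not in SELENO_RESIDUES:
        return resname, dict(atoms)
    standard_name, se_name, s_name = SELENO_RESIDUES[resname]
    converted = {}
    for name, xyz in atoms.items():
        # Only the selenium atom is renamed; every other atom is kept as-is.
        converted[s_name if name == se_name else name] = xyz
    return standard_name, converted


# Made-up side-chain coordinates for one selenomethionine residue.
mse_atoms = {
    "CB": (1.0, 0.0, 0.0),
    "CG": (2.0, 0.5, 0.0),
    "SE": (3.5, 0.2, 0.1),
    "CE": (4.2, 1.9, 0.3),
}
new_name, new_atoms = substitute_selenium("MSE", mse_atoms)
```

After the call, `new_name` is "MET" and the selenium coordinates sit under "SD", so a parser that looks for ['CB', 'CG', 'SD', 'CE'] would find all four atoms.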
Overall, thanks for the quick response and valuable comments! I will adapt the corresponding functions in order to deal with these exceptions!
Sincerely, Eric
Hey Eric,
I hope you've had a nice week. Here are some followups:
Unfortunately, I'm not so sure if this is reasonable to implement. I tried to make SidechainNet a faithful extension of ProteinNet. ProteinNet considers these ASTRAL entries as separate entries and is designed to specifically include only the referenced sequence positions. Mohammad AlQuraishi was very cautious and methodical in his attempt to cluster the entries in ProteinNet, and I would be wary to make any sequence-based modifications to SidechainNet. This is how I understand it, of course, and I could be misinterpreting AlQuraishi's original work.
I don't really see how I can add the data for these entries without maintaining a separate set of files for SidechainNet. I could do this as a one-off modification though and send you the files, if you're interested? Or, I could open a branch that shows the necessary changes and you could generate the files yourself?
An alternative (which I think may be what you were thinking of) is to make the default version of SidechainNet include the complete sequences, and when loading have the ASTRAL domains parsed before the data is returned to the user. This, I'm afraid, isn't a great option either because then the data itself (the pickled dictionary files) will have "dirty" data that cannot simply be loaded and inspected as-is. In any case, the more I think about it, the more I don't see this extension as a faithful extension of ProteinNet.
Finally, I'm not sure that it would be wise to utilize the ProteinNet dataset clustering/splits that were computed using the ASTRAL domain sequences with the intention of modifying the sequences used for the original clustering (I assume they were included in the clustering). We're talking about modifying up to 25% of the dataset. I think you are at risk of losing one of the primary benefits of ProteinNet. If it's really important, and if you think you are sure that making such a large change to the dataset is worth it, I'd definitely be interested in talking about it and making something that helps you out!
What are your thoughts?
I think you're right. Thanks for bringing this to my attention! I agree that it's probably better to include the selenium atoms when I make selenomethionine/selenocysteine conversions. I'll update this issue when that gets sorted!
I do wonder, though: In all of my work, I have simply used the 0-vectored entries as a coordinate tensor mask. Does that not suffice for your workflow?
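As an illustration of that masking idea, here is a minimal pure-Python sketch. The `coordinate_mask` helper and the sample coordinates are made up for the example; with a tensor, an equivalent would be something like `(crds != 0).any(dim=-1)`.

```python
def coordinate_mask(crds):
    """Return one boolean per atom: True if any coordinate is non-zero.

    `crds` is a sequence of (x, y, z) rows; an all-zero row is treated
    as "no coordinates recorded".
    """
    return [any(c != 0.0 for c in atom) for atom in crds]


# Made-up coordinates: the second atom is missing (zero vector).
crds = [
    (12.1, 3.4, -7.8),
    (0.0, 0.0, 0.0),
    (13.0, 4.1, -6.9),
]
mask = coordinate_mask(crds)
missing = [i for i, present in enumerate(mask) if not present]
```

One caveat of this approach is that an atom located exactly at the origin would also be masked out, which is unlikely in practice but worth keeping in mind.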
Finally, to your last point, I have added 3-letter residue codes to the dataset in another branch, and I'll certainly let you know when I regenerate the data to include it! :smiley:
Best, Jonathan
Hi there! Thanks for your insightful comments. Here are some workaround / custom solutions I have implemented so far:
Point 1: for only getting the entries for whole proteins, here's the custom function (it's just an if statement based on the format of the ids): https://github.com/hypnopump/alphafold2/blob/main/alphafold2_pytorch/utils.py#L220
Point 2: in order to get the mask for the atoms whose coordinates are present, I have used the following line: https://github.com/hypnopump/geometric-vector-perceptron/blob/main/examples/data_handler.py#L607 . Also, thanks for adding the 3-letter code!
I think these solutions can work for now. Let me know if you find those useful.
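For readers who don't follow the link, an ID-based check along the lines of Point 1 could look like the sketch below. It is not the linked function. It assumes ASTRAL entries have the form `<PDB_ID>_<ASTRAL_ID>`, with the ASTRAL ID being "d" + lower-case PDB code + chain character + domain character (as in 1J6U_d1j6ua3 and 1VM0_d1vm0b-), and that any prefix ending in "#" can be ignored. The `is_astral_entry` name and the ID `1ABC_1_A` are hypothetical.

```python
import re

# Assumed ID layout: "<PDB_ID>_<ASTRAL_ID>", e.g. "1J6U_d1j6ua3".
ASTRAL_PNID = re.compile(r"^([0-9A-Za-z]{4})_(d[0-9a-z]{4}[0-9A-Za-z_.-]{2})$")


def is_astral_entry(pnid):
    """Return True if the ProteinNet ID looks like an ASTRAL domain entry."""
    # Drop an optional "<prefix>#" part, if present (assumption).
    core = pnid.split("#")[-1]
    match = ASTRAL_PNID.match(core)
    if match is None:
        return False
    pdb_id, astral_id = match.groups()
    # The ASTRAL ID should embed the same PDB code in lower case.
    return astral_id[1:5] == pdb_id.lower()


# The first three IDs come from this thread; "1ABC_1_A" is a made-up
# example of an ID that does not follow the ASTRAL pattern.
pnids = ["1J6U_d1j6ua3", "1VM0_d1vm0b-", "2G39_d2g39a1", "1ABC_1_A"]
non_astral = [p for p in pnids if not is_astral_entry(p)]
```

Note that this only filters entries by their ID; it does not replace a domain with its full chain.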
Sincerely, Eric
I’m glad you found some solutions that work for you! I’d still be concerned about modifying the sequences by including whole proteins in lieu of their ASTRAL domains as specified by ProteinNet.
Let me know if there is anything else I can do to help.
Oh I think the issue has been sorted out, thanks! I'm closing this for now... will ping back once I have some results from the projects I mentioned earlier regarding the use of sidechainnet!
Eric
Hi there! I'm back! I would like to share with you a project that has been built on top of Sidechainnet: "MP-NeRF: A Massively Parallel Method for Accelerating Protein Structure Reconstruction from Internal Coordinates", basically a faster version (1000x) of the NeRF algorithm for 3D reconstruction. The project wouldn't have been possible without sidechainnet, so I would like to thank you for building such an awesome platform from which we've been able to run our research!
Hey Eric, wow, those runtimes look really incredible! NeRF has been a huge bottleneck in my own research projects, so I look forward to learning about your approach and potentially utilizing it! Thanks for sharing.
I'm curious about a couple of things if you have a moment to share. What are your thoughts about protein-level parallelism (parallelizing standard NeRF over multiple proteins) as opposed to MP-NeRF? That had been my approach in the past. I know it's fundamentally different, but I am curious how MP-NeRF would compare in a deep learning training context. Also, I think I noticed that there were some other hard-coded values you built into your code. What kind of values did you need on top of the ones from SidechainNet to complete MP-NeRF?
Thanks again, and congrats on getting the preprint and code out!
Hmm, in deep learning... I'm currently using it for this application. The idea is that it's so fast that it doesn't matter if it's done in a for loop over proteins (an extra batch dimension could be added to mp-nerf to accomplish this level of parallelism as well, but I just thought it wasn't worth the time since I have other projects). Also, the networks used to analyze proteins are usually so big that the batch size is limited, and people normally use gradient accumulation, so it's not a big deal really.
WRT the hard-coded values, I only needed the sidechainnet ones to get the basic version of MP-NeRF; the other hard-coded values in the library are for extended applications such as getting a chain conformation only from backbone internal coordinates, etc.
I see, thanks for the follow-up explanation!
If I was interested in utilizing mp-nerf and I had a vector of angles that I would like to convert to coordinates, what would be the entry point in your code? Do you have a higher-level reconstruction function?
Hi there! I'm opening my second issue here, but first of all I want to congratulate you on this amazing work, which I am using for some projects. It has helped me a lot. Kudos for the clean interface and simple+intuitive package functions.
I have detected what appear to be substantial anomalies in the data, mainly consisting of:
Sequence differences: for 1J6U_d1j6ua3 from the CASP7 data, the sidechainnet dataloader returns a sequence of 207 AAs (RLHYFRDTLKREKKEEFAVTGTDGKTTTTAMVAHVLKHLRKSPTVFLGGIMDSLEHGNYEKGNGPVVYELDESEEFFSEFSPNYLIITNARGDHLENYGNSLTRYRSAFEKISRNTDLVVTFAEDELTSHLGDVTFGVKKGTYTLEMRSASRAEQKAMVEKNGKRYLELKLKVPGFHNVLNALAVIALFDSLGYDLAPVLEALEEFR____________) while the original sequence is 446 AAs long ( https://www.rcsb.org/3d-view/1J6U ).
Missing atoms whose coordinates are present in the .pdb file from the RCSB PDB: for example (1J6U_d1j6ua3 from the CASP7 data), the M in position 30 in the provided chain has the following coordinates (which match the ones present in the RCSB PDB for residue number 119, except for the sulfur atom):
In the same protein, 1J6U_d1j6ua3, there are 4 more missing atoms: the ones in positions 406, 1177 and 1261 from the batch.crds[i] for that protein.
This is an example of a protein with missing atoms whose coordinates can be found in the PDB, but there are a few more in the same situation. Here are a few of them:
1VM0_d1vm0b-
KNRIQVSNTKKPLFFYVNLAKRYMQQYNDVELSALGMAIATVVTVTEILKNNGFAVEKKIMTSIVDIKDDARGRPVQKAKIEITLVKSEKFDELMAAANEEKE and length is 103
SEEITDGVNNMNLATDSQKKNRIQVSNTKKPLFFYVNLAKRYMQQYNDVELSALGMAIATVVTVTEILKNNGFAVEKKIMTSIVDIKDDARGRPVQKAKIEITLVKSEKFDELMAAANEEKEDAETQVQN and length is 130
batch.crds[i]: 206, 307, 487, ...
2G39_d2g39a1
I have found some of the missing atoms to correspond to selenocysteines, if that's of any help.
Is this issue familiar? Do you have some thoughts on how to solve it? Is it going to take a lot of time? If so, I could help you with it.