jonathanking / sidechainnet

An all-atom protein structure dataset for machine learning.
BSD 3-Clause "New" or "Revised" License

Sidechainnet for CASP 13 to CASP 15 #57

Open harshagrawal13 opened 1 year ago

harshagrawal13 commented 1 year ago

Hi! I am trying to do masked modelling on sequence and structure data using your curated dataset. Would it be possible for you to add the data for CASP 13 to CASP 15, or to share how I can do this on my own?

Kind regards, Harsh

jonathanking commented 1 year ago

Hi Harsh,

Thanks for your interest! This is something that I would love to do (and I'm sure other users would be interested in), but it's unfortunately delayed and I don't have info on when I can add this. I'm working on adding slightly different functionality to SidechainNet at the moment.

Why? The trouble is that SidechainNet directly extends ProteinNet (and thereby uses ProteinNet's pretty sophisticated protein sequence clustering and filtering methods). Since ProteinNet does not, to my knowledge, support CASPs newer than CASP 12 (specifically the clustering info), I am prevented from adding later CASP datasets to SidechainNet for now. I must either develop the code to split the training data in the same way as AlQuraishi et al. have done, ask the authors for access to that code, or hope that the authors would be willing to generate the same kind of dataset splits for CASPs > 12 and share them.

The good news is that you can manually specify proteins for a custom SidechainNet dataset. See Section 5 of the Colab Walkthrough linked to in the README. You'd just need to define a list of train, validation, and test set proteins using the SidechainNet naming scheme, and those protein chains will be acquired and parsed into SidechainNet's data structures. For the CASP test set proteins, however, you would need to identify the RCSB PDB IDs that they correspond to, so that SidechainNet can download them correctly from the RCSB PDB.

Please let me know if you have any questions or concerns, and I'd be happy to help as much as I can.

Best, Jonathan

harshagrawal13 commented 1 year ago

Hey Jonathan, thanks for your swift reply. I really appreciate all the effort you've put into sidechainnet. It's been incredibly useful. As you mentioned, I was trying to use the create_custom function, but I'm unsure how exactly to proceed. I simply require (without any test or val splits) all ~210,000 PDB entries in the SCN format. I found this endpoint: https://data.rcsb.org/rest/v1/holdings/current/entry_ids to query all the PDB IDs. I also understood that I need to format these IDs in ProteinNet format. (I'm unsure where to query the model number and chain ID, so I was setting all of them to 1 and A by default.) When I passed a list of all PDB IDs formatted like this to the create_custom function, it threw an error: need at least one array to concatenate. I'm attaching a screenshot. Kindly let me know if I'm doing something wrong or how I should proceed.

[Attached image: Screenshot 2023-03-08 at 7 22 31 PM]

jonathanking commented 1 year ago

I'm really glad it has been helpful to you! Let's see, let me try to break this down a bit.

1

To begin (apologies if you already know this), you should be aware that SidechainNet, like many other models and datasets such as ProteinNet or even AlphaFold, treats proteins not as multi-chain entities, but rather operates on each protein chain independently. So, in SidechainNet, we use a naming scheme that includes not only the 4-character RCSB PDB ID, but also a "model number" (usually 1 is appropriate if you don't have a reason to use something else), as well as the very important chain ID.

What you're effectively doing is trying to download model 1 and chain A from all of those proteins. Model 1 probably exists for all of them, as does chain A, but neither is guaranteed.
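As a rough illustration of the naming scheme described above, here is a minimal Python sketch that builds and sanity-checks chain-level IDs. It assumes the ProteinNet-style layout `<PDB_ID>_<model_number>_<chain_ID>` (for example `1A9U_1_A`); the helper names `make_scn_id` and `is_valid_scn_id` are hypothetical and not part of SidechainNet. The check is only syntactic: it cannot tell you whether that model or chain actually exists in the RCSB PDB entry.

```python
import re

# Assumed layout: <PDB_ID>_<model_number>_<chain_ID>, e.g. "1A9U_1_A".
# This pattern is an illustration, not SidechainNet's own validation logic.
_SCN_ID = re.compile(r"[0-9][A-Za-z0-9]{3}_[0-9]+_[A-Za-z0-9]+")


def make_scn_id(pdb_id: str, model: int = 1, chain: str = "A") -> str:
    """Build a chain-level ID from a PDB ID, a model number, and a chain ID."""
    return f"{pdb_id.upper()}_{model}_{chain}"


def is_valid_scn_id(scn_id: str) -> bool:
    """Return True if the string has the assumed three-part layout."""
    return _SCN_ID.fullmatch(scn_id) is not None


if __name__ == "__main__":
    candidate = make_scn_id("1a9u")  # defaults to model 1, chain A
    print(candidate, is_valid_scn_id(candidate))
```

Defaulting to model 1 and chain A, as in the sketch, reproduces the behaviour discussed above, so any entry lacking that model or chain would still need to be handled separately.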

2

I'm not positive, but I think your code is not running on the Colab notebook because some of the IDs you've provided are not valid. To me it looks like your code doesn't bother downloading sidechainnet data for any of the items you requested (it says 0it). I tried running the Colab notebook as it is written and it works there (see below):

Downloading pre-parsed ProteinNet data (~3.5 GB compressed).
Downloading file chunks (estimated): 57257chunk [02:03, 463.84chunk/s]                        
Re-initializing validation set splits ([10, 90]).
Loading complete ProteinNet data (100% thinning) from /usr/local/lib/python3.9/dist-packages/sidechainnet/resources/proteinnet_parsed.
Raw ProteinNet files already preprocessed (/usr/local/lib/python3.9/dist-packages/sidechainnet/resources/proteinnet_parsed/training_100.pkl).
Preparing to download requested proteins via their ProteinNet IDs.
Downloading SidechainNet specific data from RSCB PDB.
141 IDs OK for parallel downloading.
  0%|          | 0/141 [00:00<?, ?it/s]DEBUG:.prody:Connecting wwPDB FTP server RCSB PDB (USA).
...
100%|██████████| 147/147 [00:09<00:00, 15.27it/s]
Finished unifying sidechain information with ProteinNet data.
0 IDs failed to combine successfully.
147 included in CASP User-specified (User-specified% thinning).
User-specified SidechainNet written to ./sidechainnet_data/custom01.pkl.
To load the data in a different format, use sidechainnet.load with the desired
options and set 'local_scn_path=./sidechainnet_data/custom01.pkl'.

If you want me to look closer at your error, can you please expand the error traceback where it says "3 Frames"?

3

I think I understand what you want, but SidechainNet doesn't have all the tools to get you there at the moment. If you can come up with a way to properly generate all of the SidechainNet-formatted IDs that you need, then my code should be able to handle that. SidechainNet marks proteins as belonging to the validation or test sets through its naming convention (e.g. 10# and FM# mean validation set number 10 and the Free Modeling test set, respectively), so if you're not planning to have SidechainNet construct validation or test sets, then you should not have any identifiers with # in them.
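To make the `#` convention concrete, the following Python sketch separates identifiers that carry a validation or test-set prefix from plain training-style identifiers. The function name `split_by_marker` and the example IDs are made up for illustration; this is not SidechainNet's own parsing code.

```python
from typing import Iterable, List, Tuple


def split_by_marker(ids: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate plain IDs from IDs that contain a '#' split marker."""
    plain: List[str] = []
    tagged: List[str] = []
    for scn_id in ids:
        # A '#' signals a validation/test prefix such as "10#" or "FM#".
        (tagged if "#" in scn_id else plain).append(scn_id)
    return plain, tagged


if __name__ == "__main__":
    # Hypothetical identifiers, used only to exercise the function.
    example_ids = ["1A9U_1_A", "10#2XYZ_1_B", "FM#T0000"]
    plain, tagged = split_by_marker(example_ids)
    print("no split marker:", plain)
    print("validation/test marker:", tagged)
```

For a dataset with no validation or test sets, the second list should come back empty before the IDs are handed to SidechainNet.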

There is also functionality, not yet fully tested, that lets you load a protein into a SCNProtein if you have its PDB file. However, this doesn't work for proteins with gaps in their sequences, and the PDB file must contain only a single chain.

Please let me know if I can help any more!

harshagrawal13 commented 1 year ago

Hey, thanks for your reply. Here's a copy of the colab notebook: https://colab.research.google.com/drive/1X-Z7qcDUyQxIXnBYyWd042BcQr3UsF-z?usp=sharing. Kindly let me know if you can find the issue or suggest how it can be fixed :)

jonathanking commented 1 year ago

I get the same error when running your notebook. I think it's because of the reasons I mentioned above (improper sidechainnet ids). Please let me know if I can clarify further.

harshagrawal13 commented 1 year ago

Gotcha. Thanks. I'll try to fetch the correct IDs and post if I encounter any other issues.