jonathanking / sidechainnet

An all-atom protein structure dataset for machine learning.
BSD 3-Clause "New" or "Revised" License

Confusion Regarding train_eval and validation sets #60

Closed harshagrawal13 closed 1 year ago

harshagrawal13 commented 1 year ago

Hey Jonathan! I am opening another issue to clarify a minor yet important point: why is there a train_eval set (consisting of the same number of proteins as the train set)? Moreover, why do the validation and test sets contain only 32 proteins? Isn't that too small to assess generalizability? It's less than 0.1% of the train set. Please let me know if there is a way to obtain a validation set that is roughly 10-20% of the train set size.

Am I missing something? Kindly let me know! Thanks

jonathanking commented 1 year ago

Hi Harsh,

Thanks for your comments.

  1. Can you clarify which dataset you are using (CASP X, thinning Y, or a custom dataset)? The train_eval set is supposed to be a smaller, manageable downsampling of the entire training set. This way, if you have a model that is expensive to train, you may approximate its performance at the end of an epoch by evaluating the train_eval set instead of train.
  2. Please see our paper, SidechainNet, and Mohammad AlQuraishi's paper, ProteinNet, for a discussion on the validation sets. They are designed to give you multiple ways of validating your model depending on your needs.
  3. Please see the lower portion of our Colab notebook linked in the ReadMe for more information about creating your own dataset splits. You are welcome to define any dataset organization that you feel is appropriate for your work.
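
As a rough illustration of points 1 and 3, here is a minimal sketch of how a custom split might be built from a list of protein IDs. It does not use the SidechainNet API: `split_ids`, its parameters, and the `protein_XXXX` IDs are hypothetical names made up for this example. It holds out a chosen fraction of IDs as a validation set and then draws a small `train_eval` subsample from what remains in the training set.

```python
import random


def split_ids(protein_ids, valid_fraction=0.1, train_eval_size=500, seed=0):
    """Hypothetical helper (not part of sidechainnet).

    Shuffles the IDs, holds out `valid_fraction` of them for validation,
    and draws a small `train_eval` subsample from the remaining training IDs.
    """
    rng = random.Random(seed)      # local RNG so the split is reproducible
    ids = sorted(protein_ids)      # sort first so input order does not matter
    rng.shuffle(ids)

    n_valid = max(1, round(len(ids) * valid_fraction))
    valid = ids[:n_valid]
    train = ids[n_valid:]

    # train_eval is a downsampled view of train, used for cheap end-of-epoch checks
    train_eval = rng.sample(train, min(train_eval_size, len(train)))
    return {"train": train, "train_eval": train_eval, "valid": valid}


# Placeholder IDs for illustration only; real IDs would come from your dataset.
example_ids = [f"protein_{i:04d}" for i in range(1000)]
splits = split_ids(example_ids, valid_fraction=0.15, train_eval_size=50)
print({name: len(ids) for name, ids in splits.items()})
```

Note that a purely random split like this ignores sequence similarity between proteins, so it may overestimate how well a model generalizes. The ProteinNet validation sets are organized by sequence identity thresholds for that reason.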

Please let me know if you need any further assistance!

Best, Jonathan

jonathanking commented 1 year ago

I’m going to mark this as closed, but please let me know if I can be of more assistance.