Closed: xiongzhp closed this issue 3 years ago
I think I understand the confusion. SidechainNet DataLoaders sample the available data randomly when batching, so each protein is loaded only approximately once per epoch. Behind the scenes, the proteins are organized into bins by size, and a size bin is picked with probability proportional to the number of proteins it contains. Perhaps it would be a nice feature to offer a way not to load the data like this! This is just how I loaded it for my own work.
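To illustrate why this kind of sampling yields repeats, here is a minimal simulation. It is not SidechainNet's actual implementation: the function name `sample_epoch`, the `bin_width` of 50 residues, and the random protein lengths are all made up for the sketch, and it assumes every draw is independent and made with replacement. Under those assumptions, picking a bin in proportion to its size and then a protein uniformly within it is the same as drawing uniformly from all proteins, so roughly 1 - 1/e (about 63%) of the proteins are expected to appear in one epoch.

```python
import random
from collections import defaultdict


def sample_epoch(lengths, bin_width=50, seed=0):
    """Draw len(lengths) protein indices: pick a size bin, then a protein in it.

    Hypothetical sketch of size-binned sampling with replacement; the real
    DataLoader may differ in how it bins and batches.
    """
    rng = random.Random(seed)

    # Group protein indices into bins by sequence length.
    bins = defaultdict(list)
    for idx, length in enumerate(lengths):
        bins[length // bin_width].append(idx)

    # Each bin is chosen with probability proportional to its population.
    bin_keys = list(bins)
    weights = [len(bins[key]) for key in bin_keys]

    drawn = []
    for _ in range(len(lengths)):
        key = rng.choices(bin_keys, weights=weights, k=1)[0]
        drawn.append(rng.choice(bins[key]))
    return drawn


# Made-up lengths for 25212 proteins (the dataset size quoted in this thread).
length_rng = random.Random(1)
lengths = [length_rng.randint(30, 700) for _ in range(25212)]

drawn = sample_epoch(lengths)
print(len(drawn), len(set(drawn)))  # total draws, unique proteins seen
```

The first number printed is the number of draws in the simulated epoch and the second is how many distinct proteins were actually seen.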
To get a better understanding of the number of unique IDs available, please load the data as a Python dictionary (do not provide the `with_Pytorch` argument to `scn.load`).
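As a rough sketch of that check: the helper below counts total and unique IDs in one split of the dictionary. The helper name `count_ids` and the toy IDs are made up, and the `scn.load` arguments shown in the comment are assumptions, since this thread does not say which CASP version or thinning was used. It also assumes the training split stores its IDs under an `"ids"` key.

```python
def count_ids(split):
    """Return (total entries, unique IDs) for one split of the data dictionary."""
    ids = split["ids"]
    return len(ids), len(set(ids))


# With SidechainNet installed, the dictionary would come from something like:
#   import sidechainnet as scn
#   data = scn.load(casp_version=12, thinning=30)  # assumed arguments
#   print(count_ids(data["train"]))

# Toy stand-in with made-up IDs so the helper can be run without the dataset.
toy_split = {"ids": ["1ABC_1_A", "2XYZ_1_B", "1ABC_1_A"]}
print(count_ids(toy_split))  # (3, 2)
```

If the two numbers match for the real training split, every ID in the dictionary is unique, and any repeats seen during training come from the DataLoader's sampling.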
The actual number of IDs in this dataset is 25212 (which is why the DataLoader yields this many proteins in one epoch). It's just that, by default, the DataLoader is not designed to load each protein exactly once.
Yes, you are right. There are 25212 proteins.
Please let me know if I can provide any further assistance, and good luck with your work!
Which will output (25212, 15829). That is to say, are there only 15829 different structures in the training set?