There are some repeating pids in the datasets

jonathanking / sidechainnet

An all-atom protein structure dataset for machine learning.

BSD 3-Clause "New" or "Revised" License


There are some repeating pids in the datasets #33

Closed xiongzhp closed 3 years ago

xiongzhp commented 3 years ago

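import sidechainnet as scn

data = scn.load(
    casp_version=12,
    thinning=30,
    with_pytorch='dataloaders',
    batch_size=1,
    dynamic_batching=False,
)

# With batch_size=1, each batch holds one protein; record its ID.
a_list = []
for batch in data['train']:
    a_list.append(batch.pids[0])

# Total proteins yielded in one pass vs. number of distinct IDs.
print((len(a_list), len(set(a_list))))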

This outputs (25212, 15829). Does that mean the training set contains only 15829 distinct structures?

jonathanking commented 3 years ago

I think I understand the confusion. SidechainNet DataLoaders sample randomly from the data available for batching, so each protein is loaded approximately, not exactly, once per epoch. Behind the scenes, the proteins are organized into bins by size, and a size bin is picked with probability proportional to the number of proteins it contains. Perhaps it would be a nice feature to offer loading that does not work like this! This is just how I had loaded the data for my own work.
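To illustrate why random bin-based sampling produces repeated IDs, here is a toy simulation. It is not SidechainNet's actual sampler: the bin sizes are made up, and it assumes that after a bin is picked, one protein is drawn uniformly from it (with batch_size = 1). Under those assumptions, one epoch amounts to sampling with replacement, so only about 1 - 1/e (roughly 63%) of the IDs are expected to be distinct, which is in line with 15829 / 25212 ≈ 0.628 reported above.

```python
import random


def simulate_epoch(bin_sizes, seed=0):
    """Draw one 'epoch' of single-protein batches from size bins.

    Toy model (an assumption, not SidechainNet code): pick a bin with
    probability proportional to its size, then pick one protein
    uniformly at random from that bin.
    """
    rng = random.Random(seed)
    # Give every protein a unique (bin_index, position) ID.
    bins = [[(b, i) for i in range(n)] for b, n in enumerate(bin_sizes)]
    total = sum(bin_sizes)
    drawn = []
    for _ in range(total):  # one draw per protein in the dataset
        chosen_bin = rng.choices(bins, weights=bin_sizes, k=1)[0]
        drawn.append(rng.choice(chosen_bin))
    return drawn


# Hypothetical bin sizes chosen only so that they sum to 25212.
bin_sizes = [5000, 12000, 8212]
drawn = simulate_epoch(bin_sizes)
unique_fraction = len(set(drawn)) / len(drawn)
print(len(drawn), len(set(drawn)), round(unique_fraction, 3))
```

The number of draws equals the dataset size, but the number of distinct IDs is noticeably smaller, matching the pattern in the original question.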

To get a better understanding of the number of unique IDs available, please load the data as a Python dictionary (do not provide the with_pytorch argument to scn.load).
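A minimal sketch of that check follows. The helper summarize_ids is hypothetical, and the toy IDs are placeholders. The commented-out lines show how it might be applied to the dictionary returned by scn.load; the "ids" key used there is an assumption about the dictionary layout, so please verify it against your installed version.

```python
def summarize_ids(ids):
    """Return (total, unique) counts for a sequence of protein IDs."""
    return len(ids), len(set(ids))


# Toy example with placeholder IDs (one deliberate duplicate).
toy_ids = ["AAAA_1_A", "BBBB_1_A", "AAAA_1_A"]
print(summarize_ids(toy_ids))  # (3, 2)

# Possible usage with SidechainNet (assumes the dictionary exposes the
# training IDs under data["train"]["ids"]):
#
#   import sidechainnet as scn
#   data = scn.load(casp_version=12, thinning=30)  # no with_pytorch
#   print(summarize_ids(data["train"]["ids"]))
```

If the two numbers match for the training split, every ID in the dataset is unique and the repeats come only from how the DataLoader samples.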

jonathanking commented 3 years ago

The actual number of IDs in this dataset is 25212 (which is why the DataLoader yields this many proteins in one epoch). It's just that, by design, the DataLoader does not load each protein exactly once by default.

xiongzhp commented 3 years ago

Yes, you are right. There are 25212 proteins.

jonathanking commented 3 years ago

Please let me know if I can provide any further assistance, and good luck with your work!