Open albertma-evotec opened 5 years ago
Hi, the 'prob' parameter controls how often samples are drawn from each data source. For example, say you have two sources, A.csv and B.csv, with probabilities 0.8 and 0.2 respectively (the probabilities should sum to 1):
import gentrl

# 'prob' is set per source, so both sources go into the same dataset's 'sources' list
md = gentrl.MolecularDataset(sources=[
    {'path': 'A.csv',
     'smiles': 'SMILES',
     'prob': 0.8,
     'plogP': 'plogP',
    },
    {'path': 'B.csv',
     'smiles': 'SMILES',
     'prob': 0.2,
     'plogP': 'plogP',
    }],
    props=['plogP'])
So, when you train on this data, roughly 80% of the training samples will come from A.csv and roughly 20% from B.csv.
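As for the `s += self.source_probs[i]` part: as far as I can tell, `s` is a running total that marks where each source's slice of the interval [0, 1) begins. Here is a minimal standalone sketch of that idea. It is not the actual code from dataloader.py, and `pick_source` is a name I made up for illustration.

```python
import random
from collections import Counter


def pick_source(source_probs, trial):
    """Return the index of the source whose interval contains `trial`.

    source_probs: probabilities that sum to 1, e.g. [0.8, 0.2].
    trial: a number in [0, 1), normally random.random().
    (Hypothetical helper, written only to illustrate the idea.)
    """
    s = 0.0  # left edge of the current source's interval
    for i, p in enumerate(source_probs):
        # Source i owns the interval [s, s + p).
        if s <= trial < s + p:
            return i
        s += p  # shift the left edge to the start of the next interval
    return len(source_probs) - 1  # guard against floating-point round-off


rng = random.Random(0)  # fixed seed so the run is repeatable
probs = [0.8, 0.2]
n_draws = 10_000
counts = Counter(pick_source(probs, rng.random()) for _ in range(n_draws))
fractions = {i: counts[i] / n_draws for i in range(len(probs))}
print(fractions)  # fractions should land close to 0.8 and 0.2
```

With probabilities [0.8, 0.2], source 0 owns [0, 0.8) and source 1 owns [0.8, 1.0). Without the `s += p` step, every source would be tested against an interval starting at 0, so the later sources would not get their intended share.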
In the pretrain notebook, there is only one source and its 'prob' is set to 1, so 100% of the training data comes from train_plogp_plogpm.csv.
What does the 'prob' parameter control? I can see that it is stored in the self.source_probs variable, but I cannot really understand what it does in the __getitem__ function. And why is s updated with s += self.source_probs[i] at the end of each iteration of the for loop?
(In dataloader.py)
Could anyone explain this to me, please? Many thanks in advance.