insilicomedicine / GENTRL

Generative Tensorial Reinforcement Learning (GENTRL) model

"prob" parameter in dataset source #13

Open albertma-evotec opened 4 years ago

albertma-evotec commented 4 years ago

(In pretrain notebook)

image

What does this parameter control? I can see that it is stored in the self.source_probs variable, but I cannot really understand what it does in the __getitem__ function. And why is s updated as s += self.source_probs[i] at the end of each iteration of the for loop?

(In dataloader.py) image

Could anyone explain this, please? Many thanks in advance.

Bibyutatsu commented 4 years ago

Hi, the 'prob' parameter controls how often samples are drawn from each data source. For example, say you have two data files, A.csv and B.csv, with probabilities 0.8 and 0.2 respectively (the probabilities should sum to 1):

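import gentrl

# Both files go into the 'sources' list of ONE MolecularDataset:
# 'prob' weighs a source against the other sources of the same dataset.
dataset = gentrl.MolecularDataset(sources=[
        {'path': 'A.csv',        # drawn about 80% of the time
         'smiles': 'SMILES',
         'prob': 0.8,
         'plogP': 'plogP'},
        {'path': 'B.csv',        # drawn about 20% of the time
         'smiles': 'SMILES',
         'prob': 0.2,
         'plogP': 'plogP'}],
    props=['plogP'])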

So when you train with this setup, about 80% of the training samples will come from A.csv and about 20% from B.csv.

In the pretrain notebook example, 'prob' is set to 1, so 100% of the training data comes from train_plogp_plogpm.csv.
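As for the s += self.source_probs[i] line from the question: here is a minimal sketch of how a selection loop like that usually works. It is an illustration only, not the actual dataloader.py code, and pick_source is a hypothetical helper name. The assumption is that a random number in [0, 1) is drawn and each source owns a slice of that range as wide as its probability, so s has to move forward by source_probs[i] to mark where the next slice starts.

```python
import random


def pick_source(source_probs, trial):
    """Return the index of the source whose interval contains `trial`.

    Hypothetical helper: source i owns [s, s + source_probs[i]),
    where s is the sum of the probabilities of all earlier sources.
    """
    s = 0.0
    for i, p in enumerate(source_probs):
        if s <= trial < s + p:
            return i
        s += p  # advance the lower edge to the start of the next interval
    # Fallback in case floating-point round-off leaves a tiny gap near 1.0
    return len(source_probs) - 1


random.seed(0)
probs = [0.8, 0.2]
counts = [0, 0]
for _ in range(10_000):
    counts[pick_source(probs, random.random())] += 1

# counts[0] should be close to 8000 and counts[1] close to 2000
print(counts)
```

With probs = [0.8, 0.2], a draw below 0.8 picks source 0 and a draw of 0.8 or above picks source 1, which is where the 80/20 split comes from. With a single source and 'prob' set to 1, every draw lands in the only interval.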