The setting `MAX_LEN = 277` in `models/model_zinc` seems not enough

shalijiang commented 5 years ago

Encoding the following SMILES string with the pretrained zinc model results in an "index out of bounds" error, because num_productions=288 for this string, which exceeds MAX_LEN = 277. Should I simply set MAX_LEN larger?

grammar_model.encode(["O1[C@@H](O[C@H]2[C@H](O)[C@@H](O[C@@H]3OC[C@@](O)(C)[C@H]([NH2+]C)[C@H]3O)[C@@H]([NH2+]CC)C[C@H]2[NH3+])[C@@H]([NH3+])CC=C1C[NH3+]"])

mkusner commented 5 years ago

Yep, that's right! 277 was chosen to fit the molecules in the ZINC dataset; just increase it to fit larger molecules.
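To make the failure mode concrete, here is a minimal sketch of a fixed-size one-hot encoding. It assumes each molecule is stored as a MAX_LEN x NUM_RULES matrix with one row per production, padded at the end. `NUM_RULES = 8`, `PAD_RULE`, `one_hot_encode`, and the rule indices are hypothetical stand-ins for illustration, not the repository's code.

```python
MAX_LEN = 277             # value from models/model_zinc, per this thread
NUM_RULES = 8             # placeholder; the real grammar has its own rule count
PAD_RULE = NUM_RULES - 1  # assumption: the last rule index serves as padding


def one_hot_encode(rule_indices, max_len=MAX_LEN, num_rules=NUM_RULES):
    """Return a max_len x num_rules one-hot matrix as a list of lists."""
    one_hot = [[0] * num_rules for _ in range(max_len)]
    for step, rule in enumerate(rule_indices):
        # Raises IndexError as soon as step >= max_len.
        one_hot[step][rule] = 1
    # Fill the unused trailing rows with the padding rule.
    for step in range(len(rule_indices), max_len):
        one_hot[step][PAD_RULE] = 1
    return one_hot


# Made-up production sequences: one that fits and one with 288 productions.
short_seq = [i % 7 for i in range(100)]
long_seq = [i % 7 for i in range(288)]

encoded = one_hot_encode(short_seq)

try:
    one_hot_encode(long_seq)
    overflowed = False
except IndexError:
    overflowed = True
```

With `max_len=288` the second sequence encodes without error, but a model trained on 277-row inputs would presumably no longer match that shape.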


shalijiang commented 5 years ago

Thanks! I'm also using a subset of the ZINC dataset (about 160000+ chemicals). Most of the SMILES run OK; only a couple of inputs result in an error. Was 277 chosen as the empirical max len of the 250k training set?

If I simply change MAX_LEN, I guess I would have to retrain the model? (Loading the pretrained model with an updated MAX_LEN raises an error.)

mkusner commented 5 years ago

Yeah, 277 was based on the largest molecule in the training set.

Sorry, yes, you'll need to retrain. One could instead use an RNN, where a single set of weights is shared across timesteps and the hidden state from the previous timestep is fed into the next. That would avoid having to retrain when larger molecules are introduced (although the behavior might get odd if molecules are much larger than those in the training set). My guess is that this would perform worse, as there are fewer parameters, but I didn't try a model like this.
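As a rough illustration of the RNN idea, the sketch below runs a tiny recurrent cell over production sequences of different lengths using one shared set of weights. Everything here (`NUM_RULES`, `HIDDEN`, `rnn_encode`, the random weights) is a hypothetical toy and not the model in this repository; it only shows that the parameter count does not depend on sequence length.

```python
import math
import random

NUM_RULES = 8  # placeholder rule count
HIDDEN = 4     # placeholder hidden-state size

random.seed(0)
# A single set of weights, reused at every timestep.
W_in = [[random.uniform(-0.1, 0.1) for _ in range(NUM_RULES)]
        for _ in range(HIDDEN)]
W_h = [[random.uniform(-0.1, 0.1) for _ in range(HIDDEN)]
       for _ in range(HIDDEN)]


def rnn_encode(rule_indices):
    """Fold a sequence of rule indices into a fixed-size hidden state."""
    h = [0.0] * HIDDEN
    for rule in rule_indices:
        # A one-hot input simply selects column `rule` of W_in.
        h = [
            math.tanh(W_in[i][rule]
                      + sum(W_h[i][j] * h[j] for j in range(HIDDEN)))
            for i in range(HIDDEN)
        ]
    return h


# The parameter count is independent of sequence length.
num_params = HIDDEN * NUM_RULES + HIDDEN * HIDDEN

h_277 = rnn_encode([i % NUM_RULES for i in range(277)])
h_288 = rnn_encode([i % NUM_RULES for i in range(288)])
```

Both calls return a state of length `HIDDEN`, so a 288-production input needs no new weights. Whether such a model would match the quality of the fixed-length one is untested, as noted above.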


shalijiang commented 5 years ago

I enlarged the dataset to 1M+ and found that the max len is 288. However, only two SMILES have an indices length > 277. I'm wondering whether I should increase the parameter just for these two. In your experience, will a larger MAX_LEN harm performance? I guess it should at least be slower, but otherwise?
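A check like the one described can be sketched as follows, assuming the number of productions per molecule has already been computed. The `lengths` list is made-up data and `length_report` is a hypothetical helper, not part of the repository.

```python
MAX_LEN = 277  # current setting in models/model_zinc


def length_report(num_productions, max_len=MAX_LEN):
    """Summarize how many sequences exceed max_len."""
    over = [n for n in num_productions if n > max_len]
    return {
        "max": max(num_productions),
        "num_over": len(over),
        "fraction_over": len(over) / len(num_productions),
    }


# Made-up production counts standing in for a real dataset.
lengths = [120, 95, 277, 288, 150, 281]
report = length_report(lengths)
```

If `num_over` is tiny relative to the dataset, that is the situation in question here: a handful of outliers would dictate MAX_LEN for everything else.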

LucFrachon commented 4 years ago

> I enlarged the dataset to 1M+ and found that the max len is 288. However, only two SMILES have an indices length > 277. I'm wondering whether I should increase the parameter just for these two. In your experience, will a larger MAX_LEN harm performance? I guess it should at least be slower, but otherwise?

From my own experiments (with a different grammar, not SMILES), I found that restricting the maximum length helped, but I don't know whether that comes from the reduced size or from removing a bit of padding ("Nothing -> None") at the end. Probably both.

mkusner commented 4 years ago

I'd opt for removing the 2 molecules. Increasing the size will require the model to have more parameters, and those would be trained on only 2 molecules, so they will likely be very poor.
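One way to picture this advice, as a sketch under assumptions: drop the over-length molecules, and count how many sequences would actually supply a non-padding production at each position. The `records` data and the helpers `split_by_length` and `coverage` are hypothetical, for illustration only.

```python
MAX_LEN = 277  # current setting in models/model_zinc


def split_by_length(records, max_len=MAX_LEN):
    """Split (name, num_productions) pairs into kept and dropped lists."""
    kept = [r for r in records if r[1] <= max_len]
    dropped = [r for r in records if r[1] > max_len]
    return kept, dropped


def coverage(lengths, position):
    """Count sequences with a real (non-padding) production at a 0-based position."""
    return sum(1 for n in lengths if n > position)


# Made-up records: (molecule id, number of productions).
records = [("mol_a", 120), ("mol_b", 277), ("mol_c", 288), ("mol_d", 281)]
kept, dropped = split_by_length(records)

lengths = [n for _, n in records]
early_coverage = coverage(lengths, 0)    # every sequence reaches position 0
late_coverage = coverage(lengths, 277)   # only the over-length ones reach here
```

Positions at or beyond index 277 are reached only by the dropped molecules, so any parameters tied to those positions would see almost no real training signal.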

mkusner / grammarVAE

The setting `MAX_LEN = 277` in `models/model_zinc` seems not enough #24