How to generate tokens from SMILES strings

oriondollar / TransVAE

A Transformer Based VAE Architecture for De Novo Molecular Design

MIT License


How to generate tokens from SMILES strings #5

Status: Closed (zhikaili closed this issue 2 years ago)

zhikaili commented 2 years ago

Hi,

Thank you for sharing the paper and code. I want to apply TransVAE to another molecule dataset, but I don't know how to generate the vocabulary for it. The README says that "The vocabulary must be a pickle file that stores a dictionary that maps token -> token id and it must begin with the or token". How can I generate such a pickle file, and the corresponding char_weights, if I only have the SMILES strings? Could you please share some example data preprocessing scripts?

Thank you!

oriondollar commented 2 years ago

Just wrote a quick script called build_vocab.py that will generate those files for you. You can build custom vocab and weights files by running python scripts/build_vocab.py --mols YOUR_SMILES.smi. There are a few other parameters you can set (you can check parsers.py if you're interested) but the default settings should get you the files you need and put them in the data folder. Hope that helps!
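For readers who want to see the general shape of such a preprocessing step, here is a minimal sketch of building a token -> token id dictionary from SMILES strings and pickling it. It is not the repository's build_vocab.py. The tokenizer regex is a commonly used SMILES pattern, and the '<start>' and '_' (padding) token names, the placement of padding second-to-last, and the output filename char_dict.pkl are all assumptions made for illustration; only '<end>' is named in this thread.

```python
import pickle
import re
from collections import Counter

# Commonly used SMILES tokenization pattern: bracket atoms, two-letter
# halogens, aromatic atoms, bonds, branches, and ring-closure digits.
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize(smiles):
    """Split one SMILES string into tokens; fail loudly if anything is dropped."""
    tokens = SMILES_TOKEN_RE.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize {smiles!r} without losing characters")
    return tokens


def build_vocab(smiles_list, start="<start>", end="<end>", pad="_"):
    """Return (token -> id dict, token counts).

    The dict begins with the start token. Putting padding second-to-last
    and the end token last is an assumption, not a documented requirement.
    """
    counts = Counter()
    for smi in smiles_list:
        counts.update(tokenize(smi))
    ordered = [start] + sorted(counts) + [pad, end]
    return {tok: idx for idx, tok in enumerate(ordered)}, counts


def save_vocab(vocab, path="char_dict.pkl"):
    """Pickle the dictionary; the default filename is hypothetical."""
    with open(path, "wb") as fh:
        pickle.dump(vocab, fh)


# Toy input, for illustration only.
example_vocab, example_counts = build_vocab(["CC(=O)Cl", "c1ccccc1"])
print(example_vocab)
```

The token counts are returned alongside the dictionary because any frequency-based weighting scheme would start from them; how the repository turns counts into char_weights is not described in this thread, so that step is left out here.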

zhikaili commented 2 years ago

@oriondollar Thank you for your help!

oriondollar commented 2 years ago

@zhikaili no prob!

zhikaili commented 2 years ago

@oriondollar Hi, may I ask whether the YOUR_SMILES.smi file should contain the SMILES from both the train and validation splits, or only the train split? Thank you!

zhikaili commented 2 years ago

Hi,

I followed build_vocab.py and parsers.py to write a notebook for preparing my data. Before applying it to my data, I ran my code on ZINC to check whether it reproduces your results. My output char_dict contains the same entries as yours, although the order of the keys is different. However, the char weight for the '<end>' token is significantly different from yours: mine is about 0.6584 while yours is 0.75. Please see the following screenshots (params['CHAR_DICT'] was set to the char_dict given by you, so that the two sets of weights are sorted in the same order):

May I ask if I missed anything? Do I also need to manually set the weight for '<end>' to 0.75, just like setting char_weights[-2] to 0.1?

Thank you!
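The comparison described above (two weight vectors whose dictionaries hold the same tokens in a different key order) can be sketched as follows. This is only an illustration of how to line the two vectors up and find the largest per-token gaps; it does not explain where a gap comes from. The function names and the toy dictionaries and weights are hypothetical, except for the 0.6584 and 0.75 values quoted in this thread.

```python
def align_weights(weights, src_dict, ref_dict):
    """Reorder `weights` (indexed by src_dict ids) into ref_dict's id order."""
    if set(src_dict) != set(ref_dict):
        raise ValueError("the two dictionaries do not contain the same tokens")
    aligned = [0.0] * len(ref_dict)
    for token, ref_idx in ref_dict.items():
        aligned[ref_idx] = weights[src_dict[token]]
    return aligned


def largest_differences(weights_a, weights_b, ref_dict, top=3):
    """Return the `top` tokens with the largest absolute weight difference."""
    diffs = [(tok, abs(weights_a[i] - weights_b[i])) for tok, i in ref_dict.items()]
    return sorted(diffs, key=lambda pair: pair[1], reverse=True)[:top]


# Toy example: same tokens, different key order (values are illustrative).
ref_dict = {"<start>": 0, "C": 1, "_": 2, "<end>": 3}
ref_weights = [1.0, 0.5, 0.1, 0.75]
my_dict = {"C": 0, "<start>": 1, "<end>": 2, "_": 3}
my_weights = [0.5, 1.0, 0.6584, 0.1]

aligned = align_weights(my_weights, my_dict, ref_dict)
for token, diff in largest_differences(aligned, ref_weights, ref_dict):
    print(f"{token}: {diff:.4f}")
```

Aligning by token rather than by position avoids comparing weights that belong to different tokens when the two dictionaries were built in a different order.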