Closed zhikaili closed 2 years ago
Just wrote a quick script called build_vocab.py that will generate those files for you. You can build custom vocab and weights files by running `python scripts/build_vocab.py --mols YOUR_SMILES.smi`. There are a few other parameters you can set (check parsers.py if you're interested), but the default settings should get you the files you need and put them in the data folder. Hope that helps!
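For anyone who can't run the script directly, here is a minimal sketch of the kind of thing build_vocab.py does: tokenize the SMILES strings character-wise, build a token -> id dictionary that begins with the special tokens, and pickle it. The function name, the choice of `<start>`/`<end>` as the special tokens, and the output path are assumptions for illustration, not the script's exact code.

```python
import pickle
from collections import Counter

def build_vocab(smiles_list, out_path="char_dict.pkl"):
    """Hypothetical sketch: build a token -> id mapping from SMILES.

    Special tokens go first (assumed to be <start> and <end>);
    the remaining tokens are the characters seen in the data.
    """
    char_dict = {"<start>": 0, "<end>": 1}
    counts = Counter(ch for smi in smiles_list for ch in smi)
    for ch in sorted(counts):  # sorted for a reproducible ordering
        char_dict[ch] = len(char_dict)
    with open(out_path, "wb") as f:
        pickle.dump(char_dict, f)
    return char_dict, counts
```

The key ordering doesn't matter for correctness as long as the ids are consistent everywhere the dictionary is used.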
@oriondollar Thank you for your help!
@zhikaili no prob!
@oriondollar Hi, may I ask whether the YOUR_SMILES.smi file should contain the SMILES of both the train and validation splits, or just the train split alone? Thank you!
Hi,
I followed build_vocab.py and parsers.py to write a notebook for preparing my data. Before applying it to my data, I tried my code on ZINC to see if it gives the same results as yours. I find that my output char_dict is the same as yours, though the order of the keys is different. However, the char weight for the `<end>` token is significantly different from yours -- mine is about 0.6584 while yours is 0.75. Please see the following screenshots (params['CHAR_DICT'] was set to the char_dict given by you, so that the two groups of weights are sorted in the same order):
May I ask if I missed anything? Do I also need to manually set the weight for `<end>` to 0.75, just like setting char_weights[-2] to 0.1?
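For context, here is one plausible way such weights could be computed -- an inverse-frequency scheme with a damping exponent. The function, the `freq_penalty` parameter, and the assumption that `<end>` occurs exactly once per molecule are all illustrative guesses, not the repo's actual implementation, so small differences like 0.6584 vs 0.75 could easily come from a different normalization or penalty.

```python
import numpy as np

def compute_char_weights(smiles_list, char_dict, freq_penalty=0.5):
    """Hypothetical inverse-frequency token weighting (not the repo's exact scheme).

    Rare tokens get larger weights; the exponent dampens extreme ratios.
    """
    counts = np.zeros(len(char_dict))
    for smi in smiles_list:
        for ch in smi:
            counts[char_dict[ch]] += 1
    # Assumption: every encoded sequence ends with exactly one <end> token
    counts[char_dict["<end>"]] = len(smiles_list)
    weights = counts.max() / np.maximum(counts, 1)  # avoid division by zero
    return weights ** freq_penalty
```

With a scheme like this, the weight of `<end>` depends on the average molecule length in the training split, which is one reason it can differ between runs on different data.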
Thank you!
Hi,
Thank you for sharing the paper and code. I want to apply TransVAE to another molecule dataset, but I have no idea how to generate the vocabulary for my dataset. In the README, it is written that "The vocabulary must be a pickle file that stores a dictionary that maps token -> token id and it must begin with the `<start>` or `<end>` token". May I ask how I can generate such a pickle file and the corresponding char_weights if I only have the SMILES strings? Could you please share some example scripts for data preprocessing?
Thank you!