maxhodak / keras-molecules

Autoencoder network for learning a continuous representation of molecular structures.
MIT License

Problem with the pretrained model on autoencoding #78

Open jerryhluo opened 6 years ago

jerryhluo commented 6 years ago

Dear author,

I downloaded your code and the pre-trained model (model_500k.h5) and tried the following commands:

python preprocess.py data/smiles_500k.h5 data/processed_500k.h5
python sample.py data/processed_500k.h5 data/model_500k.h5 --target autoencoder

Then it outputs:

NC(=O)c1nc(cnc1N)c2ccc(Cl)c(c2)S(=O)(=O)Nc3cccc(Cl)c3
(-> encoder -> decoder ->)
7-7ASC-F@@7N7AAAAAAAAAAAAAlllllNAACAAC7lll7AlllAAACC%CLA-VVVVVVVVFF--lAAAAAAAAAAAAAAVVAAAAACCAACCAAACAAACCA77A-VVV--

I am not sure what happened to the pre-trained model; it does not seem to do a good job at all. Do you see a similar problem, or did I do something wrong?

jerryhluo commented 6 years ago

Found part of the reason: Python 2 happens to build the "charset" variable in a consistent order (A->Z), while in Python 3 the iteration order of a set is not deterministic across interpreter runs (string hashing is randomized). See https://stackoverflow.com/questions/9792664/set-changes-element-order
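For anyone else hitting this, here is a minimal sketch of the problem (using a single illustrative SMILES string, not the repo's actual preprocessing code) and the obvious fix of sorting the set so the index-to-character mapping is stable:

```python
# Under Python 3, string hashes are randomized per interpreter run
# (PYTHONHASHSEED), so iterating over a set of characters can yield a
# different order each run. A one-hot index that meant 'C' in one run
# can mean 'l' in the next, which produces garbled decodes like the
# output above. Sorting the set makes the order deterministic.
smiles = ["NC(=O)c1nc(cnc1N)c2ccc(Cl)c(c2)S(=O)(=O)Nc3cccc(Cl)c3"]

charset = list(set("".join(smiles)))           # order may change between runs
stable_charset = sorted(set("".join(smiles)))  # same order every run

print(charset)         # run-dependent under Python 3
print(stable_charset)  # deterministic, e.g. ['(', ')', '1', '2', ...]
```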

In addition, the charset extracted from the 500k SMILES dataset varies between runs (because of the sampling step in preprocess.py), so it is important to keep using the same set of files.
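Since sample.py reads the charset back from the processed file, you can check which charset a given file actually contains and compare files directly. A small sketch, assuming preprocess.py stored the charset in the output HDF5 under a dataset named 'charset' (which appears to be what this repo's preprocess.py does):

```python
import h5py

# Print the charset stored in a processed file so that two processed
# files (or a file and the charset a pre-trained model expects) can be
# compared character by character.
with h5py.File("data/processed_500k.h5", "r") as f:
    raw = f["charset"][:]

# h5py returns bytes for string datasets under Python 3; decode for display.
charset = [c.decode("utf-8") if isinstance(c, bytes) else c for c in raw]
print(len(charset), "characters:", charset)
```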

Could you please provide the charset used to train the pre-trained model? The model dimensions also depend on the charset. @maxhodak


Update (4/23/2018): Solution found at https://github.com/chembl/autoencoder_ipython