Hi Matthew. The default training data follows the setup described in the paper: "sentences" randomly sampled from an LSTM whose parameters are initialized from a normal distribution here:
https://github.com/suragnair/seqGAN/blob/ae8ffcd54977bd9ee177994c751f86d34f5f7aa3/generator.py#L27
The idea behind doing this is elegant: they now have an "oracle" that can compute the exact likelihood of any sentence. This is not possible otherwise, since the true distribution of English is not known and we have no oracle for it. By constructing this oracle and then training a freshly initialized LSTM on the training data (generated by the oracle), we can monitor the new LSTM's progress every epoch: the oracle takes a batch of sentences produced by the new LSTM and computes their likelihood. The better the new LSTM gets at mimicking the oracle, the higher the likelihood will be.
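To make this concrete, here is a minimal sketch of the oracle idea (my own illustration, not the repo's exact code; the sizes and the N(0,1) initialization follow the paper's synthetic setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, EMB, HID = 5000, 32, 32  # paper-style synthetic sizes

class Oracle(nn.Module):
    """A randomly initialized LSTM whose fixed weights define a known
    distribution over token sequences, so it can score any sentence."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.lstm = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)
        for p in self.parameters():
            nn.init.normal_(p, 0, 1)  # random N(0,1) weights; never trained

    def nll(self, seqs):
        # seqs: LongTensor (batch, seq_len) of token indices
        start = torch.zeros(seqs.size(0), 1, dtype=torch.long)
        inp = torch.cat([start, seqs[:, :-1]], dim=1)  # teacher-forced input
        logits = self.out(self.lstm(self.emb(inp))[0])
        return F.cross_entropy(logits.reshape(-1, VOCAB), seqs.reshape(-1))
```

Evaluating `Oracle().nll(...)` on samples from the LSTM being trained is exactly the monitoring signal described above: the NLL falls as those samples become more probable under the oracle.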
You can certainly train it on the SMILES data, but, much like English, there is no oracle, so you cannot be very sure whether the model is converging to the "correct" distribution. The best you can do is use the gen itself as the oracle, e.g. replace this
https://github.com/suragnair/seqGAN/blob/ae8ffcd54977bd9ee177994c751f86d34f5f7aa3/main.py#L157
with train_generator_MLE(gen, gen_optimizer, gen, oracle_samples, MLE_TRAIN_EPOCHS), and replace oracle_samples with the samples from the SMILES data.
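Concretely, the changed lines in main.py would look something like this (a sketch; the .trc file name is hypothetical, and it assumes the SMILES strings have already been tokenized into an integer matrix):

```python
# Load tokenized SMILES sequences in place of the synthetic oracle samples.
# Expected shape: (num_samples, max_seq_len), dtype long.
oracle_samples = torch.load('smiles_samples.trc')  # hypothetical file name

# Pass gen itself in the oracle slot, so the evaluation NLL during MLE
# pretraining is computed against the generator rather than a true oracle:
train_generator_MLE(gen, gen_optimizer, gen, oracle_samples, MLE_TRAIN_EPOCHS)
```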
@suragnair , thank you so much for your response. I'm a little confused by the definition of 'Oracle' - is the definition A) 'a perfect discriminator for a given dataset/type', B) 'a perfect generator for a given dataset/type', C) both A and B, or D) something completely different?
C. Since it's a perfect generator for the dataset, it also is the perfect discriminator in the sense that it can evaluate probabilities for any item (inside or outside the dataset).
Note that this perfect generator/discriminator is not something you have access to with, say, an English dataset or a SMILES dataset. But in the paper they artificially construct this oracle to help evaluate the GAN training procedure better.
Hi suragnair! I have the same question about the SMILES data input. Should I transform it into a .trc file, or would a .txt file work? I'm just wondering what exactly is inside the .trc file...
It's just a matrix that you can load with torch, like this: https://github.com/suragnair/seqGAN/blob/ae8ffcd54977bd9ee177994c751f86d34f5f7aa3/main.py#L141
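For example, here is a minimal sketch of building that matrix from SMILES strings and saving it so `torch.load()` can read it back (the character-level vocabulary and zero-padding scheme here are assumptions, not the repo's):

```python
import torch

smiles = ['CCO', 'c1ccccc1', 'CC(=O)O']           # toy example data
chars = sorted({ch for s in smiles for ch in s})  # character vocabulary
stoi = {ch: i + 1 for i, ch in enumerate(chars)}  # index 0 reserved for padding

max_len = max(len(s) for s in smiles)
data = torch.zeros(len(smiles), max_len, dtype=torch.long)
for row, s in enumerate(smiles):
    data[row, :len(s)] = torch.tensor([stoi[ch] for ch in s])

torch.save(data, 'smiles.trc')  # load later with torch.load('smiles.trc')
```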
OK, I will try. Thank you!
Hi suragnair! Do you have any idea how to convert between SMILES and .cif, or other 3D molecular formats like .car, .pdb, or .mol2?
I can generate the SMILES, but I cannot use it to rebuild the molecule...
Thank you for your work, it's very helpful for me!
Hi, I'm not sure since I have not used that dataset. But @mgbvox might know.
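For what it's worth, a common route outside this repo is RDKit; here is a minimal sketch (assuming RDKit is installed; nothing here is part of seqGAN) that parses a SMILES string, embeds 3D coordinates, and writes a PDB file:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles('CC(=O)O')        # parse SMILES; None if invalid
mol = Chem.AddHs(mol)                      # add explicit hydrogens
AllChem.EmbedMolecule(mol, randomSeed=42)  # generate 3D coordinates (ETKDG)
AllChem.MMFFOptimizeMolecule(mol)          # optional force-field cleanup
Chem.MolToPDBFile(mol, 'molecule.pdb')     # write a 3D format (here, PDB)
```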
Thanks again, have a nice day!
I have a dataset consisting of tokenized sequential data (SMILES data, where characters represent atoms in a chemical backbone). How do I train the model on this dataset rather than the default data? What even IS the default data? From what I can tell, when generate_samples() runs on line 166 of main.py, it overwrites everything in the real.data file anyway, presumably with random data.