suragnair / seqGAN

A simplified PyTorch implementation of "SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient." (Yu, Lantao, et al.)

Training on New (Different) Data - Stuck #10

Closed mgbvox closed 5 years ago

mgbvox commented 5 years ago

I have a dataset of tokenized sequential data (SMILES data, where characters represent atoms in a chemical backbone). How do I train the model on this dataset rather than the default data? What even IS the default data? From what I can tell, when generate_samples() runs on line 166 of main.py, it overwrites everything in the real.data file anyway, presumably with random data.

suragnair commented 5 years ago

Hi Matthew. The default training data is what they describe in the paper: randomly sampled "sentences" from a randomly initialized LSTM whose parameters are drawn from a normal distribution here:

https://github.com/suragnair/seqGAN/blob/ae8ffcd54977bd9ee177994c751f86d34f5f7aa3/generator.py#L27

The idea behind doing this is elegant: they now have an "oracle" that can compute the probability of any sentence. This is not possible otherwise, since the distribution of English is not known and we have no oracle for it. By constructing this oracle and then training a freshly initialized LSTM on the training data (generated by the oracle), we can monitor the new LSTM's learning progress every epoch: the oracle can take a batch of sentences produced by the new LSTM and compute their likelihood. The better the new LSTM gets at mimicking the oracle, the higher that likelihood will be.
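
Here is a minimal PyTorch sketch of that setup. The module shape and the `oracle_nll` helper are illustrative stand-ins for this repo's Generator and helpers, not its exact API:

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMB_DIM, HIDDEN_DIM = 5000, 32, 32  # hypothetical sizes

class LSTMGenerator(nn.Module):
    """Token-level LSTM language model (a stand-in for this repo's Generator)."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.lstm = nn.LSTM(EMB_DIM, HIDDEN_DIM, batch_first=True)
        self.out = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)

    def forward(self, tokens):
        h, _ = self.lstm(self.emb(tokens))
        return self.out(h)  # (batch, seq_len, vocab) next-token logits

# The "oracle": a frozen LSTM whose weights are drawn from N(0, 1),
# mirroring the init linked above (generator.py#L27).
oracle = LSTMGenerator()
for p in oracle.parameters():
    nn.init.normal_(p, 0, 1)
oracle.eval()

def oracle_nll(sequences):
    """Mean per-token negative log-likelihood of `sequences` under the oracle.
    `sequences` is a (batch, seq_len) LongTensor with a start token in
    column 0; lower NLL means the new LSTM's samples look more oracle-like."""
    with torch.no_grad():
        logits = oracle(sequences[:, :-1])
        return nn.functional.cross_entropy(
            logits.reshape(-1, VOCAB_SIZE), sequences[:, 1:].reshape(-1))
```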

You can certainly train it on the SMILES data, but much like English, there is no oracle, so you cannot be sure whether the model is converging to the "correct" distribution. The best you can do is use the generator itself as the oracle, e.g. replace this

https://github.com/suragnair/seqGAN/blob/ae8ffcd54977bd9ee177994c751f86d34f5f7aa3/main.py#L157

with train_generator_MLE(gen, gen_optimizer, gen, oracle_samples, MLE_TRAIN_EPOCHS), and replace oracle_samples with samples from the SMILES data.
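
Concretely, the edit might look like this. The file name smiles_tokens.trc is hypothetical; train_generator_MLE, gen, gen_optimizer, and MLE_TRAIN_EPOCHS are the names used in main.py:

```python
# main.py (sketch): pretrain on your own data instead of oracle samples.
# smiles_tokens.trc is assumed to hold a (num_samples, seq_len) LongTensor
# of integer-encoded SMILES strings.
oracle_samples = torch.load('smiles_tokens.trc')

# Pass gen in place of the oracle argument: with no true oracle for SMILES,
# the generator scores its own samples for the progress printout.
train_generator_MLE(gen, gen_optimizer, gen, oracle_samples, MLE_TRAIN_EPOCHS)
```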

mgbvox commented 5 years ago

@suragnair, thank you so much for your response. I'm a little confused by the definition of 'oracle'. Is it A) 'a perfect discriminator for a given dataset/type', B) 'a perfect generator for a given dataset/type', C) both A and B, or D) something completely different?

suragnair commented 5 years ago

C. Since it's a perfect generator for the dataset, it is also the perfect discriminator in the sense that it can evaluate the probability of any item (inside or outside the dataset).

Note that this perfect generator/discriminator is not something you have access to with, say, an English or SMILES dataset. But in the paper they artificially construct this oracle to better evaluate the GAN training procedure.
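
In code, "evaluating the probability of any item" just means summing an autoregressive model's per-step log-probabilities over the sequence. A minimal sketch, assuming `model` is any LSTM language model that returns (batch, seq_len, vocab) next-token logits, like the one sketched above:

```python
import torch

def sequence_log_prob(model, tokens):
    """Log P(tokens) under an autoregressive model: the sum over positions
    of log P(token_t | tokens_<t). `tokens` is a (batch, seq_len) LongTensor
    with a start token in column 0; returns one log-probability per sequence."""
    with torch.no_grad():
        log_probs = model(tokens[:, :-1]).log_softmax(dim=-1)  # (B, T-1, V)
        picked = log_probs.gather(2, tokens[:, 1:].unsqueeze(-1))
        return picked.squeeze(-1).sum(dim=-1)                  # (B,)
```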

laixindev commented 5 years ago

Hi suragnair! I have the same question about the SMILES data input. Should I transform it into .trc, or would just .txt be fine? I'm just wondering what exactly is in the .trc file...

suragnair commented 5 years ago

It's just a matrix you can load using torch, like this: https://github.com/suragnair/seqGAN/blob/ae8ffcd54977bd9ee177994c751f86d34f5f7aa3/main.py#L141
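
For example, such a file could be produced from raw SMILES strings like this. The character-level vocabulary and zero-padding here are assumptions, not part of the repo:

```python
import torch

smiles = ['CCO', 'c1ccccc1', 'CC(=O)O']  # toy SMILES strings

# Character-level vocabulary (an assumption: real SMILES tokenization often
# uses multi-character tokens such as 'Cl' or 'Br').
chars = sorted(set(''.join(smiles)))
stoi = {c: i + 1 for i, c in enumerate(chars)}  # index 0 reserved for padding

max_len = max(len(s) for s in smiles)
data = torch.zeros(len(smiles), max_len, dtype=torch.long)
for row, s in enumerate(smiles):
    data[row, :len(s)] = torch.tensor([stoi[c] for c in s])

torch.save(data, 'smiles_tokens.trc')
# Later, load it exactly like the line linked above:
# samples = torch.load('smiles_tokens.trc')
```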

laixindev commented 5 years ago

ok, I will try, thanks

laixindev commented 5 years ago

Hi suragnair! Do you have any idea how to convert between SMILES and .cif, or other 3D molecular formats like .car, .pdb, or .mol2?

I can generate the SMILES, but I can't use it to reconstruct the molecule...

Thank you for your work, it's very helpful to me!

suragnair commented 5 years ago

Hi, I'm not sure since I have not used that dataset. But @mgbvox might know.

laixindev commented 5 years ago

Thanks again,

Have a nice day