maxhodak / keras-molecules

Autoencoder network for learning a continuous representation of molecular structures.
MIT License

Generative Adversarial Networks #55

Open maxhodak opened 7 years ago

maxhodak commented 7 years ago

In case you guys haven't seen it, this paper came out recently and looks kind of interesting: https://arxiv.org/abs/1701.01329

My first couple of read-throughs leave me with some questions. The paper triggers a couple of my first-order heuristics (it explains basic material like RNNs, and the seemingly magical performance at generating long valid SMILES suggests overfitting), and it has a kind of weird application of fine-tuning as transfer learning, among other things. I'm planning on working up some parts of this paper, like the stacked LSTMs as a SMILES generator for transfer to a property prediction network (rough sketch below), over this weekend. Anyone else have any comments on this paper or things to try?

@pechersky @dribnet @dakoner
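
For concreteness, here's roughly the kind of 3-layer stacked LSTM I have in mind for the character-level generator (layer sizes, dropout rates, and sequence length are my own guesses, not taken from the paper):

```python
# Sketch of a 3-layer stacked LSTM over one-hot encoded SMILES characters.
# MAX_LEN and CHARSET_SIZE are placeholders for whatever the data prep produces.
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense, TimeDistributed

MAX_LEN = 120        # padded SMILES length
CHARSET_SIZE = 35    # number of distinct characters in the training set

model = Sequential()
model.add(LSTM(256, return_sequences=True, input_shape=(MAX_LEN, CHARSET_SIZE)))
model.add(Dropout(0.2))
model.add(LSTM(256, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256, return_sequences=True))
model.add(Dropout(0.2))
# predict a distribution over the next character at every position
model.add(TimeDistributed(Dense(CHARSET_SIZE, activation='softmax')))
model.compile(optimizer='adam', loss='categorical_crossentropy')
```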

pechersky commented 7 years ago

I'd be glad to work on this with you. I agree, it looks like they overfit their model to generate similar strings. This is especially evident in the fact that they supposedly got clean adamantyl strings. The t-SNE plot tells us nothing because we don't know what perplexity they ran it with.

Additionally, the epoch graph hints at overfitting. I like their fine-tuning idea of taking a generically trained network and optimizing it for a subset of the space. Some metric of specific-vs-general generation on the fine-tuned network would be useful. Since they canonicalize the SMILES, I don't understand why they'd use edit distance: small changes in the chemical topology can cause large changes in edit distance.

This 3-LSTM/Dropout topology looks pretty simple. I wonder what results it would give if the symbol table were made of SMILES tokens rather than individual characters.
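
For example, a token-level symbol table could come from a rough regex tokenizer along these lines (the token pattern is a loose sketch, not exhaustive):

```python
import re

# Rough SMILES tokenizer: bracket atoms and two-letter elements like Cl and Br
# become single symbols instead of separate characters.
TOKEN_PATTERN = re.compile(r"(\[[^\]]+\]|Br|Cl|Si|%\d{2}|[BCNOPSFIbcnops]|.)")

def tokenize(smiles):
    return TOKEN_PATTERN.findall(smiles)

print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1', 'C', '(', '=', 'O', ')', 'O']
```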


maxhodak commented 7 years ago

So I let a 3-LSTM run overnight on Saturday and the loss fell to near zero, but it was definitely overfitting; it clearly wasn't extracting any interesting information about the underlying chemistry. At that point I got distracted by the idea of using a GAN instead, which is what I've been working on since. It's pretty difficult to get it to train well, as the discriminator is much easier to learn than the generator (the discriminator's job is trivial when the generator is weak), so I haven't figured out yet how to keep the two in reasonable balance. I'm planning on asking a couple of friends at OpenAI for advice later today. I'll post my ipynb file once I have it working a little better!

pechersky commented 7 years ago

What are you using as the discriminator target? SMILES validity, or just presence/absence of the (possibly invalid) string in the training set?


maxhodak commented 7 years ago

I'm using presence/absence in the training set. SMILES validity is arguably an even easier metric, as CCCCCcccccccccccccccccccccccccccccccccc is valid SMILES but not representative of the distribution we want to learn.
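
To spell out the two options (assuming RDKit for the validity check; the file name and helper names here are just placeholders):

```python
from rdkit import Chem

# hypothetical file containing the canonical training SMILES, one per line
training_set = set(open('smiles_train.txt').read().split())

def label_by_membership(smiles):
    # "real" iff the exact string appears in the training data
    return 1 if smiles in training_set else 0

def label_by_validity(smiles):
    # "real" iff RDKit can parse it at all -- a much weaker criterion,
    # since degenerate strings can still be parseable
    return 1 if Chem.MolFromSmiles(smiles) is not None else 0
```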

maxhodak commented 7 years ago

This is pretty typical of attempts to train my network right now:

[screenshot: training curves, 2017-01-17 11:16 am]

Sampling from which gives me stuff that looks like Caaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

pechersky commented 7 years ago

One could require SMILES validity AND that the number of heavy atoms is <= tunable_parameter * max([number of heavy atoms in ligand | ligand in test set]). I think a GAN approach is awesome for this; perhaps we swap the GRUs from the VAE for deconv layers.
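
A rough sketch of that criterion with RDKit (tunable_parameter and the variable names are placeholders):

```python
from rdkit import Chem

def heavy_atom_count(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return mol.GetNumHeavyAtoms() if mol is not None else None

# cap derived from the test set, scaled by a tunable factor
tunable_parameter = 1.2
max_heavy = max(heavy_atom_count(s) for s in test_set_smiles)  # test_set_smiles assumed to exist

def acceptable(smiles):
    n = heavy_atom_count(smiles)
    # valid SMILES AND not wildly larger than anything in the test set
    return n is not None and n <= tunable_parameter * max_heavy
```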


maxhodak commented 7 years ago

See: https://github.com/maxhodak/keras-molecules/blob/gan/SMILES_GAN.ipynb

maxhodak commented 7 years ago

On pretraining, it's worth noting that if I don't pretrain the generator, no interesting training happens at all when I try to train the GAN: the discriminator loss just goes to 0 and the generator loss goes to ~16. It's not clear whether pretraining the discriminator matters, or whether it even makes things worse.

Some posts suggest changing learning parameters at runtime depending on which side is "advantaged" (roughly as sketched below); see https://github.com/torch/torch.github.io/blob/master/blog/_posts/2015-11-13-gan.md

Some more ideas here I haven't worked through yet: https://github.com/soumith/ganhacks
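
The core trick from that torch post, roughly translated into the kind of Keras loop I'm running (model names, batch size, and the threshold are placeholders, not exactly what's in my notebook):

```python
import numpy as np

# generator, discriminator, and the stacked `combined` model (generator feeding
# a frozen discriminator) are assumed to already be built and compiled.
batch_size, noise_dim = 64, 100
d_loss, g_loss = 1.0, 1.0

for step in range(2000):
    # discriminator side: skip the update if D is already far ahead of G
    if d_loss > 0.1 * g_loss:
        noise = np.random.uniform(0, 1, size=(batch_size, noise_dim))
        fake = generator.predict(noise)
        real = sample_real_batch(batch_size)  # hypothetical helper returning one-hot SMILES
        x = np.concatenate([real, fake])
        y = np.concatenate([np.ones(batch_size), np.zeros(batch_size)])
        d_loss = discriminator.train_on_batch(x, y)

    # generator side: always take at least one step
    noise = np.random.uniform(0, 1, size=(batch_size, noise_dim))
    g_loss = combined.train_on_batch(noise, np.ones(batch_size))
```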

pechersky commented 7 years ago

You're generating noise using uniform sampling between 0 and 1 at every position in the vector. Our true data is some weird subset of the space of all possible one-hot vectors. What if you started by skipping training the generator, and just tried to train a discriminator with true data vs random strings sampled from the alphabet?
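
Something along these lines for the negatives (the alphabet and max length would come from whatever the training data prep uses):

```python
import random

def random_string(alphabet, max_len):
    # uniform draw over the training alphabet, padded out to max_len
    length = random.randint(1, max_len)
    return ''.join(random.choice(alphabet) for _ in range(length)).ljust(max_len)

# negatives for discriminator-only pretraining: random strings over the same alphabet
charset = list("CNOSFPIBrcnos()[]=#123456789+-@/\\")  # placeholder alphabet
negatives = [random_string(charset, 120) for _ in range(10000)]
```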


maxhodak commented 7 years ago

I'm not sure that matters... this isn't an autoencoder; the input is just a source of entropy. The nonlinearities in the generator network should mean the distribution of the input need not resemble the distribution of the output unless I've misunderstood something.
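
To be concrete about what I mean, the generator is just some nonlinear map from a flat noise vector to a (length x alphabet) grid of per-position softmaxes; a minimal sketch of that shape of network (layer sizes arbitrary, not necessarily what's in the notebook):

```python
from keras.models import Sequential
from keras.layers import Dense, Reshape, LSTM, TimeDistributed

NOISE_DIM = 100
MAX_LEN, CHARSET_SIZE = 120, 35   # placeholders matching the data prep

generator = Sequential()
generator.add(Dense(512, activation='relu', input_dim=NOISE_DIM))
generator.add(Dense(MAX_LEN * 64, activation='relu'))
generator.add(Reshape((MAX_LEN, 64)))
generator.add(LSTM(256, return_sequences=True))
# per-position distribution over the character set
generator.add(TimeDistributed(Dense(CHARSET_SIZE, activation='softmax')))
```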

maxhodak commented 7 years ago

I've got something looking much better now after working in a bunch of the tricks linked above, though it still has a lot of room for improvement:

[screenshot: training curves, 2017-01-17 3:36 pm]

After 200 iterations the generator samples out stuff like:

CCcccC CCcccN CCcccc\ CCcccC CCcCc CCCcccccc\ CCcccC

Updated notebook at https://github.com/maxhodak/keras-molecules/blob/gan/SMILES_GAN.ipynb

pechersky commented 7 years ago

Nice! Do you think the simple strings are due to just insufficient training, or is it converging to a simple part of the string space? Perhaps including the KL divergence in the loss might help. If I understand correctly, the generator is like the "decoder" part of the VAE. Do you think the topology and layer identity (LSTM / GRU / something else) make a qualitative difference in the richness of the generated molecules? I was thinking something like a Grid LSTM (https://arxiv.org/pdf/1507.01526v1.pdf) might be appropriate for our system, due to both local and distant correlations in the string.


XericZephyr commented 7 years ago

Hey guys, glad I found this thread. I'm also working in this field. I've been trying to use a seq2seq model to produce an unsupervised fingerprint for each molecule, and I'm also planning to try a GAN as future work. Does anyone have any updates on this GAN idea?
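
For reference, a minimal sketch of that kind of seq2seq fingerprint in Keras (just one possible setup, with placeholder sizes): the encoder's final state serves as the fingerprint, and the decoder tries to reconstruct the input string from it.

```python
from keras.models import Model
from keras.layers import Input, LSTM, Dense, TimeDistributed, RepeatVector

MAX_LEN, CHARSET_SIZE, FP_DIM = 120, 35, 256   # placeholder sizes

# encoder: read the one-hot SMILES and compress it into a fixed-length vector
inp = Input(shape=(MAX_LEN, CHARSET_SIZE))
fingerprint = LSTM(FP_DIM)(inp)

# decoder: try to reconstruct the same string from the fingerprint
h = RepeatVector(MAX_LEN)(fingerprint)
h = LSTM(FP_DIM, return_sequences=True)(h)
out = TimeDistributed(Dense(CHARSET_SIZE, activation='softmax'))(h)

seq2seq = Model(inp, out)
seq2seq.compile(optimizer='adam', loss='categorical_crossentropy')
encoder = Model(inp, fingerprint)   # use this to extract fingerprints after training
```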