greenelab / deep-review

A collaboratively written review paper on deep learning, genomics, and precision medicine
https://greenelab.github.io/deep-review/

The cornucopia of meaningful leads: Applying deep adversarial autoencoders for new molecule development in oncology #213

Open agitter opened 7 years ago

agitter commented 7 years ago

http://doi.org/10.18632/oncotarget.14073

Recent advances in deep learning and specifically in generative adversarial networks have demonstrated surprising results in generating new images and videos upon request even using natural language as input. In this paper we present the first application of generative adversarial autoencoders (AAE) for generating novel molecular fingerprints with a defined set of parameters. We developed a 7-layer AAE architecture with the latent middle layer serving as a discriminator. As an input and output the AAE uses a vector of binary fingerprints and concentration of the molecule. In the latent layer we also introduced a neuron responsible for growth inhibition percentage, which when negative indicates the reduction in the number of tumor cells after the treatment. To train the AAE we used the NCI-60 cell line assay data for 6252 compounds profiled on MCF-7 cell line. The output of the AAE was used to screen 72 million compounds in PubChem and select candidate molecules with potential anti-cancer properties. This approach is a proof of concept of an artificially-intelligent drug discovery engine, where AAEs are used to generate new molecular fingerprints with the desired molecular properties.
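To make the described architecture concrete, here is a minimal sketch in TensorFlow/Keras (the paper's code is TensorFlow-based) of the pieces the abstract names: a 166-bit fingerprint plus concentration as input, a small latent code with a dedicated growth-inhibition (GI) neuron, a mirrored decoder, and a discriminator on the latent code. The latent size of 5 follows the summary below; hidden-layer widths are illustrative assumptions, not the authors' exact values.

```python
# Minimal sketch of the described AAE (TensorFlow/Keras). The 166-bit
# fingerprint + concentration input and the extra GI neuron follow the
# abstract; hidden-layer widths are illustrative assumptions.
import tensorflow as tf

FP_BITS, LATENT = 166, 5

# Encoder: fingerprint + concentration -> latent code plus GI neuron.
enc_in = tf.keras.Input(shape=(FP_BITS + 1,))
h = tf.keras.layers.Dense(128, activation="relu")(enc_in)
h = tf.keras.layers.Dense(64, activation="relu")(h)
z = tf.keras.layers.Dense(LATENT)(h)    # latent "noise" code
gi = tf.keras.layers.Dense(1)(h)        # growth-inhibition neuron
encoder = tf.keras.Model(enc_in, [z, gi])

# Decoder: latent code + GI condition -> fingerprint logits + concentration.
dec_in = tf.keras.Input(shape=(LATENT + 1,))
h = tf.keras.layers.Dense(64, activation="relu")(dec_in)
h = tf.keras.layers.Dense(128, activation="relu")(h)
fp_logits = tf.keras.layers.Dense(FP_BITS)(h)   # sigmoid applied in the loss
conc = tf.keras.layers.Dense(1)(h)
decoder = tf.keras.Model(dec_in, [fp_logits, conc])

# Discriminator on the latent code: Gaussian samples "real", encoder "fake".
disc_in = tf.keras.Input(shape=(LATENT,))
d = tf.keras.layers.Dense(64, activation="relu")(disc_in)
d_out = tf.keras.layers.Dense(1)(d)     # logit: real vs. generated code
discriminator = tf.keras.Model(disc_in, d_out)
```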

gwaybio commented 7 years ago

Interesting study that uses binarized chemical compound vectors of length 166 (which look like this: http://www.nature.com/nprot/journal/v9/n9/fig_tab/nprot.2014.151_F2.html) combined with dosage concentration data to generate new compounds that may help prioritize candidate small molecules for treating cancer patients.

Biological Aspects

  • Chemical compounds with dosage information as input
  • Also included is each chemical's corresponding growth inhibition in a breast cancer cell line (MCF-7)

Computational Aspects

  • An adversarial autoencoder (https://arxiv.org/abs/1511.05644) that encodes the input binarized chemical compound vectors into a length-5 latent layer
  • A 2-layer encoder learns how the molecular fingerprint impacts growth inhibition
    • The latent layer can thereby represent how well the corresponding fingerprint inhibits MCF-7 growth
  • A 2-layer decoder handles reconstruction
  • The adversarial training comes in as the authors sample from a learned prior distribution
    • A length-5 vector sampled from the prior is run through a discriminator that tries to tell real latent vectors from fake ones
    • Growth inhibition is sampled from a normal distribution with mean = 5 and variance = 1, independently of the prior
  • Once the model is trained, the sampled latent vector is decoded into an artificial molecular fingerprint with a corresponding drug concentration
  • This artificial fingerprint is compared against a reference of 72 million compounds from PubChem (https://pubchem.ncbi.nlm.nih.gov/)
    • The authors then selected the top 10 most similar compounds to each predicted compound, provided the decoded log concentration was less than -5.0 molar

(A minimal sketch of how these pieces train together follows this list.)
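Here is a sketch of one adversarial training step under this scheme, assuming the Keras encoder/decoder/discriminator models from the earlier sketch. The loss composition and optimizer settings are assumptions for illustration, not the authors' exact training code.

```python
# One adversarial training step for the AAE described above, assuming the
# Keras encoder/decoder/discriminator from the earlier sketch. Optimizer
# settings and loss weighting are illustrative assumptions.
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
ae_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)
g_opt = tf.keras.optimizers.Adam(1e-4)

def train_step(encoder, decoder, discriminator, x):
    """x: batch of rows [166 fingerprint bits, concentration]."""
    with tf.GradientTape(persistent=True) as tape:
        z, gi = encoder(x)
        fp_logits, conc = decoder(tf.concat([z, gi], axis=1))

        # Reconstruction: sigmoid cross-entropy on the bits, MSE on concentration.
        recon = (bce(x[:, :166], fp_logits)
                 + tf.reduce_mean(tf.square(x[:, 166:] - conc)))

        # Discriminator: Gaussian prior samples are "real", encoder codes "fake".
        d_real = discriminator(tf.random.normal(tf.shape(z)))
        d_fake = discriminator(z)
        d_loss = (bce(tf.ones_like(d_real), d_real)
                  + bce(tf.zeros_like(d_fake), d_fake))

        # Generator step: the encoder tries to make its codes look "real".
        g_loss = bce(tf.ones_like(d_fake), d_fake)

    ae_vars = encoder.trainable_variables + decoder.trainable_variables
    ae_grads = tape.gradient(recon, ae_vars)
    d_grads = tape.gradient(d_loss, discriminator.trainable_variables)
    g_grads = tape.gradient(g_loss, encoder.trainable_variables)
    del tape
    ae_opt.apply_gradients(zip(ae_grads, ae_vars))
    d_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_grads, encoder.trainable_variables))
```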

Why we should include it in our review

~I am not entirely sure if we should consider this paper for our review.~ edit: I think we can include it now, in the Treat section or as a method for prioritizing drug candidates / drug repurposing.

This is not my field of expertise, but I am interested in adversarial methods, so I gave this paper a thorough read. However, the methods, results, and evaluation remain a bit unclear to me. A really nice thing about this paper is the availability of the source code (https://github.com/spoilt333/onco-aae). Perhaps @spoilt333 can help to clarify some of my confusion. I outlined my understanding above, but a few points remain:

  1. Why was the growth inhibition (GI) sampled independently?

    • It seems to me that this is a critical component of the model: if the GI is high, then the drug is considered effective. Isn't this artificial sampling decoupled from the learning process?

  2. Why did the authors choose to sample 640 vectors, and how exactly did they determine similar compounds from PubChem?

  3. What is the discriminator? Is it using some sort of density metric or KL divergence compared to the latent distribution?

  4. There is no discussion of how the model trains and whether it is actually learning something meaningful. The authors do discuss several specific examples of "nearest" compounds very nicely, so the approach seems to be working, but it would be great to see some sort of model evaluation.

    • For example, what is the reconstruction cost associated with the autoencoder portion of the model, and what was the stopping criterion? How does it change across epochs?
    • What are the hyperparameters of the model, and how were they chosen?

Overall, I thought the paper elegantly laid out the problem of the very high drug development failure rate and the evolution of computational methods for compound prioritization, and the authors apply an approach that appears to work at first glance. I would love to see this approach pan out, as it is very promising for drug development and drug repurposing. However, given my concerns, perhaps it is not suitable for this review. Maybe we could discuss the idea of the approach in the discussion - I am not sure.

spoilt333 commented 7 years ago

Hello there. I'll try to answer your points:

  1. Actually, the GI neuron was trained jointly with the rest of the latent neurons as a predictor of the drug's "efficiency". After training, however, it was used as a tuner for generating new drugs: the latent layer acts as a kind of noise, the GI neuron serves as a condition for the decoder net, and both are used to produce the output.
  2. There was no particular reason to pick exactly 640 samples; we just had to choose some number :) Since the output layer has a sigmoid activation, we treat it as the probability that the corresponding bit of the compound code is present. So "similarity" was simply the likelihood of a compound being sampled from the generated vector (a sketch of this scoring follows this list).
  3. The discriminator is a standard GAN component. It is a binary classifier that tries to determine whether a sample came from some "true" distribution or was generated by the neural network. In our case, the true distribution was Gaussian, and the fake samples came from the encoder.
  4. This is really a big point, and we are going to make it clearer in the next paper. There are a few ideas. The most important hyperparameter is the latent layer size, IMO. We experimented with different sizes and ran into a problem with large ones: we could not get the generator to converge, whereas with only a few neurons it converges well. We have no answer for why it behaves this way, but the code has evolved a lot since the paper was published and we are going to try again. We also ran a few experiments with different autoencoder depths, but finally chose the same number as in the original AAE paper: two fully connected layers each for the encoder and the decoder. For the exact error numbers you should ask my co-author Kuzma Khrabrov; he is in copy along with Alex Zhavoronkov.
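As a concrete reading of points 1-2, here is a minimal sketch of the generation and scoring stage, assuming the Keras decoder from the earlier sketch. The 640 samples and the -5.0 log-concentration cutoff come from the discussion above; the GI condition value and the helper name generate_and_score are illustrative assumptions.

```python
# Sketch of generation + screening per points 1-2 above, assuming the
# `decoder` Keras model from the earlier sketch. The 640 samples and the
# -5.0 log-concentration cutoff follow the thread; the GI condition value
# is an illustrative assumption.
import numpy as np
import tensorflow as tf

def generate_and_score(decoder, library_fps, n_samples=640, latent_dim=5,
                       gi_condition=-1.0, log_conc_cutoff=-5.0):
    # The latent code is noise; the GI neuron is fixed to the desired condition.
    z = np.random.normal(size=(n_samples, latent_dim)).astype("float32")
    gi = np.full((n_samples, 1), gi_condition, dtype="float32")
    fp_logits, conc = decoder(np.concatenate([z, gi], axis=1))

    probs = tf.sigmoid(fp_logits).numpy()  # per-bit Bernoulli probabilities
    conc = conc.numpy().ravel()

    # Reject samples whose decoded log-concentration misses the cutoff.
    probs = probs[conc < log_conc_cutoff]

    # "Similarity" = Bernoulli log-likelihood of each library fingerprint
    # under the generated bit probabilities, summed over the 166 bits.
    eps = 1e-7
    loglik = (library_fps @ np.log(probs.T + eps)
              + (1.0 - library_fps) @ np.log(1.0 - probs.T + eps))
    return loglik  # loglik[i, j]: library compound i vs. generated sample j
```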


gwaybio commented 7 years ago

Hi @spoilt333 - this is great! Thanks for your prompt response - I think this clears up a lot. I'll respond to your points below:

  1. Actually, the GI neuron was trained jointly with the rest of the latent neurons as a predictor of the drug's "efficiency". After training, however, it was used as a tuner for generating new drugs: the latent layer acts as a kind of noise, the GI neuron serves as a condition for the decoder net, and both are used to produce the output.

Ah, I see, this makes sense now - I think this is a nice innovation! I can see then that the rejection criterion was whether or not the concentration of the corresponding reconstructed molecular fingerprint was reasonable.

  2. There was no particular reason to pick exactly 640 samples; we just had to choose some number :) Since the output layer has a sigmoid activation, we treat it as the probability that the corresponding bit of the compound code is present. So "similarity" was simply the likelihood of a compound being sampled from the generated vector.

Great, ok, I see now. I must have missed that the output layer was sigmoid.

  3. The discriminator is a standard GAN component. It is a binary classifier that tries to determine whether a sample came from some "true" distribution or was generated by the neural network. In our case, the true distribution was Gaussian, and the fake samples came from the encoder.

Yep! I was wondering what the architecture of the discriminator was. It sounds like it could be a logistic regression classifier? Or did you sample several times from the generator and reject samples that fell outside the distribution of the real latent space?

  4. This is really a big point, and we are going to make it clearer in the next paper. There are a few ideas. The most important hyperparameter is the latent layer size, IMO.

I have found this to be the case as well. Looking forward to the next paper.

Thanks again for responding so quickly, I will update my summary posted above accordingly.

spoilt333 commented 7 years ago

I think it may not be clear enough from the code because of some optimization tricks. You're right: the discriminator is a logistic regression classifier with a reformulated cost. As for the output layer - it has no activation in the code, but the sigmoid is applied inside tf.nn.sigmoid_cross_entropy_with_logits to evaluate the cost. And, of course, we also applied it after generating new vectors.
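For readers following along in the code, a toy illustration of this logits trick; the tensor values here are made-up examples, not values from the repo.

```python
# Toy illustration of the logits trick described above: no activation on
# the output layer during training, sigmoid applied only at generation.
import tensorflow as tf

fp_bits = tf.constant([[1.0, 0.0, 1.0]])      # target fingerprint bits (toy)
fp_logits = tf.constant([[2.1, -1.3, 0.4]])   # raw decoder outputs, no activation

# Training: the sigmoid is applied inside the loss, for numerical stability.
loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=fp_bits, logits=fp_logits)

# Generation: apply the sigmoid explicitly to recover per-bit probabilities.
bit_probs = tf.sigmoid(fp_logits)
```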
