singnet / cancer

OpenCog/SNet precision medicine for clinical trials PoC project

Design and train generative model for gene expressions #5

Open noskill opened 4 years ago

noskill commented 4 years ago

The basic idea is to create a generative model with a structure that allows us to disentangle features in our datasets. In particular, we want to disentangle the variability in gene expressions coming from the tumor type from the variability coming from the particular procedure used to measure gene expression.

noskill commented 4 years ago

I'll start with an adversarial autoencoder for gene expressions:

gene_expression -> encoder -> code -> decoder -> gene_expression
                              code -> classifier by study

The code is split into two parts, and I'm adding a classification loss to these two parts. A classifier taking the first part of the code shouldn't be able to attribute it to a particular study, while a classifier taking the second part of the code should be able to attribute it to a particular study.

This should give some level of disentanglement at the level of the code.
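
To make this concrete, here is a minimal PyTorch sketch of the setup above (the layer sizes, module names, and the gradient-reversal trick used for the adversarial part are my illustrative assumptions, not the actual project code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_GENES, CODE, N_STUDIES = 5000, 64, 10  # hypothetical sizes

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on the
    backward pass, so the encoder learns to REMOVE study information
    from the first half of the code while the classifier tries to find it."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return grad.neg()

encoder = nn.Sequential(nn.Linear(N_GENES, 512), nn.ReLU(), nn.Linear(512, CODE))
decoder = nn.Sequential(nn.Linear(CODE, 512), nn.ReLU(), nn.Linear(512, N_GENES))
clf_free  = nn.Linear(CODE // 2, N_STUDIES)  # should FAIL to predict the study
clf_study = nn.Linear(CODE // 2, N_STUDIES)  # should SUCCEED at predicting it

def losses(x, study):
    code = encoder(x)
    z, v = code[:, :CODE // 2], code[:, CODE // 2:]   # split the code
    rec  = F.mse_loss(decoder(code), x)               # reconstruction
    # adversarial part: reversed gradients push the encoder to hide study info in z
    adv  = F.cross_entropy(clf_free(GradReverse.apply(z)), study)
    # cooperative part: v must keep the study information
    keep = F.cross_entropy(clf_study(v), study)
    return rec + adv + keep
```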

noskill commented 4 years ago

Alexey suggested also trying InfoGAN.

Necr0x0Der commented 4 years ago

Let me explain the idea with generative models, because it can have different implementations. We want to represent gene expression levels with a latent code, which apparently should have two parts: a study-dependent one (v_i) and a study-independent one (z_i). The latter should describe the objective characteristics of the tumors themselves, while the former should capture platform specifics, biases in the patient distributions, and such. We want these two parts to be independent (statistically/informationally). We also want the tumor-specific part to be useful for predictions.

We can consider two types of generative models, GANs and autoencoders, as well as their mixture. Autoencoders seem more intuitive, but with them it might be trickier to understand what should be shared between different encoders/decoders. So, let's start with GANs. In fact, we should have only one decoder/generator, x_i = G(z_i, v_i), because our latent code already contains the study-specific information. Alternatively, we could have individual generators per study; in that case we would not need v_i: x_i = G_s(z_i). But then the question would be how to make the latent spaces of these generators shared. So, let's proceed with one generator.

We can try training it directly using adversarial loss. However, we will need to somehow separate z_i and v_i.
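
A rough PyTorch sketch of this single-generator setup, just to fix notation (concatenating z_i and v_i as the generator input, and all shapes and module definitions, are assumptions for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

Z_DIM, V_DIM, N_GENES = 32, 8, 5000  # hypothetical sizes

# one shared generator: x = G(z, v)
G = nn.Sequential(nn.Linear(Z_DIM + V_DIM, 512), nn.ReLU(),
                  nn.Linear(512, N_GENES))
D = nn.Sequential(nn.Linear(N_GENES, 512), nn.ReLU(),
                  nn.Linear(512, 1))  # real/fake logit

def gan_losses(x_real):
    n = x_real.size(0)
    z = torch.randn(n, Z_DIM)      # study-independent part
    v = torch.randn(n, V_DIM)      # study-dependent part
    x_fake = G(torch.cat([z, v], dim=1))
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)
    d_loss = F.binary_cross_entropy_with_logits(D(x_real), ones) + \
             F.binary_cross_entropy_with_logits(D(x_fake.detach()), zeros)
    g_loss = F.binary_cross_entropy_with_logits(D(x_fake), ones)
    return d_loss, g_loss
```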

1) The first idea is to use InfoGANs. Indeed, InfoGANs were designed to learn z_i and v_i as independent variables (they have a corresponding mutual-information loss component, sketched below). We can specify z as a categorical distribution, for example, and see whether it is able to learn meaningful molecular subtypes. Applying InfoGANs out of the box should be easy and is worth trying. However, the drawback of vanilla InfoGANs is that they will not use the available information about the distribution of samples over the studies.
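
For reference, the InfoGAN mutual-information term boils down to an auxiliary Q-network that has to recover the latent code from the generated sample. A sketch (in the real InfoGAN, Q shares most layers with the discriminator; the categorical code standing in for a molecular subtype is an assumption here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_CAT, NOISE_DIM, N_GENES = 10, 32, 5000  # hypothetical sizes

G = nn.Sequential(nn.Linear(NOISE_DIM + N_CAT, 512), nn.ReLU(),
                  nn.Linear(512, N_GENES))
Q = nn.Sequential(nn.Linear(N_GENES, 512), nn.ReLU(),
                  nn.Linear(512, N_CAT))  # recovers the categorical code

def mutual_info_loss(batch_size):
    # categorical code: a candidate "molecular subtype" variable
    c = torch.randint(N_CAT, (batch_size,))
    noise = torch.randn(batch_size, NOISE_DIM)
    x_fake = G(torch.cat([noise, F.one_hot(c, N_CAT).float()], dim=1))
    # variational lower bound on I(c; x): Q must be able to recover c
    return F.cross_entropy(Q(x_fake), c)
```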

2) SD-GANs can use this information. Basically, we can require that v_i = const over one study. Unfortunately, SD-GANs would typically require that we can provide pairs with the same z_i but different v_i, and we just don't have ground truth for this (very similar patients in different studies). We can try applying SD-GANs in our case, but I'm not sure they will work well. So, I'd rather try injecting a partial SD-GAN-type loss into InfoGANs. Simply put, we sample two codes z_i, z_j and one v_i, produce x_i and x_j, and require the discriminator to recognize them as coming from one study (see the sketch below). But this is a question for more detailed analysis.
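
A sketch of that hybrid pair loss (the pair discriminator and all names are my assumptions): sample two tumor codes z_i, z_j with one shared study code v, and ask a pair discriminator to accept the generated pair as coming from one study.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

Z_DIM, V_DIM, N_GENES = 32, 8, 5000  # hypothetical sizes

G = nn.Sequential(nn.Linear(Z_DIM + V_DIM, 512), nn.ReLU(),
                  nn.Linear(512, N_GENES))
# pair discriminator: sees two expression vectors, judges "same study?"
D_pair = nn.Sequential(nn.Linear(2 * N_GENES, 512), nn.ReLU(),
                       nn.Linear(512, 1))

def same_study_loss(batch_size):
    z_i = torch.randn(batch_size, Z_DIM)
    z_j = torch.randn(batch_size, Z_DIM)   # two different tumor codes
    v   = torch.randn(batch_size, V_DIM)   # one shared study code
    x_i = G(torch.cat([z_i, v], dim=1))
    x_j = G(torch.cat([z_j, v], dim=1))
    logits = D_pair(torch.cat([x_i, x_j], dim=1))
    # the generator is rewarded when the pair passes as one study
    return F.binary_cross_entropy_with_logits(logits,
                                              torch.ones(batch_size, 1))
```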

Instead of an adversarial loss, we can use a reconstruction loss, which leads us to autoencoders (and we can use an AAE in order to impose priors on the latent code). The problem is that there are no standard models which would help us split the latent code of an AAE into two parts (although there are very many less standard existing models which can do something similar). One way to do this is to use z_i as the input to the predictor, so we can add a prediction/classification loss. But we may also want to be sure that v_i doesn't contain information useful for classification (and there are different tricks to do this; one is sketched below). We can add this prediction loss to InfoGANs as well. But generally speaking, we need to develop a more detailed model, which may not be reducible to the existing vanilla models or their simple heuristic modifications...
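
As an illustration of one such trick (a minimal sketch under assumed shapes, not project code): add a prediction head on z_i, plus a separate probe trained on a detached v_i to monitor whether v_i leaks label information. A gradient-reversal variant of the same probe would actively remove that information.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_GENES, Z_DIM, V_DIM, N_CLASSES = 5000, 32, 8, 5  # hypothetical sizes

encoder = nn.Sequential(nn.Linear(N_GENES, 512), nn.ReLU(),
                        nn.Linear(512, Z_DIM + V_DIM))
decoder = nn.Sequential(nn.Linear(Z_DIM + V_DIM, 512), nn.ReLU(),
                        nn.Linear(512, N_GENES))
predictor  = nn.Linear(Z_DIM, N_CLASSES)  # tumor-level prediction from z
leak_probe = nn.Linear(V_DIM, N_CLASSES)  # checks v for leaked label info

def losses(x, y):
    code = encoder(x)
    z, v = code[:, :Z_DIM], code[:, Z_DIM:]
    rec  = F.mse_loss(decoder(code), x)      # reconstruction
    pred = F.cross_entropy(predictor(z), y)  # z must be predictive
    # the probe is trained on a detached v: it cannot change the encoder,
    # but its accuracy tells us whether v still carries label information
    leak = F.cross_entropy(leak_probe(v.detach()), y)
    return rec, pred, leak
```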