greenelab / tybalt

Training and evaluating a variational autoencoder for pan-cancer gene expression data
BSD 3-Clause "New" or "Revised" License

Incorporate ADAGE model parameter sweep #58

Closed · gwaybio closed 7 years ago

gwaybio commented 7 years ago

This is a somewhat larger PR this time, covering one parameter sweep chunk.

It incorporates small changes to existing scripts so they can be used with ADAGE, and adds ADAGE-specific parameter sweep scripts and figures. The next PR will include the Jupyter notebook that trains the optimal model.

jaclyn-taroni commented 7 years ago

@gwaygenomics is this ready for another look or are you still looking into GaussianDropout?

gwaybio commented 7 years ago

is this ready for another look or are you still looking into GaussianDropout?

Still looking into it - I will post a message when the PR is ready for review again.

gwaybio commented 7 years ago

Going to track responses here (also for future reference).

In response to GaussianDropout() - I believe it's not doing what I assumed it was doing, perhaps because of the name. The key component of GaussianDropout() is here:

return inputs * K.random_normal(shape=K.shape(inputs),
                                mean=1.0,
                                stddev=stddev)

This adds regularization, but it does not drop out inputs: it returns the inputs multiplied by random values centered at 1. I should instead be using Dropout() after the input layer, which I think is closer to the denoising autoencoder that ADAGE is based on.
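
For contrast, a minimal sketch (the 0.1 rates are illustrative placeholders, not the swept values) of the two layers applied to the input tensor: GaussianDropout multiplies each input by a draw from a normal distribution centered at 1, while Dropout zeroes a random fraction of inputs at training time, which matches the corruption step of a denoising autoencoder.

from keras.layers import Input, Dropout, GaussianDropout

inputs = Input(shape=(num_features, ))

# Multiplies inputs element-wise by draws from N(1, stddev); nothing is set to zero
gaussian_noised = GaussianDropout(0.1)(inputs)

# Randomly sets a fraction of the inputs to zero during training (denoising-style corruption)
dropped_inputs = Dropout(0.1)(inputs)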

gwaybio commented 7 years ago

In response to the encoder.predict() comment:

I was previously building the encoder model with encoder = Model(input_rnaseq, encoded_rnaseq_2) and then building the latent features with latent_features = encoder.predict(rnaseq_df). This produced a large number of zeroed-out latent features. I thought this was an issue with the model, so I tested it by computing the latent features a different way, directly from the extracted weights:

# Project the expression data through the first encoding layer's weights
weight_matrix = pd.DataFrame(autoencoder.get_weights()[0], index=rnaseq_df.columns)
latent_features = rnaseq_df.dot(weight_matrix) + autoencoder.get_weights()[1]  # plus bias term

and this did not produce zeroed latent features.

The key issue was the call to encoded_rnaseq_2 = Dense(latent_dim, activation='relu'..., specifically, including the relu activation when building this layer. The relu is applied during the encoder.predict() call, which sets all negative features to zero. To fix this, I will separate the activation call from the weight layer instance:

# Build the Keras graph
input_rnaseq = Input(shape=(num_features, ))
# Corrupt the input with standard dropout, as in a denoising autoencoder
encoded_rnaseq = Dropout(noise)(input_rnaseq)
# Encoding layer with no built-in activation so pre-activation values can be extracted
encoded_rnaseq_2 = Dense(encoding_dim,
                         activity_regularizer=l1(sparsity))(encoded_rnaseq)
activation = Activation('relu')(encoded_rnaseq_2)
decoded_rnaseq = Dense(num_features, activation='sigmoid')(activation)

autoencoder = Model(input_rnaseq, decoded_rnaseq)
# Encoder model now outputs the pre-activation encoded layer
encoder = Model(input_rnaseq, encoded_rnaseq_2)

Separating this out solves the issue.

# Get latent features with the `encoder.predict()` method
predict_latent = pd.DataFrame(encoder.predict(np.array(rnaseq_df)))

# Get latent features with the matrix multiply method
weight_latent = rnaseq_df.dot(pd.DataFrame(autoencoder.get_weights()[0], index=rnaseq_df.columns))
weight_latent = weight_latent + autoencoder.get_weights()[1]  # add bias term

# Visualize the per-sample difference between the two methods
g = (pd.DataFrame(np.array(predict_latent) - np.array(weight_latent)).sum(axis=1) / rnaseq_df.shape[0]).hist()
fig = g.get_figure()
fig.savefig('figures/testing_latent_feature.png')

[figure: testing_latent_feature.png]

I will use the encoder.predict() method in the future, but will keep this issue in mind moving forward.

gwaybio commented 7 years ago

ok @jaclyn-taroni - ready for another look. Thanks!

jaclyn-taroni commented 7 years ago

@gwaygenomics can you comment on why the loss goes up in some of these figures? Is this a bug?

gwaybio commented 7 years ago

@jaclyn-taroni no bug - sparsity = 0.001 is just too high. It zeroes out too many weights for the signal to be reconstructed, so learning never occurs. I removed it from consideration in the overall final loss figure.
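
For reference, the sparsity value is the l1 coefficient on the encoded layer's activations (the activity_regularizer=l1(sparsity) call in the graph above), so the penalty added to the loss scales linearly with it. A toy sketch with illustrative numbers (not values from the sweep) of how the coefficient changes the penalty:

import numpy as np

# The l1 activity penalty added to the loss is: coefficient * sum(|encoded activations|).
# With a large coefficient, driving activations to zero reduces the loss more than
# reconstructing the input does, so training stalls.
encoded_activations = np.random.rand(100)  # hypothetical activations for one sample
for coefficient in (1e-6, 1e-3):           # 1e-3 is the value dropped from the sweep
    penalty = coefficient * np.abs(encoded_activations).sum()
    print(coefficient, penalty)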

jaclyn-taroni commented 7 years ago

@gwaygenomics - gotcha, thanks!

gwaybio commented 7 years ago

I should probably add the parameter sweep results for ADAGE in this PR, since they're here for Tybalt...

gwaybio commented 7 years ago

Any other comments, @jaclyn-taroni? If not, I will merge this in.

jaclyn-taroni commented 7 years ago

@gwaygenomics I will go through this shortly. Haven't taken a good look at all the changes, just wanted to check in about those plots first.

gwaybio commented 7 years ago

:+1: