google / prettytensor

Pretty Tensor: Fluent Networks in TensorFlow

batch_normalize=True doesn't work accurately with phase=Phase.* setting #23

Closed jramapuram closed 8 years ago

jramapuram commented 8 years ago

I believe that there is an error when using phase in the defaults_scope coupled with batch_normalize=True.

Basically it looks like this:

    def encoder(self, inputs, latent_size, activ=tf.nn.elu, phase=pt.Phase.train):
        with pt.defaults_scope(activation_fn=activ,
                               batch_normalize=True,
                               learned_moments_update_rate=0.0003,
                               variance_epsilon=0.001,
                               scale_after_normalization=True,
                               phase=phase):
            params = (pt.wrap(inputs).
                      reshape([-1, self.input_shape[0], self.input_shape[1], 1]).
                      conv2d(5, 32, stride=2).
                      conv2d(5, 64, stride=2).
                      conv2d(5, 128, edges='VALID').
                      flatten().
                      fully_connected(self.latent_size * 2, activation_fn=None)).tensor

Full code here: https://github.com/jramapuram/CVAE/blob/master/cvae.py. If I remove phase=phase from the scope arguments, my model produces the following: [image: 2d_cluster_orig]

However, when setting the phase appropriately I get the following: [image: 2d_cluster]

This is trained for the same number of iterations using the same model.

eiderman commented 8 years ago

In most typical usage, batch normalization is only applied during training, and the moving averages are tracked for inference time, when the batch size tends to be 1. Because of this, Phase.infer and Phase.test use variables in the graph that tracked the stddev/mean of the batches during training.
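Roughly, the two paths look like this (a minimal numpy sketch of the idea only, not Pretty Tensor's actual implementation; update_rate plays the role of learned_moments_update_rate):

    import numpy as np

    def batch_norm(x, state, phase='train', update_rate=0.0003, eps=0.001):
        # Toy batch norm: batch statistics in train, tracked moments otherwise.
        if phase == 'train':
            mean, var = x.mean(axis=0), x.var(axis=0)
            # Track moving averages of the moments for inference time.
            state['mean'] += update_rate * (mean - state['mean'])
            state['var'] += update_rate * (var - state['var'])
        else:  # Phase.test / Phase.infer reuse the tracked moments.
            mean, var = state['mean'], state['var']
        return (x - mean) / np.sqrt(var + eps)

    state = {'mean': np.zeros(4), 'var': np.ones(4)}  # BN's initial moments
    x = np.random.randn(32, 4)
    batch_norm(x, state, phase='train')  # normalized with this batch's stats
    batch_norm(x, state, phase='test')   # normalized with the tracked averages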

I feel like the infer/test paths may need better documentation to clear this up. Are you having the problem when using pt.Phase.train as well?

jramapuram commented 8 years ago

Yes, that is correct, @eiderman. I use phase=pt.Phase.train during training and phase=pt.Phase.test during testing. I haven't permuted them yet (i.e. tried train for test, etc.).

eiderman commented 8 years ago

I've checked the implementation and it should be doing the correct thing. @jramapuram, would you mind explaining the graph to me? Also, how does this impact the evaluation metrics for the relevant loss on the test set?

jramapuram commented 8 years ago

I have a convolutional variational autoencoder mapping to a two-dimensional latent space; it disentangles the manifold seen above (of MNIST). When I do not use phase=* (in the scope) I see fig. 1, which is the correct expectation. When I add the phase=* option I see fig. 2. I have tried re-training many times but still face the same issue. With regards to metrics: since this is unsupervised, it is somewhat hard to quantify.

My train/test objects are simply this [note: in train the phase defaults to phase=pt.Phase.train and is thus omitted]:

            with tf.variable_scope("z"): # Encode our data into z and return the mean and covariance
                self.z_mean, self.z_log_sigma_sq = self.encoder(self.inputs, latent_size)
                self.z = tf.add(self.z_mean,
                                tf.mul(tf.sqrt(tf.exp(self.z_log_sigma_sq)), eps))
                # Get the reconstructed mean from the decoder
                self.x_reconstr_mean = self.decoder(self.z, self.input_size)
                self.z_summary = tf.histogram_summary("z", self.z)

            with tf.variable_scope("z", reuse=True): # The test z
                self.z_mean_test, self.z_log_sigma_sq_test = self.encoder(self.inputs, latent_size, phase=pt.Phase.test)
                self.z_test = tf.add(self.z_mean_test,
                                     tf.mul(tf.sqrt(tf.exp(self.z_log_sigma_sq_test)), eps))
                # Get the reconstructed mean from the decoder
                self.x_reconstr_mean_test = self.decoder(self.z_test, self.input_size, phase=pt.Phase.test)

eiderman commented 8 years ago

Batch normalization is behaving correctly, but I would really like to understand this phenomenon more because it may have modeling implications on best practice for BN.

One experiment that may help verify this is to check whether your test results are as good when running smaller batches rather than all 10k at once. It may be that normalizing the output based on all test examples results in a cleaner embedding. The default inference behavior of BN is geared towards generating correct and stable predictions for small batch sizes.
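Sketched out, that check might look like the following (a hypothetical helper cvae.reconstruction_loss is assumed here, returning the mean test loss for one batch; source is the MNIST feed used elsewhere in this thread):

    import numpy as np

    # Compare the test reconstruction loss across batch sizes; if BN's
    # inference path is healthy the numbers should be nearly identical.
    for batch_size in [1, 16, 100, 10000]:
        losses = []
        for _ in range(10000 // batch_size):
            x, _ = source.test.next_batch(batch_size)
            losses.append(cvae.reconstruction_loss(sess, x))
        print batch_size, np.mean(losses)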

It would be interesting to see how the accuracy changes on the test set if you were to attach a softmax layer to the embedding (and not training lower layers by using no_gradients()) and test it on various batch sizes.

Yet another aspect worth testing is which projection works better as a VAE. Since one of the goals is a decoder that can be easily sampled to generate new results, I suspect that a denser region of digits may work better, since there is less likely to be junk space that produces non-digits within the sample space.

jramapuram commented 8 years ago

@eiderman : Will give it a shot for smaller batch sizes (i.e. same as training). However, this still doesn't answer why it would work when no phase parameter is provided. Does batch normalization turn off without a provided phase parameter?

I'm not sure the softmax layer makes any sense. This is a pure unsupervised problem. There are no class labels that can be provided to update the softmax's weights & biases. I'm assuming you would be talking about a softmax+cross-entropy as an optimization objective.

eiderman commented 8 years ago

Jason, if you do not set the phase, it defaults to 'train' in both cases. This means that the version without the phase set is performing normalization during inference using the test-set activations. That is not really a good thing: it can easily bring the network outside of the ranges seen during training, and a test example's prediction may be sensitive to the other items in the batch (see the small demo after the list below). In your case, it appears to have made your model do a better separation, but there are some caveats:

  1. 2D isn't sufficient to capture the space, so this may just be noise.
  2. Both the Phase.train and Phase.test runs on the test set have large patches of intermingled values, and it isn't obvious which is better overall.
  3. When using a VAE as a generative model, a denser embedding is often preferable. Empty spaces in the embedding may correspond to junk digits instead of plausible images.
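To see the batch sensitivity concretely, here is a tiny numpy demo (mine, not taken from the model above): the same example gets two different normalized values depending on its batch-mates.

    import numpy as np

    x = [1.0, 2.0, 3.0]                       # the same example...
    batch_a = np.array([x, [0.0, 0.0, 0.0]])  # ...in two different batches
    batch_b = np.array([x, [9.0, 9.0, 9.0]])
    norm = lambda b: (b - b.mean(0)) / np.sqrt(b.var(0) + 0.001)
    print norm(batch_a)[0]  # different outputs for the same input,
    print norm(batch_b)[0]  # driven purely by the other batch items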

To test 1 & 2, I would recommend either computing the test reconstruction loss (preferable) or attaching a classification loss and only training the classification layer. While I suggested softmax before, I think nearest neighbor vs. the train set would work just as well for a smoke test.
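A rough sketch of that smoke test (assuming z_train/z_test hold the encoder outputs for each split and y_train/y_test the one-hot MNIST labels; scikit-learn's kNN stands in for the nearest-neighbor check):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Fit nearest neighbors on the training embedding and score on the
    # test embedding; a well-separated embedding should score well here.
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(z_train, np.argmax(y_train, 1))
    print 'kNN accuracy:', knn.score(z_test, np.argmax(y_test, 1))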

To test 3, just sample from the model and make sure to hit the white space on your graph to see how the digits look. Doing enough of these to achieve statistical significance would be hard, but sampling from the VAE with a Gaussian prior should give you roughly equal probability across the space.
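The sampling itself is just the following (assuming a hypothetical cvae.generate(sess, z) helper that runs the decoder on given latent codes):

    import numpy as np

    # Draw latent codes from the unit-Gaussian prior and decode them;
    # aim some of the samples at the empty regions of the scatter plot.
    z = np.random.randn(100, 2)  # 2-D latent space, as in this model
    digits = cvae.generate(sess, z)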


jramapuram commented 8 years ago

@eiderman: I updated my logic to do inference using only batch_size samples at a time, like so:

import numpy as np
import matplotlib.pyplot as plt

def plot_2d_cvae(sess, source, cvae):
    z_mu = []
    y_sample = []
    for _ in range(np.floor(10000.0 / FLAGS.batch_size).astype(int)):
      x_sample, y = source.test.next_batch(FLAGS.batch_size)
      z_mu.append(cvae.transform(sess, x_sample))
      y_sample.append(y)

    z_mu = np.vstack(z_mu)
    y_sample = np.vstack(y_sample)
    print 'z.shape = ', z_mu.shape, ' | y_sample.shape = ', y_sample.shape

    plt.figure(figsize=(8, 6))
    plt.scatter(z_mu[:, 0], z_mu[:, 1], c=np.argmax(y_sample, 1))
    plt.colorbar()
    plt.savefig("models/2d_cluster.png", bbox_inches='tight')
    #plt.show()

When the phase is set to test, it looks like the same issue is present: [image: 2d_cluster] https://cloud.githubusercontent.com/assets/8204807/15824556/0b19c8f4-2c00-11e6-99f6-213158c23c6a.png

However, setting phase=train for both test & train accurately separates the manifold: [image: 2d_cluster] https://cloud.githubusercontent.com/assets/8204807/15824544/f5dd4cb8-2bff-11e6-9711-c4fdd63ebbff.png

To address your points:

  1. The 2d representation is perfectly sufficient for MNIST: the manifold has been shown to be separable in this manner in the SOM, autoencoder, and t-SNE literature, so I don't believe that is the issue at hand.
  2. There is no intermingling going on. Phase.train is used on the parameters that are optimized during training time, using the MNIST training data. Phase.test is used at test time with the reused parameters (i.e. weights/biases) but operating on the MNIST test data. The training loss after around 400 epochs is 138.141. This is the standard two-part VAE loss. I haven't had the time to add an extra layer and such.
  3. I am not using it as a generative model for the above use case, merely as one to visualize a disentangled feature space. However, here is the visualization of the reconstruction as requested from both cases (one with Phase.train for train & Phase.test for test [the correct method], and one with Phase.train set for both test & train [the incorrect method, which shows that batch_normalization is NOT working accurately]).

Listed below is the reconstruction when Phase.test is set accurately: [image: 20d_reconstr_4] https://cloud.githubusercontent.com/assets/8204807/15824647/7a5685c2-2c00-11e6-9398-fba7bc504a26.png

And here is when using Phase.train: [image: 20d_reconstr_4] https://cloud.githubusercontent.com/assets/8204807/15824672/9c3be650-2c00-11e6-992b-784e5d076063.png

When using batch normalization with the running mean, it appears to be projecting to ~ the same location (as per the reconstruction). Thus I believe there is something wrong with the batch_normalization implementation on the conv2d op.

eiderman commented 8 years ago

My apologies for being obtuse. BN is working as intended, but there is a gotcha (which I am currently fixing): in order to update the averaged mean and variance variables, you need to run the update ops on each iteration.

These are executed by adding a dependency on pt.with_update_ops as documented here: https://github.com/google/prettytensor/blob/master/docs/pretty_tensor_top_level.md#apply_optimizerlosses-regularizetrue-include_markedtrue
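In code, the workaround looks roughly like this (a sketch only: pt.apply_optimizer is the form used in the Pretty Tensor examples, and with_update_ops is assumed to wrap an existing train op as described above):

    optimizer = tf.train.GradientDescentOptimizer(0.01)
    # Either let Pretty Tensor build the train op, attaching the marked
    # bookkeeping updates for you...
    train_op = pt.apply_optimizer(optimizer, losses=[loss])
    # ...or wrap an existing train op so the moving mean/variance update
    # ops run as a dependency on every training step.
    train_op = pt.with_update_ops(optimizer.minimize(loss))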

This is really a poor API to trickle out to other users, so I will fix it so that the updates are part of the graph.


jramapuram commented 8 years ago

Great! Thanks!

eiderman commented 8 years ago

I added a fix to automatically compute the running variance/mean for inference time. If you have any other issues, please let me know!

I'm a little surprised at how poorly the model did with the initial variance (1.0) and mean (0.0). I would have expected the training to have made it somewhat resilient to scale and shift of features.

jramapuram commented 8 years ago

Great! Will give it a shot and get back.

jramapuram commented 8 years ago

Thanks for the assistance @eiderman! It is working as intended now. [image: 2d_cluster]