rinuboney / ladder

Ladder network is a deep learning algorithm that combines supervised and unsupervised learning.

clean batch normalization should use batch mean and var for training #3

Open mikowals opened 8 years ago

mikowals commented 8 years ago

I think you are using noise_std > 0 to distinguish both the clean vs. corrupted path and training vs. evaluation. This causes a problem: during evaluation, batch norm should always normalize with the running averages computed from the training examples, while during training, batch norm is meant to introduce regularizing noise by normalizing with the batch mean and var.

I changed the code so that update_batch_normalization only runs during training on the clean path and always normalizes with the mean and var of the batch, like this:

def update_batch_normalization(batch, mean, var, l):
  # fall back to the batch statistics if none were passed in
  # (`is None` rather than `not ...`, since these are Tensors)
  if mean is None or var is None:
    mean, var = tf.nn.moments(batch, axes=[0])
  # track the batch statistics with this layer's running averages
  assign_mean = running_mean[l-1].assign(mean)
  assign_var = running_var[l-1].assign(var)
  bn_assigns.append(ewma.apply([running_mean[l-1], running_var[l-1]]))
  with tf.control_dependencies([assign_mean, assign_var]):
    # always normalize with the batch mean and var during training
    return (batch - mean) / tf.sqrt(var + 1e-12)

I passed a boolean placeholder to the encoder to separate training loops from evaluation loops. Then, inside the encoder, I used batch_normalization to normalize with the running averages outside of training steps:

if training and noise_std == 0.0:
  z = join(update_batch_normalization(z_pre_l, m_l, v_l, l), batch_normalization(z_pre_u, m, v))
elif training:
  z = join(batch_normalization(z_pre_l, m_l, v_l), batch_normalization(z_pre_u, m, v))
else:
  mean = ewma.average(running_mean[l-1])
  var = ewma.average(running_var[l-1])
  z = join(batch_normalization(z_pre_l, m_l, mean, var), batch_normalization(z_pre_u, mean, var))

This may still not be completely right, since I was making all examples labeled examples. With this and the variable initialization fix, I trained with 60k labeled examples down to 0.59% error.

rinuboney commented 8 years ago

Yeah, that's how batch normalization is usually done, but in the Theano code published along with the paper I didn't see how training and testing are distinguished. Here, the update is done if it's on the clean path.

How different is the accuracy when this change is made?

mikowals commented 8 years ago

The accuracy improves from 99.29% to 99.41% by using the batch mean and batch var during training. Those are just single runs, but the lower one is pretty far outside the error bounds of the paper's results.

I don't think the update of the moving averages is the problem. I think the problem comes from always returning the normalization based on the running averages when the update code is called. These lines:

if avg_mean and avg_var:
  return (batch - avg_mean) / tf.sqrt(avg_var + 1e-10)

The pseudocode on page 5 of the paper also looks to me like the _batch_ mean and var are used to normalize in the decoding step.

It is probably enough just to make sure update_batch_normalization always returns normalization based on the batch during training steps. I was originally concerned that the evaluation data could also be impacting the running averages, but because the control dependency is placed on the training step, I think the moving-average updates never actually happen from the eval data.
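
To illustrate, here is a minimal sketch of that gating (graph-mode TF of the era; loss, learning_rate, inputs, accuracy and sess stand in for the real names in the code):

# bn_assigns is the list of ewma.apply(...) ops built in the encoder
bn_updates = tf.group(*bn_assigns)

train_step = tf.train.AdamOptimizer(learning_rate).minimize(loss)
with tf.control_dependencies([train_step]):
  # the EWMA updates execute only as a side effect of the train step
  train_op = tf.group(bn_updates)

# training run: optimizer update AND moving-average updates
sess.run(train_op, feed_dict={inputs: train_x, training: True})
# eval run: bn_updates is not reachable from this fetch, so the
# eval data never touches the running averages
sess.run(accuracy, feed_dict={inputs: test_x, training: False})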

rinuboney commented 8 years ago

I have updated the code. I'm testing it, but it takes too long on my laptop. It would be great if you could confirm that the code now produces the results presented in the paper.

rinuboney commented 8 years ago

Even after making the changes to variable initialization, learning rate, and batch norm, the accuracy doesn't improve past 99.29%. @mikowals, did you make any other changes?

Also, in the last line of the code you posted above, it's supposed to be

z = join(batch_normalization(z_pre_l, mean, var), batch_normalization(z_pre_u, mean, var))

mikowals commented 8 years ago

Looking at the code on master now: have you set validation_size = 0 in input_data.py so that all examples get used for training?
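
With the stock TensorFlow MNIST loader of that era it would be something like this (constant name assumed from that loader):

# input_data.py: use all 60k examples for training instead of
# holding some out as a validation set
VALIDATION_SIZE = 0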

I have fixed the error pointed out above and am rerunning the code that previously got 99.41% accuracy to see if it was some sort of accident. I will report back.

rinuboney commented 8 years ago

I hadn't set the validation set size to 0, but even after making the correction I get almost the same results. I'll verify it again. I found a difference in the update of running_mean and running_var in the original implementation. I thought the difference in results might be because of that, but if you are able to get 99.41% accuracy then obviously it isn't. Are you able to reproduce the result?

mikowals commented 8 years ago

The final accuracy was 99.33%. I got that result on 2 training runs.

If I put my typo back (batch_normalization(z_pre_l, m_l, mean, var)), the model does train to 99.41%. My interpretation is that that line of code should only impact the evaluation encoding of the first 100 examples. But after training I changed the code back to the corrected version and added a couple more training steps; the model continued to get 99.41% on the test results for a few steps. So somehow that change appears to have impacted the trained parameters.

I am lost as to why the clean, labelled path impacts training and why this implementation is not able to match the paper's results.

mikowals commented 8 years ago

I wonder if the remaining difference is the implementation of Adam in TensorFlow vs Blocks. Blocks has different defaults and also an extra decay term that is not available in TensorFlow. The accuracies don't really appear stable after 150 epochs and for me bounce in a range between 99.25 and 99.45 from about 50 epochs onwards.
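
For concreteness, the extra term is the per-step decay of beta1 from Algorithm 1 of the original Adam paper (Blocks exposes it as decay_factor). A plain-NumPy sketch of one update, with illustrative hyperparameters rather than Blocks' actual defaults:

import numpy as np

def adam_step(param, grad, m, v, t,
              lr=0.002, beta1=0.9, beta2=0.999,
              eps=1e-8, decay=1.0 - 1e-8):
  # beta1 is decayed towards zero over time; tf.train.AdamOptimizer
  # has no equivalent of this `decay` term
  beta1_t = beta1 * decay ** (t - 1)
  m = beta1_t * m + (1.0 - beta1_t) * grad    # first moment estimate
  v = beta2 * v + (1.0 - beta2) * grad ** 2   # second moment estimate
  m_hat = m / (1.0 - beta1 ** t)              # bias correction
  v_hat = v / (1.0 - beta2 ** t)
  return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v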

rinuboney commented 8 years ago

The reported error rate for the fully labelled setting is 0.608 ± 0.013, which means the 99.41% accuracy you obtained concurs with the results of the paper. When I run the code, the accuracy never goes above 99.29%. The first 99% appears after 70 epochs, and it bounces in a range between 99 and 99.29 after 100 epochs. I wonder what's different between our implementations.

rinuboney commented 8 years ago

So, without the typo, there is no significant difference in accuracy after distinguishing between training and testing? Actually, I didn't notice any separation between training and testing in the original implementation. I think I'll also try out the update method for running_mean and running_var used in the original implementation, rather than using tf.train.ExponentialMovingAverage.
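
In case it helps, a sketch of what a manual update could look like inside update_batch_normalization, in place of ewma.apply (the 0.99 momentum is an assumption, not necessarily what the original implementation uses):

# manual running-statistics update, replacing ewma.apply(...)
momentum = 0.99  # assumed; check the original implementation's value
update_mean = running_mean[l-1].assign(
    momentum * running_mean[l-1] + (1 - momentum) * mean)
update_var = running_var[l-1].assign(
    momentum * running_var[l-1] + (1 - momentum) * var)
bn_assigns.extend([update_mean, update_var])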

mikowals commented 8 years ago

Apparently using a placeholder as a conditional in a TensorFlow graph does not work with a simple Python if (see http://stackoverflow.com/a/35833133/728291). Using the placeholder with tf.cond, as done in batch normalization here, looks like the right way.
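
A minimal self-contained sketch of the difference (hypothetical shapes and stats):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 100])
training = tf.placeholder(tf.bool)
running_mean = tf.Variable(tf.zeros([100]), trainable=False)
running_var = tf.Variable(tf.ones([100]), trainable=False)
mean, var = tf.nn.moments(x, axes=[0])

# WRONG: a plain Python `if` is resolved once, at graph-construction
# time, so it cannot depend on a placeholder fed at run time
# z = (x - mean) / tf.sqrt(var + 1e-10) if training else ...

# RIGHT: tf.cond builds both branches and selects one per sess.run
z = tf.cond(training,
            lambda: (x - mean) / tf.sqrt(var + 1e-10),
            lambda: (x - running_mean) / tf.sqrt(running_var + 1e-10))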

rinuboney commented 8 years ago

I had that doubt earlier, but when I tried it out it seemed to work. Let me check again.

rinuboney commented 8 years ago

Yes, a simple if doesn't work. I've updated the code. Now I get a better error rate than before: 1.25%.