mdenil / dropout

A Theano implementation of Hinton's dropout.
MIT License

decrease total output, incl. the bias component #11

Closed · dgoldman-pdx closed this 9 years ago

dgoldman-pdx commented 9 years ago

In the "mean network", each unit's output must be decreased to compensate. That implies that both W and b, rather than only W, must be decreased -- right?

mdenil commented 9 years ago

Hi @dgoldman-ebay, thanks for your interest!

I think the code is correct without bias scaling for the following reasons:

  1. Caffe and pylearn2 implement dropout at test time by scaling the input to the layer whose inputs are being dropped out. This is equivalent to scaling the weights, but not the biases, in that layer.

     Caffe: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/dropout_layer.cpp#L40
     Pylearn2: https://github.com/lisa-lab/pylearn2/blob/master/pylearn2/models/mlp.py#L829 and https://github.com/lisa-lab/pylearn2/blob/master/pylearn2/models/mlp.py#L987

     (They actually do the inverse scaling at train time, but it's still equivalent.)

  2. Hinton's dropout paper (http://arxiv.org/pdf/1207.0580.pdf) mentions halving the outgoing weights of each unit on page 2. This is a bit ambiguous about what should be done with the biases; however, the same paragraph also mentions that scaling the weights is exactly equivalent to taking the geometric mean over all 2^N dropout masks in a network with a single hidden layer (or, equivalently, in logistic regression with dropout on the inputs). If you work out the geometric mean of those 2^N logistic regressions, you find that the biases are not scaled there (a small numerical check of this is sketched below).
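To make the second point concrete, here is a small numerical sketch (not part of the repo; the variable names and the 50% drop rate are only illustrative) checking that the normalized geometric mean over all 2^N dropout sub-networks of a single logistic unit matches the "mean network" that halves the weights but leaves the bias alone:

```python
# Sketch: compare the geometric mean of all 2^n dropout sub-networks of one
# logistic unit against the "mean network" with halved weights and unscaled bias.
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
n = 4                            # number of inputs; 2^n dropout masks to enumerate
x, w, b = rng.randn(n), rng.randn(n), rng.randn()

# Log-odds produced by each of the 2^n sub-networks (each input kept or dropped).
log_odds = [w.dot(np.array(mask) * x) + b
            for mask in itertools.product([0, 1], repeat=n)]

# Normalized geometric mean of the 2^n predictions = sigmoid of the mean log-odds.
geometric_mean = sigmoid(np.mean(log_odds))

# "Mean network": halve the weights, keep the bias unchanged.
mean_network = sigmoid(0.5 * w.dot(x) + b)

print(geometric_mean, mean_network)   # the two numbers agree to machine precision
```

The two numbers coincide exactly, which is why only W, and not b, gets rescaled.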
dgoldman-pdx commented 9 years ago

@mdenil you may indeed be right.

Looking at a newer dropout paper from Hinton's group (http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf), it still uses the phrase "outgoing weights". But Figure 2 helps explain this: rather than decreasing the outgoing activation value of each node (which would be equivalent to decreasing both W and b for that node), it instead decreases the weight of each outgoing connection between the node and the nodes of the next layer.

As I read your code, it looks to me like you're instead decreasing the weight of each incoming connection from the previous layer. Have I misread?

mdenil commented 9 years ago

The dropout layer takes the dropout rate to be applied to its output (dropout_rate[layer_counter+1]), but the scaling applied to W uses dropout_rate[layer_counter], which corresponds to the dropout rate applied to the output of the previous layer.

So yes, you are correct that the code scales each incoming connection from the previous layer, but it scales them using the dropout rate that was applied to the input of that layer.

This is kind of confusing because the code bundles weights with the activations above them, and Hinton's papers talk about weights and the activations below them, but I think they agree once we unpack the indexing.
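To make the indexing concrete, here is a minimal sketch (illustrative names only, not the repo's actual code) of that convention, assuming dropout_rates[i] is the probability of dropping a unit of layer i's input:

```python
# Sketch of the test-time ("mean network") rescaling described above.
# Assumption: dropout_rates[i] is the drop probability applied to layer i's input,
# i.e. to the previous layer's output.
import numpy as np

def mean_network_params(params, dropout_rates):
    """params is a list of (W, b) pairs, one per layer, from the dropout network."""
    scaled = []
    for layer_counter, (W, b) in enumerate(params):
        keep_prob = 1.0 - dropout_rates[layer_counter]  # keep prob. of this layer's input
        scaled.append((keep_prob * W, b))               # scale incoming W only; b untouched
    return scaled

# Example: 50% dropout on the raw inputs and on the hidden layer's output.
rng = np.random.RandomState(0)
params = [(rng.randn(784, 256), np.zeros(256)),
          (rng.randn(256, 10), np.zeros(10))]
test_params = mean_network_params(params, dropout_rates=[0.5, 0.5])
```

Scaling layer l's incoming W by the keep probability of layer l-1's output is the same operation as scaling layer l-1's outgoing weights, which is the wording used in the papers.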

dgoldman-pdx commented 9 years ago

Thanks, @mdenil. That all makes sense. I was misreading your code.

So I'll just take this opportunity to thank you for your work here ... and now I'll move along... :)