sphinxteam / Boltzmann.jl

Restricted Boltzmann Machines in Julia

Implement Dropout #1

Open · opened by eric-tramel 8 years ago

eric-tramel commented 8 years ago

Goal

One of the most effective recent regularisation techniques for training RBMs is dropout. Unfortunately, the original Boltzmann.jl package does not implement this technique, so we should implement it ourselves.

Technique

During the training phase of the RBM, each hidden node is present with only probability $p$. Training is performed on these reduced models and the resulting trained models are then combined. The pertinent section from (Srivastava et al., 2014) reads,

8.2 Learning Dropout RBMs

Learning algorithms developed for RBMs such as Contrastive Divergence (Hinton et al., 2006) can be directly applied for learning Dropout RBMs. The only difference is that r is first sampled and only the hidden units that are retained are used for training. Similar to dropout neural networks, a different r is sampled for each training case in every minibatch. In our experiments, we use CD-1 for training dropout RBMs.
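
Concretely, the variable r in the excerpt above is just a per-unit, per-case Bernoulli mask applied multiplicatively to the hidden activations. A minimal sketch of the idea (illustrative only, not package code; p, the sizes, and the activation matrix are made-up placeholders):

p = 0.5                                 # retention probability (illustrative value)
n_hidden, n_cases = 100, 64             # illustrative mini-batch dimensions
h = rand(n_hidden, n_cases)             # stand-in for the hidden unit activations
r = rand(n_hidden, n_cases) .< p        # r[j,k] == true  =>  hidden unit j is kept for case k
h_thinned = h .* r                      # dropped units contribute nothing for that case
# At test time the full model is used with the hidden weights scaled by p,
# which is how the exponentially many thinned models are "combined".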

References

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," JMLR, vol. 15, 2014, pp. 1929-1958.

eric-tramel commented 8 years ago

I forgot to reference this issue in commit d3c2caa9469e188161396c40af7b7d8d883a7b9d! I have created a separate branch to work on implementing this feature.

Modifications

For my first attempt at implementing this feature, I added an optional parameter to rbm.jl/fit() to allow the user to specify the dropout rate:

function fit(rbm::RBM, X::Mat{Float64};
             persistent=true, lr=0.1, n_iter=10, batch_size=100, n_gibbs=1, dorate=0.0)
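
With that keyword in place, a call could look like the following (illustrative only; the constructor name, data, and sizes are assumptions about the package API rather than verified code):

X = rand(784, 1000)                     # illustrative training data, one column per sample
rbm = BernoulliRBM(784, 256)            # assuming Boltzmann.jl's Bernoulli RBM constructor
fit(rbm, X; lr=0.1, n_iter=20, batch_size=100, dorate=0.5)   # drop each hidden unit with probability 0.5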

From here, I tried to take the approach quoted earlier (§8.2 of Srivastava et al., 2014) and apply a different dropout pattern for each training sample in the mini-batch. I accomplish this within rbm.jl/gibbs():

function gibbs(rbm::RBM, vis::Mat{Float64}; n_times=1, dorate=0.0)
    # One mask entry per hidden unit and per training case in the batch;
    # `true` marks a hidden unit that is dropped (suppressed) for that case.
    suppressedUnits = rand(size(rbm.hbias, 1), size(vis, 2)) .< dorate
    ...

I then modify rbm.jl/sample_visibles() (and the corresponding rbm.jl/vis_means()) to take this logical array marking the suppressed/dropped hidden units and to zero out those hidden units before computing the matrix-matrix product between rbm.W and the hidden activations:

function vis_means(rbm::RBM, hid::Mat{Float64}, suppressedUnits::Mat{Bool})    
    hid[suppressedUnits] = 0.0          # Suppress dropped hidden units
    p = rbm.W' * hid .+ rbm.vbias
    return logistic(p)
end

Taken together, this should accomplish dropout during training.
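
For reference, the corresponding sample_visibles() only needs to thread the mask through to vis_means() and then Bernoulli-sample the result. A sketch of what that might look like (not necessarily the exact code in the branch):

function sample_visibles(rbm::RBM, hid::Mat{Float64}, suppressedUnits::Mat{Bool})
    means = vis_means(rbm, hid, suppressedUnits)   # dropped hidden units already zeroed out
    return float(rand(size(means)...) .< means)    # Bernoulli sample of each visible unit
end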

Doubts

Now, what isn't clear to me is whether the dropout pattern should also change from epoch to epoch. The paper indicates that a new pattern is sampled for each training case in every mini-batch, but it doesn't say anything explicit about epochs. I am assuming that the pattern is resampled at every mini-batch computation. If anyone has references to other RBM dropout implementations, they might help clear up this issue.
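
To make that assumption explicit, here is a rough sketch (not the actual fit() loop in the branch) of where the resampling would sit: a fresh per-sample mask for every mini-batch, with nothing carried over between epochs:

n_samples = size(X, 2)
for epoch in 1:n_iter
    for start in 1:batch_size:n_samples
        batch = X[:, start:min(start + batch_size - 1, n_samples)]
        # fresh per-sample dropout mask, drawn anew for this mini-batch only
        suppressedUnits = rand(size(rbm.hbias, 1), size(batch, 2)) .< dorate
        # ... run CD-1 / gibbs() on `batch` with this mask and update rbm ...
    end
end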

eric-tramel commented 8 years ago

Okay, it had some issues (some bugs I introduced), but now it is building and passing the tests! I'll still need to write a dropout test to ensure that everything is really working correctly.

eric-tramel commented 8 years ago

Okay! It works! The issues I was having with the keyword arguments not being recognised were due to the workspace not being cleared before running the mnistexample_dropout.jl script. After clearing the workspace, it ran fine. What remains to be done is a comparison showing that this implementation of dropout really gives some advantage over no dropout.

Thanks @alaa-saade !

eric-tramel commented 8 years ago

So, it seems like there is still something to be desired in the dropout performance. Currently there does not seem to be much difference between the pseudo-likelihood obtained with dropout and the one obtained without it, as shown in the following figure:

[figure: dropout_trainingpl — training pseudo-likelihood with and without dropout]
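
For context, the pseudo-likelihood being plotted is presumably the usual single-flip proxy. A hedged sketch of that estimate (not necessarily the package's exact scoring code), reusing the logistic() helper from rbm.jl and the same vectorised style as the rest of that file:

function free_energy(rbm::RBM, vis::Mat{Float64})
    wx_b = rbm.W * vis .+ rbm.hbias                            # n_hidden x n_samples
    return -vec(vis' * rbm.vbias) - vec(sum(log(1 + exp(wx_b)), 1))
end

function pseudo_likelihood(rbm::RBM, vis::Mat{Float64})
    n_vis, n_samples = size(vis)
    flipped = copy(vis)
    for k in 1:n_samples
        i = rand(1:n_vis)                                      # flip one random visible unit per sample
        flipped[i, k] = 1.0 - flipped[i, k]
    end
    # per-sample proxy: n_vis * log sigmoid(F(flipped) - F(original))
    return n_vis * log(logistic(free_energy(rbm, flipped) - free_energy(rbm, vis)))
end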

I'm going to restructure where the dropout is enforced; I think that perhaps I'm not applying it in the right place. Referring to this Lua/Torch7 implementation, it seems that we need to make sure to suppress these units in the gradient update as well.
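
In other words (a sketch of the idea only, with hypothetical variable names, not the Torch code): the per-sample mask has to be applied to the positive- and negative-phase hidden statistics before the CD weight gradient is formed:

keep = 1.0 - suppressedUnits                 # 1.0 where a hidden unit is retained, 0.0 where dropped
h_pos_m = h_pos .* keep                      # positive-phase hidden means (hypothetical names)
h_neg_m = h_neg .* keep                      # negative-phase hidden means
dW = (h_pos_m * v_pos' - h_neg_m * v_neg') / size(v_pos, 2)   # CD-1 weight gradient; dropped units contribute zero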

krzakala commented 8 years ago

Interesting. But is it known that the effect of dropout can be seen in the pseudo-likelihood?

eric-tramel commented 8 years ago

@krzakala: I honestly don't know whether the effect can be seen in the PL or not; you could very well be right on this point. I'm working on a demo now which also reports the estimated features (W). I'll also include a histogram of the hidden activations, as was done in (Srivastava et al., 2014), to show the discrepancy between the two approaches.
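
For the activation histogram, something along these lines should do (a sketch only; X_test and the binning are placeholders):

h_means = logistic(rbm.W * X_test .+ rbm.hbias)   # mean hidden activations on held-out data
avg_act = vec(mean(h_means, 2))                   # average activation of each hidden unit
hist(avg_act, 0.0:0.05:1.0)                       # Base hist() in 0.4-era Julia; StatsBase's Histogram in newer versions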