mila-iqia / blocks

A Theano framework for building and training neural networks

Mechanism to disable dropout for testing/sampling #699

Open udibr opened 9 years ago

udibr commented 9 years ago

Using apply_dropout modifies the graph to include dropout with a fixed drop_prob. But when testing or generating samples you want to set drop_prob to zero, or remove the dropout code from the graph entirely. The only solution right now is to keep two graphs, one with dropout and one without, which is wasteful. (Torch has a global flag for this.)
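For concreteness, the two-graph workaround I am describing looks roughly like this (a sketch on a toy graph; the names are just illustrative):

import numpy
import theano
from theano import tensor
from blocks.graph import ComputationGraph, apply_dropout

# a toy graph standing in for a real model
x = tensor.matrix('x')
W = theano.shared(numpy.ones((5, 5), dtype=theano.config.floatX), name='W')
h = tensor.tanh(tensor.dot(x, W))
cost = h.sum()

cg = ComputationGraph(cost)
cg_dropout = apply_dropout(cg, [h], 0.5)  # a copy of the graph with dropout on h
cost_dropout = cg_dropout.outputs[0]

# two graphs means two compiled functions: cost_dropout for training,
# the original cost for testing/sampling
train_fn = theano.function([x], cost_dropout)
test_fn = theano.function([x], cost)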

bartvm commented 9 years ago

Why is keeping two graphs wasteful? Theano has to compile both graphs separately (unless you want to go really crazy and use ifelse everywhere, which would be a disaster), so you're going to need two separate computation graphs whatever you do. However, they're run entirely separately as well (one during training, the other during validation), so the only overhead you have is in compile time.

What works for Torch doesn't always work for Theano because their internals are quite different. In this case, because Theano compiles the entire graph and then runs it in one go, we can't just set a global flag that changes the behaviour of the entire graph; we need two graphs.

udibr commented 9 years ago

1) Instead of using ifelse, you can draw a uniform random number, keep drop_prob in a shared variable, and set it to 0 when testing (see the sketch below).
2) Having two graphs is a nice solution if you can use apply_dropout, but I have not figured out how to use it with scan inside recurrent, or in other cases where VariableFilter might fail.
3) How to communicate the different graphs to the different parts of the main loop is not clear from the documentation. If I set one graph in the model and another in GradientDescent, I get warning messages from the system.
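To spell out (1), I mean something along these lines (a sketch; the helper name and toy usage are just illustrative):

import numpy
import theano
from theano import tensor
from theano.sandbox.rng_mrg import MRG_RandomStreams

drop_prob = theano.shared(numpy.float32(0.5), name='drop_prob')
srng = MRG_RandomStreams(seed=1234)

def dropout(x):
    # keep each unit with probability 1 - drop_prob and rescale;
    # with drop_prob at 0 the mask is all ones (but still gets sampled)
    mask = srng.binomial(x.shape, p=1.0 - drop_prob, dtype=x.dtype)
    return x * mask / (1.0 - drop_prob)

x = tensor.matrix('x')
f = theano.function([x], dropout(tensor.tanh(x)))
# for testing/sampling: drop_prob.set_value(numpy.float32(0.0))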

bartvm commented 9 years ago

Using a drop_prob stored in a shared variable would be very wasteful, because you'd spend a lot of time sampling 1s with probability 1.

The fact that replace doesn't work inside scan is definitely a problem, but it's mostly a shortcoming of Theano (see also https://github.com/mila-udem/blocks/issues/16). Sadly, the Theano developers have said that it's not a priority for them (https://groups.google.com/forum/#!topic/theano-dev/Imx0zxKIqi8). If you really need it, I suggest you bug the Theano developers to try and convince them that this is an important use case!

The documentation for graph management and replacement is indeed a bit terse. We'll try and work on that. You can ignore the warnings, they are there in case people make accidental mistakes. In this case, it's to be expected that the costs are different. I think there's a case to be made for getting rid of these warnings (I get them all the time as well when I use dropout).
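For the record, the usual split is to give the dropout cost to the training algorithm and monitor the original cost, roughly like this (a sketch continuing the snippet in the issue description; keyword names may differ slightly between Blocks versions, and valid_stream stands in for your own data stream):

from blocks.algorithms import GradientDescent, Scale
from blocks.extensions.monitoring import DataStreamMonitoring

# optimize the dropout graph, monitor the clean graph on validation data
algorithm = GradientDescent(cost=cost_dropout,
                            parameters=cg_dropout.parameters,
                            step_rule=Scale(learning_rate=0.1))
monitoring = DataStreamMonitoring([cost], valid_stream, prefix='valid')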

udibr commented 9 years ago

Thanks! I read the link you gave and it looks like these are genuine problems in Theano. This pushes the discussion back to implementing dropout directly in the code and using ifelse to deactivate it (writing the entire code twice is not acceptable). Can you expand on why you think ifelse is a bad idea? Would it also cause unneeded random generation?

Doing unnecessary random generation is not too bad, because it will only happen during testing/sampling, which is usually much shorter than training.

Maybe the recurrent decorator could have a built-in mechanism to do this for all existing code that does not have its own dropout implementation.
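Something along these lines is what I have in mind for the ifelse route (a sketch; names are illustrative):

import numpy
import theano
from theano import tensor
from theano.ifelse import ifelse
from theano.sandbox.rng_mrg import MRG_RandomStreams

srng = MRG_RandomStreams(seed=1)
training = theano.shared(numpy.int8(1), name='training')

x = tensor.matrix('x')
h = tensor.tanh(x)
mask = srng.binomial(h.shape, p=0.5, dtype=h.dtype)
# take the dropout branch only while the training flag is set;
# ifelse evaluates lazily, unlike tensor.switch
h_out = ifelse(training, h * mask / numpy.float32(0.5), h)

f = theano.function([x], h_out)
# training.set_value(0) switches to the clean branch for testing/sampling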

bartvm commented 9 years ago

ifelse seems like a bad option for a variety of reasons: it can prevent optimizations from being applied (because no optimizations can be applied across the ifelse node), it can be slow (sometimes significantly so), and, as it recently turned out, it can give errors (https://github.com/Theano/Theano/pull/3001).

The assumption that validation/test time is much shorter isn't always true. For example, for machine translation a beam search is used to sample from the decoder, which means that many evaluations are made. Test sets of 6,000 sentences already take hours to decode, and a significant slowdown in that case would be unacceptable.

rizar commented 9 years ago

While the canonical way to "switch off" a regularization in Blocks is definitely to keep the original graph, I think that all regularization parameters should be shared variables, properly tagged so that they are findable, simply because there is always a chance that somebody will want to anneal them. This would effectively enable switching off dropout as @udibr proposes.

drop_prob, = VariableFilter(roles=[DROP_PROB])(cg.variables)
drop_prob.set_value(0.0)
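And for annealing, an extension along the lines of SharedVariableModifier could then adjust the variable during training (a sketch; the schedule is just illustrative):

import numpy
from blocks.extensions.training import SharedVariableModifier

# decay drop_prob a little after every batch (illustrative schedule)
anneal_dropout = SharedVariableModifier(
    drop_prob, lambda n_iterations, value: numpy.float32(value * 0.99))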
rizar commented 9 years ago

https://github.com/mila-udem/blocks/issues/701