kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Dropout for ephemeral connections #988

Closed danpovey closed 4 years ago

danpovey commented 8 years ago

I have an idea that I'd like help to implement in nnet3. I'm not sure who to ask to work on this.

It's related to dropout, but it's not normal dropout. With normal dropout we'd typically have a schedule where we start off with dropout (e.g. a dropout probability of 0.5) and gradually make the node more like a regular non-stochastic node (dropout probability of 0.0). In this idea we start off with not much dropout, and end up dropping out with probability 1.0, after which we delete the associated node.

In this idea, the dropout is to be applied to weight matrices for skip connections and (in RNNs) to time-delay connections with a bigger-than-one time delay. These are connections which won't exist in the final network but which only exist near the start of training. The idea is that allowing the network to depend on these skip connections early on in training will help it later learn that same information via the regular channels (e.g. in a regular DNN, through the non-skip connections; or in an RNN, through the regular recurrence). Imagine, for instance, that there is useful information to be had from 10 frames ago in an RNN. By giving the rest of the network a "taste" of this information, and then gradually reducing its availability, we encourage the network to find ways to get that information through the recurrent connections.

The dropout-proportion would most likely start at somewhere around 0.0 to 0.5 at the beginning of training and increase to 1.0 maybe a third of the way through the iterations. Once the dropout-proportion reaches 1.0, the corresponding weight matrix can be removed entirely, since it has no effect.
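For concreteness, here is a minimal sketch of that kind of schedule, assuming a simple linear ramp; the function name and the exact numbers are only illustrative (taken from the rough figures above), not a fixed design:

```c++
#include <algorithm>

// Illustrative (hypothetical) helper: map the fraction of training completed
// (0.0 .. 1.0) to a dropout proportion for the ephemeral connections.  It
// starts at initial_proportion and ramps linearly to 1.0 by
// saturation_fraction (e.g. a third of the way through the iterations),
// after which the dropped-out connection can be deleted, since it no longer
// has any effect.
float EphemeralDropoutProportion(float training_fraction,
                                 float initial_proportion = 0.5f,
                                 float saturation_fraction = 1.0f / 3.0f) {
  float t = std::min(training_fraction / saturation_fraction, 1.0f);
  return initial_proportion + t * (1.0f - initial_proportion);
}
```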

Implementing this will require implementing a DropoutComponent in nnet-simple-component.h. This will be a little similar to the dropout component in nnet2, except there will be no concept of a scale: it will scale its input by zero or one, applying scale==0 with a specified probability. Be careful with the flags returned by Properties()... the backprop needs both the input and output values to be present so that it can figure out the scaling factor [kBackpropNeedsInput|kBackpropNeedsOutput]. This component will have a function to set the dropout probability; you'll also declare in nnet-utils.h a function void SetDropoutProbability(BaseFloat dropout_prob, Nnet *nnet); and add a corresponding option to nnet3-copy, which calls this function, so that the dropout probability can be set from the command line.
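To see why the backprop needs both the input and the output, here is a minimal sketch of just the math, in plain C++ rather than the actual nnet3 Component interface and CuMatrix types (the function names are illustrative, not real Kaldi API):

```c++
#include <cstdlib>
#include <vector>

// Forward pass: multiply each input element by 0 (with probability
// dropout_prob) or by 1 (otherwise).  There is no 1/(1-p) scaling; the mask
// is just 0 or 1.
std::vector<float> DropoutForward(const std::vector<float> &in,
                                  float dropout_prob) {
  std::vector<float> out(in.size());
  for (size_t i = 0; i < in.size(); i++) {
    float mask =
        (std::rand() / static_cast<float>(RAND_MAX)) < dropout_prob ? 0.0f
                                                                    : 1.0f;
    out[i] = mask * in[i];
  }
  return out;
}

// Backprop: the mask is never stored; it is recovered as out/in elementwise,
// which is exactly why the component needs both the input and the output
// values (kBackpropNeedsInput | kBackpropNeedsOutput).  Where in == 0 the
// mask is ambiguous, and this sketch simply passes a zero derivative there.
std::vector<float> DropoutBackprop(const std::vector<float> &in,
                                   const std::vector<float> &out,
                                   const std::vector<float> &out_deriv) {
  std::vector<float> in_deriv(in.size());
  for (size_t i = 0; i < in.size(); i++)
    in_deriv[i] = (in[i] != 0.0f) ? out_deriv[i] * (out[i] / in[i]) : 0.0f;
  return in_deriv;
}
```

In the real component the same out/in trick would presumably be done on the GPU matrices directly, which is why no mask needs to be cached between the forward and backward passes.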

Removal of components once the dropout probability reaches 1.0 can be accomplished by nnet3-init with a suitable config. This can't actually remove the component yet [we'd need to implement config commands like delete-component <component-name> and delete-node <node-name>], but you can replace the node input descriptors in such a way as to 'orphan' the component and its node, so that it won't actually participate in the computation.

In order to facilitate the removal of nodes, it will be easiest if, instead of splicing together the dropped-out skip-connection with the regular time-spliced inputs, you use Sum(..) after the affine component, to sum together the output of the regular affine layer with the output of an affine layer whose input is the dropped-out skip-connection. That is, instead of modifying the regular TDNN layer which looks like:

component-node name=Tdnn_4_affine component=Tdnn_4_affine input=Append(Offset(Tdnn_3_renorm, -3), Offset(Tdnn_3_renorm, 3))
component-node name=Tdnn_4_relu component=Tdnn_4_relu input=Tdnn_4_affine
component-node name=Tdnn_4_renorm component=Tdnn_4_renorm input=Tdnn_4_relu
...

to look like:

component-node name=Tdnn_4_affine component=Tdnn_4_affine input=Append(Offset(Tdnn_3_renorm, -3), Offset(Tdnn_3_renorm, 3), Tdnn_2_dropout)
# note: Tdnn_2_dropout is the same as Tdnn_2_renorm but followed by dropout.
component-node name=Tdnn_4_relu component=Tdnn_4_relu input=Tdnn_4_affine
component-node name=Tdnn_4_renorm component=Tdnn_4_renorm input=Tdnn_4_relu
...

instead you should modify it to look like:

component-node name=Tdnn_4_affine component=Tdnn_4_affine input=Append(Offset(Tdnn_3_renorm, -3), Offset(Tdnn_3_renorm, 3))
component-node name=Tdnn_4_relu component=Tdnn_4_relu input=Sum(Tdnn_4_affine, Tdnn_2_dropout_affine)
component-node name=Tdnn_4_renorm component=Tdnn_4_renorm input=Tdnn_4_relu
# note: Tdnn_2_dropout_affine is the same as Tdnn_2_renorm but followed by dropout then an affine component.

Initially I wouldn't worry too much about making the scripts too nice, since this may not even work.

In what I've sketched out above, I've assumed that the dropout precedes the affine component. In fact, it might be better to have the dropout follow the affine component. The reason relates to the bias term in the affine component: because of the bias, discarding the dropout path won't give a network that's equivalent to the original network with dropout probability 1.0. And we don't have a convenient way to get rid of the bias term (there is no NaturalGradientLinearComponent implemented). This problem disappears if the dropout comes after the affine component. If this turns out to be useful we can figure out how to solve this in a more elegant way later on.
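To make the bias point concrete, a tiny self-contained illustration with made-up scalar values (not Kaldi code):

```c++
#include <cstdio>

int main() {
  // A one-dimensional "affine component": y = w * x + b.
  float w = 2.0f, b = 0.5f, x = 3.0f;

  // Dropout BEFORE the affine, at dropout probability 1.0: the input is
  // zeroed but the bias still leaks through, so the extra path contributes b.
  float before = w * (0.0f * x) + b;   // == 0.5, not 0

  // Dropout AFTER the affine, at dropout probability 1.0: the whole affine
  // output, including the bias, is zeroed, so the path contributes nothing
  // and deleting it leaves the network unchanged.
  float after = 0.0f * (w * x + b);    // == 0.0

  std::printf("dropout before affine: %g, dropout after affine: %g\n",
              before, after);
  return 0;
}
```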

danpovey commented 8 years ago

I am becoming more convinced that this method should be quite useful, and I'd really like someone (or people) to help with this. @galv, is there any chance you could take on the problem of creating the dropout component for nnet3? It will just be a simplification of the nnet2 code, for the most part. Then maybe @freewym can do the modification of the scripts and test this out.

galv commented 8 years ago

It seems reasonable. I would have to think more about whether it could be implemented quickly on GPU (it probably can be, in spite of caching values in back propagation for a long time).

Overall, though, my main concern is that it strikes me that we could get easy gains by adopting other innovations in the neural network literature like batch norm, residual networks, and persistent RNNs (don't quite understand persistent RNNs yet, though; but a 30 times speed up would make experimenting with RNNs a lot easier).

danpovey commented 8 years ago

> It seems reasonable. I would have to think more about whether it could be implemented quickly on GPU (it probably can be, in spite of caching values in back propagation for a long time).

It's already been done in nnet2, it's quite easy, no caching is needed-- the component just requires both the inputs and outputs, and can figure out the mask from that.

> Overall, though, my main concern is that it strikes me that we could get easy gains by adopting other innovations in the neural network literature like batch norm, residual networks, and persistent RNNs (don't quite understand persistent RNNs yet, though; but a 30 times speed up would make experimenting with RNNs a lot easier).

The natural gradient has a similar effect to batch norm, which is why I have not put effort into implementing that. The other things are not things that I have heard about in a speech context.

Dan

danpovey commented 8 years ago

I had a look into ResNets... it's where, every other layer, you have a skip connection to the layer before, with the identity matrix. That could very easily be incorporated into our TDNN setups, using the 'Sum()' expressions in the config file. It's some very simple scripting. @freewym, I don't know if you have time for this? In the TDNN configs, we have lines like:

component-node name=Tdnn_5_relu component=Tdnn_5_relu input=Tdnn_5_affine

and you could easily change this to:

component-node name=Tdnn_5_relu component=Tdnn_5_relu input=Sum(Tdnn_5_affine, Tdnn_3_affine)

[and, say, only do this for odd-numbered components]. Of course, there are many other ways to do this, but this seems closest in spirit to the way it was originally done. Anyway it's a very low-cost experiment.

Dan


freewym commented 8 years ago

As far as I know, ResNets are especially beneficial for training very deep networks (hundreds of layers or more). I'm not sure if it would improve over the current TDNN setup, but anyway, I can pick it up.

Yiming

danpovey commented 8 years ago

Regarding ResNets (and this is also of broader interest), Vijay just pointed out to me this arXiv paper from Microsoft Research http://arxiv.org/pdf/1609.03528v1.pdf where they report the best-ever results on Switchboard, at 6.3%. A variant of ResNet is one of their systems, although it seems to involve convolutional concepts-- it's maybe not a standard feed-forward ResNet (but neither would ours be).

They are also using lattice-free MMI, and they cite our lattice-free MMI paper that I just presented at Interspeech... however, it is probably something they were doing already, as Geoff had implemented lattice-free MMI before at IBM; and it's on a conventional 10ms frame rate.


pegahgh commented 8 years ago

Hi Dan. This idea in ResNet is exactly the same as my LADNN paper at MSR. Actually we proposed and submitted this idea to ICASSP before the ResNet publication, but they didn't cite our work, although it was internal MSR work and they should have cited our paper. This is the reason MSR people decided to put their papers on arXiv right after submitting them to conferences -- and they submitted their own version of LF-MMI to ICASSP on Monday!! Since it is close to what I did before, I'd like to pick up this issue if no one has already started working on it!

danpovey commented 8 years ago

OK sure. Yiming has tons of stuff to do anyway, I think.


freewym commented 8 years ago

I am already working on it. Perhaps Pegah could try it on other configurations.

Yiming


danpovey commented 8 years ago

Oh OK. Pegah, if you're interested in this area, maybe you could do that thing about the "ephemeral connections", e.g. implement the dropout component in nnet3. But I want to merge the dropout component before you do the experiments, because otherwise the project will be too big to easily review. Dan


pegahgh commented 8 years ago

Good idea! I can implement the Dropout component in the nnet3 setup. @freewym Did you start using bypass connections on the LSTM setup?

freewym commented 8 years ago

Not yet. I am testing it on tdnn now.


galv commented 8 years ago

@pegahgh Since I've given this some thought already, feel free to ping me for code review or discussion of implementation.


danpovey commented 8 years ago

Makes sense. Pegah, if you put an early draft of the nnet3 dropout stuff up as a pull request, it will be fastest.


danpovey commented 8 years ago

Regarding the ephemeral connections, I just noticed that in this paper https://arxiv.org/pdf/1510.08983.pdf about highway LSTMs, they start with dropout on these connections at 0.1 and increase it to 0.8-- so it's a bit like the proposed ephemeral-connections idea (except they don't increase the dropout all the way to 1 and then remove the component). Anyway, to me it confirms that there is something to the idea.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.