Hi, thank you for your interest in this old repo! However, I don't think I fully understand your concern: AFAIK the PyTorch BatchNorm module automatically normalizes the input using the current batch's statistics in training mode, and uses the accumulated running means and variances in inference mode (i.e. when a user switches to inference mode via model.eval() or model.train(False)). bn_hh and bn_ih internally use the BatchNorm1d module, and I think they automatically compute different means and variances for each batch over multiple forward passes.
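For reference, here is a minimal sketch of the behavior I mean (assuming a reasonably recent PyTorch; the feature and batch sizes are arbitrary):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)

# Training mode: the current batch's statistics are used for normalization,
# and running_mean / running_var are updated as moving averages.
bn.train()
_ = bn(torch.randn(8, 4))
print(bn.running_mean)  # changed by the forward pass above

# Inference mode: the accumulated running statistics are used instead,
# and the buffers are left untouched.
bn.eval()
_ = bn(torch.randn(8, 4))
print(bn.running_mean)  # unchanged by the eval-mode forward
```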
Please let me know if I got your issue wrong! I implemented this code back when I didn't know much about PyTorch, so I'm feeling a bit embarrassed by the recent interest from many users!
Hi,
Thanks for your reply. Yes, in training mode it keeps updating the running_mean and running_var. The key point is that each PyTorch BN module has only one pair of running_mean and running_var, whereas the paper says that different pairs should be used at different time steps. That is, for the same batchnorm module, you need to use the first pair at the first time step, the second pair at the second time step, and so on. Here is what the paper says:
" The batch normalization transform relies on batch statistics to standardize the LSTM activations. It would seem natural to share the statistics that are used for normalization across time, just as recurrent neural networks share their parameters over time. However, we find that simply averaging statistics over time severely degrades performance. Although LSTM activations do converge to a stationary distribution, we observe that their statistics during the initial transient differ significantly (see Figure 5 in Appendix A). Consequently, we recommend using separate statistics for each timestep to preserve information of the initial transient phase in the activations. "
In fact, according to this paper, there are two tricks for applying batchnorm to an RNN. One is the initialization of gamma, and the other is keeping separate running_mean and running_var statistics for each time step of the RNN. Your implementation only covers the first trick. I hope this is clear.
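To make the second trick concrete, here is a rough sketch of what I mean (just an illustration, not code from your repo): one affine-free BatchNorm1d per time step, so each step keeps its own running statistics, while gamma and beta are shared across time and gamma starts at a small value like 0.1 as in the paper.

```python
import torch
import torch.nn as nn

class PerStepBatchNorm(nn.Module):
    """Illustrative only: separate running statistics for every time step,
    with gamma/beta shared across time (the paper's two tricks combined)."""

    def __init__(self, num_features, max_length, gamma_init=0.1):
        super().__init__()
        self.max_length = max_length
        # affine=False -> each BatchNorm1d only holds its running buffers.
        self.bns = nn.ModuleList(
            [nn.BatchNorm1d(num_features, affine=False)
             for _ in range(max_length)]
        )
        # Shared affine parameters; gamma starts small per the paper.
        self.weight = nn.Parameter(torch.full((num_features,), gamma_init))
        self.bias = nn.Parameter(torch.zeros(num_features))

    def forward(self, x, time):
        # Steps beyond max_length reuse the statistics of the last step.
        time = min(time, self.max_length - 1)
        return self.bns[time](x) * self.weight + self.bias
```

Inside the LSTM cell, the normalization call would then need the step index, e.g. something like self.bn_hh(wh, time=t) instead of self.bn_hh(wh).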
Oh, I now remember that this feature was on my todo list when I was implementing the code, and then I totally forgot about it. :( I will fix it once I have some time, but if you have already fixed the problem, could you please send a PR? :)
Hi,
Sorry, I don't have an implementation. I read your code the other day because I was interested in how you implemented that trick. :)
The internal implementation of the BatchNorm module in PyTorch is quite simple: it keeps the running mean and variance as registered buffers. So I think separating the running statistics per timestep can be done simply by storing multiple buffers, each corresponding to the statistics of one timestep, right?
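For example (again only a sketch of the idea, not a claim about how your repo should do it), the buffer version could look roughly like this, using F.batch_norm with the buffer pair selected by time step:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiBufferBatchNorm(nn.Module):
    """Sketch of the buffer idea: one (running_mean, running_var) pair
    registered per time step, normalization done via F.batch_norm."""

    def __init__(self, num_features, max_length, momentum=0.1, eps=1e-5):
        super().__init__()
        self.max_length = max_length
        self.momentum = momentum
        self.eps = eps
        self.weight = nn.Parameter(torch.full((num_features,), 0.1))
        self.bias = nn.Parameter(torch.zeros(num_features))
        for t in range(max_length):
            self.register_buffer('running_mean_{}'.format(t),
                                 torch.zeros(num_features))
            self.register_buffer('running_var_{}'.format(t),
                                 torch.ones(num_features))

    def forward(self, x, time):
        # Steps beyond max_length reuse the last tracked statistics.
        time = min(time, self.max_length - 1)
        running_mean = getattr(self, 'running_mean_{}'.format(time))
        running_var = getattr(self, 'running_var_{}'.format(time))
        # In training mode F.batch_norm updates the selected buffers in place.
        return F.batch_norm(x, running_mean, running_var,
                            self.weight, self.bias,
                            training=self.training,
                            momentum=self.momentum, eps=self.eps)
```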
I'm closing this issue with commit b6923d08e, but I haven't tested it much. Please take a look and give me feedback if anything is wrong.
Thanks for sharing the code. I have read it, but I still couldn't find the code that handles the per-timestep batch statistics. It seems you just call
```python
bn_wh = self.bn_hh(wh)
bn_wi = self.bn_ih(wi)
```
on each forward pass. I guess the PyTorch batchnorm module doesn't automatically compute different means and variances for multiple forward passes. Could you explain this point? Thanks.
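For example, when I try something like the following (on a recent PyTorch), every training-mode forward just updates the same single pair of buffers in place; I don't see any per-call statistics being kept:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
bn.train()

# Repeated forwards keep overwriting the one and only buffer pair
# as an exponential moving average; no per-call statistics appear.
for step in range(3):
    _ = bn(torch.randn(8, 4))
    print(step, bn.running_mean)

# Only one pair of running statistics lives on the module.
print([name for name, _ in bn.named_buffers()])
```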