mravanelli / pytorch-kaldi

pytorch-kaldi is a project for developing state-of-the-art DNN/RNN hybrid speech recognition systems. The DNN part is managed by pytorch, while feature extraction, label computation, and decoding are performed with the kaldi toolkit.

Do bidirectional layers share the input-to-hidden weights? #240

Closed: timolohrenz closed this issue 4 years ago

timolohrenz commented 4 years ago

First of all, I am really thankful for your LSTM implementation, which is, as far as I know, the only non-cuDNN-based one that lets me use customizations such as static dropout masks. However, I think there might be an issue.

As I understand the implementation of the bidirectional LSTM, the additional backward direction is handled by flipping the input in time and concatenating it to the input tensor along the batch dimension at this point: https://github.com/mravanelli/pytorch-kaldi/blob/775f5dbbf142fb1c1a56604ee603d426ca73a51f/neural_networks.py#L415-L417

Later in the forward pass, the input tensor x is then passed through all four input-to-hidden weight matrices: https://github.com/mravanelli/pytorch-kaldi/blob/775f5dbbf142fb1c1a56604ee603d426ca73a51f/neural_networks.py#L431-L435
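
To illustrate the pattern I mean, here is a minimal, hypothetical sketch (assumed shapes and names, not the actual code from neural_networks.py): the input is flipped in time, concatenated along the batch dimension, and then passed through a single input-to-hidden projection, so both directions necessarily see the same input weights.

    import torch
    import torch.nn as nn

    # Hypothetical sketch of the shared input-to-hidden pattern described above.
    # Assumed layout: x is [time, batch, features], with 512 features as in my setup.
    time_steps, batch, feat, hidden = 200, 8, 512, 512
    x = torch.randn(time_steps, batch, feat)

    # Backward direction: flip the sequence in time and stack it along the batch axis.
    x_both = torch.cat([x, torch.flip(x, dims=[0])], dim=1)   # [time, 2*batch, feat]

    # One shared input-to-hidden projection (e.g. the forget-gate input weights).
    # Because both directions live in the same tensor, they share this weight matrix.
    wfx = nn.Linear(feat, hidden)
    out = wfx(x_both)   # the same weights are applied to the forward and backward copies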

Doesn't that mean that exactly the same weight matrices are applied to both directions? I am a bit suspicious because torchsummary shows only four weight matrices for the input-to-hidden connections, while it shows eight weight matrices for the hidden-to-hidden connections (I am using a layer size of 512 in the LSTM, so 512 * 513 = 262,656).

   Layer (type)               Output Shape         Param #
      _LSTM-159             [-1, 200, 512]               0
    Linear-160             [-1, 400, 512]         262,656
     Linear-161             [-1, 400, 512]         262,656
     Linear-162             [-1, 400, 512]         262,656
     Linear-163             [-1, 400, 512]         262,656
     Linear-164                  [-1, 512]         262,144
     Linear-165                  [-1, 512]         262,144
     Linear-166                  [-1, 512]         262,144
     Linear-167                  [-1, 512]         262,144
       Tanh-168                  [-1, 512]               0
       Tanh-169                  [-1, 512]               0
     Linear-170                  [-1, 512]         262,144
     Linear-171                  [-1, 512]         262,144
     Linear-172                  [-1, 512]         262,144
     Linear-173                  [-1, 512]         262,144
       Tanh-174                  [-1, 512]               0
       Tanh-175                  [-1, 512]               0

This also means that the number of parameters does not double when using a bidirectional LSTM, as mentioned in Issue #214: https://github.com/mravanelli/pytorch-kaldi/issues/214
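
Just to make the counting explicit, here is a quick sanity check of the numbers in the summary above (assuming a 512-dimensional input, 512 hidden units, and a bias on the input-to-hidden layers but not on the hidden-to-hidden ones, which would explain the two different values):

    # Quick check of the parameter counts reported by torchsummary above
    # (assumed: 512-dim input, 512 hidden units, bias only on input-to-hidden).
    feat, hidden = 512, 512

    input_to_hidden = hidden * feat + hidden    # 512 * 513 = 262,656
    hidden_to_hidden = hidden * hidden          # 512 * 512 = 262,144

    print(input_to_hidden)     # 262656 -> the four Linear layers at the top
    print(hidden_to_hidden)    # 262144 -> the eight remaining Linear layers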

Is this intentional behavior or am I getting something wrong?

Thanks for your help and the good work!

mravanelli commented 4 years ago

Yes, this is done on purpose. This way you save a lot of parameters, generalize better, and (at least on the tasks we have considered so far) improve performance. Best,

Mirco
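
For a rough sense of the savings this implies, here is a back-of-the-envelope sketch (assuming the 512-dimensional layer from the summary above and four gates, matching the four 262,656-parameter Linear layers); sharing the input-to-hidden weights across the two directions halves that part of the layer:

    # Rough estimate of the savings from sharing the input-to-hidden weights
    # across directions (assumed: 512-dim input, 512 hidden units, 4 gates,
    # matching the four 262,656-parameter Linear layers in the summary).
    feat, hidden, gates = 512, 512, 4

    per_gate = hidden * feat + hidden    # 262,656 weights + bias per gate
    tied = gates * per_gate              # 1,050,624 with shared directions
    untied = 2 * gates * per_gate        # 2,101,248 with one set per direction

    print(untied - tied)                 # about 1.05M parameters saved per layer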

timolohrenz commented 4 years ago

Hey Mirco,

Okay, thanks for pointing that out. Interesting!

Best regards, Timo