maxjcohen / transformer

Implementation of the Transformer model (originally from "Attention Is All You Need") applied to time series.
https://timeseriestransformer.readthedocs.io/en/latest/
GNU General Public License v3.0
842 stars 165 forks

The output of my Transformer is the same for all time steps #40

Closed shamoons closed 3 years ago

shamoons commented 3 years ago

My model is:


from tst import Transformer
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    def __init__(self, d_inp, d_out):
        super().__init__()
        self.encoder1 = nn.Linear(d_inp, d_out)
        self.encoder2 = nn.Linear(d_out, d_out)

    def forward(self, inp):
        out = torch.relu(self.encoder1(inp))
        out = torch.relu(self.encoder2(out))
        return out

class AudioDecoder(nn.Module):
    def __init__(self, d_inp, d_out):
        super().__init__()
        self.decoder1 = nn.Linear(d_inp, d_out)
        self.decoder2 = nn.Linear(d_out, d_out)

    def forward(self, inp):
        out = torch.relu(self.decoder1(inp))
        out = torch.tanh(self.decoder2(out))
        return out

class AudioReconstructor(nn.Module):
    def __init__(self, d_input, d_output, d_model, N, q, v, h, chunk_mode, pe, dim_embedding):
        super().__init__()

        self.transformer = Transformer(
            d_input=dim_embedding, d_output=dim_embedding, d_model=d_model,
            N=N, q=q, v=v, h=h, chunk_mode=chunk_mode, pe=pe, pe_period=800)

        self.audio_encoder = AudioEncoder(d_inp=d_input, d_out=dim_embedding)
        self.audio_decoder = AudioDecoder(d_inp=dim_embedding, d_out=d_output)

    def forward(self, src):
        # First, embed the raw audio chunks
        out = self.audio_encoder(src)

        # Second, do the transformer operation
        out = self.transformer(out)
        out = torch.relu(out)

        # Finally, project back to raw audio
        out = self.audio_decoder(out)
        return out
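As a quick sanity check of the encoder half alone (the dimensions here are hypothetical stand-ins, not values from the issue), two linear layers applied to a batch of chunked audio preserve the batch and time dimensions and only change the feature dimension:

```python
import torch
import torch.nn as nn

# hypothetical dimensions: 800-sample chunks embedded into 64-dim vectors
enc = nn.Sequential(nn.Linear(800, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
x = torch.randn(4, 39, 800)  # (batch, time steps, chunk size)
z = enc(x)
print(z.shape)  # torch.Size([4, 39, 64])
```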

If I print the values after self.transformer(out), I get the same values for each time step. Any ideas why that might be?
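One way to confirm the collapse numerically, rather than by eyeballing printed values, is to check the standard deviation across the time dimension (sketch with dummy data standing in for the transformer output):

```python
import torch

# stand-in for a transformer output of shape (batch, time, features)
out = torch.randn(2, 10, 8)

# std across the time dimension: values near zero for every feature
# would mean the model emits the same vector at every time step
std_over_time = out.std(dim=1)
collapsed = bool((std_over_time < 1e-6).all())
print(collapsed)  # False for random data; True if the output has collapsed
```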

maxjcohen commented 3 years ago

Hi, I can't say for sure what is causing this issue, but you could take a look at the attention maps and the latent vector of the Transformer to see if anything looks suspicious.

This may be unrelated, but I don't understand why you are defining an audio encoder and decoder in addition to the Transformer, which is itself an encoder-decoder architecture. Why not simply use the Transformer as your AudioReconstructor?

shamoons commented 3 years ago

Thanks for the input. I think I saw something in the docs about the attention map. The purpose of the encoder/decoder linear layers is to learn a vectorized representation of the raw audio signal (1-D) and then to map the vector output back to raw audio. I suppose I could also feed the raw audio (after unfolding it into chunks) straight to the Transformer?
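The unfolding mentioned here can be done with `Tensor.unfold`, which slices a raw 1-D signal into overlapping chunks shaped for the Transformer's `(batch, time, features)` input (the sample count, window size, and hop below are illustrative assumptions):

```python
import torch

# hypothetical batch of raw 1-D audio: 4 clips of 16000 samples each
x = torch.randn(4, 16000)

# sliding windows of 800 samples with a hop of 400
# -> (batch, time steps, chunk size)
chunks = x.unfold(1, 800, 400)
print(chunks.shape)  # torch.Size([4, 39, 800])
```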

shamoons commented 3 years ago

> Hi, I can't say for sure what is causing this issue, but you could take a look at the attention maps and the latent vector of the Transformer to see if anything looks suspicious.
>
> This may be unrelated, but I don't understand why you are defining an audio encoder and decoder in addition to the Transformer, which is itself an encoder-decoder architecture. Why not simply use the Transformer as your AudioReconstructor?

[Image: attention map of the first layer ("attn 0")]

That's what my attention map of the first layer looks like. It doesn't seem to be learning much of anything.

This is, however, without the linear layers before and after. I'll try adding them back and see if anything changes.

LIngerwsk commented 3 years ago

Hi, I think I ran into the same problem. When I removed the residual block after each sublayer, the output became the same for all timesteps. I've looked through your code and the residual blocks are there, so maybe you can check them. In any case, I'm also wondering why the output is the same for all steps when the residual block is removed.

shamoons commented 3 years ago

> Hi, I think I ran into the same problem. When I removed the residual block after each sublayer, the output became the same for all timesteps. I've looked through your code and the residual blocks are there, so maybe you can check them. In any case, I'm also wondering why the output is the same for all steps when the residual block is removed.

I'm not sure I follow - we don't have options to set residual blocks, do we?

maxjcohen commented 3 years ago

Residual blocks are currently hard-coded, but should be easy to remove, although I'm not sure why you would want to. Your attention map looks odd, but at least not all values are equal.
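For reference, a residual sublayer of the kind discussed here looks like the generic sketch below (this is an illustration, not the repository's exact code; `ResidualSublayer` is a hypothetical name). Dropping the `x +` term removes the skip path, which is what LIngerwsk observed can collapse the output to the same vector at every time step:

```python
import torch
import torch.nn as nn

class ResidualSublayer(nn.Module):
    """Post-norm residual wrapper: y = LayerNorm(x + sublayer(x))."""
    def __init__(self, sublayer, d_model):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # removing `x +` here disables the residual connection
        return self.norm(x + self.sublayer(x))

block = ResidualSublayer(nn.Linear(16, 16), d_model=16)
y = block(torch.randn(2, 5, 16))
print(y.shape)  # torch.Size([2, 5, 16])
```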