tatp22 / linformer-pytorch

My take on a practical implementation of Linformer for Pytorch.
https://arxiv.org/pdf/2006.04768.pdf
MIT License

Loss goes to 0 when using LinformerLM #25

Closed terencenwz closed 3 years ago

terencenwz commented 3 years ago

Hi, I used the LinformerLM class with causal=True to do some language modelling. However, there seems to be some leakage, as the loss goes to 0 after 1 epoch. Or am I using it wrongly? Thank you.

These are my settings

model = LinformerLM(
        num_tokens=ntoken, # Number of tokens in the LM
        input_size=args.seq_len, # Dimension 1 of the input
        channels=args.embsize, # Dimension 2 of the input
        dim_d=None, # Overwrites the inner dim of the attention heads. If None, sticks with the recommended channels // nhead, as in the "Attention is all you need" paper
        dim_k=16, # The second dimension of the P_bar matrix from the paper
        dim_ff=args.nhid, # Dimension in the feed forward network
        dropout_ff=args.dropout, # Dropout for feed forward network
        nhead=8, # Number of attention heads
        depth=12, # How many times to run the model
        dropout=args.dropout, # How much dropout to apply to P_bar after softmax
        activation="relu", # What activation to use. Currently, only gelu and relu supported, and only on ff network.
        checkpoint_level="C0", # What checkpoint level to use. For more information, see below.
        parameter_sharing="none", # What level of parameter sharing to use. For more information, see below.
        k_reduce_by_layer=0, # Going down `depth`, how much to reduce `dim_k` by, for the `E` and `F` matrices. Will have a minimum value of 1.
        full_attention=False, # Use full attention instead, for O(n^2) time and space complexity. Included here just for comparison
        include_ff=True, # Whether or not to include the Feed Forward layer
        w_o_intermediate_dim=None, # If not None, have 2 w_o matrices, such that instead of `dim*nhead,channels`, you have `dim*nhead,w_o_int`, and `w_o_int,channels`
        emb_dim=None, # If you want the embedding dimension to be different than the channels for the Linformer
        causal=True, # If you want this to be a causal Linformer, where the upper right of the P_bar matrix is masked out.
        method="learnable", # The method of how to perform the projection. Supported methods are 'convolution', 'learnable', and 'no_params'
        ff_intermediate=None, # See the section below for more information
        )
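For context, the training step is the usual next-token setup; a minimal sketch of what I am running is below (the sizes and data here are placeholders, not my actual script):

import torch
import torch.nn.functional as F
from linformer_pytorch import LinformerLM

# Placeholder config (much smaller than the real one above), just to show the call pattern
model = LinformerLM(num_tokens=30000, input_size=512, channels=128,
                    dim_k=16, nhead=8, depth=2, causal=True)

tokens = torch.randint(0, 30000, (4, 512))   # (batch, seq_len) of token ids
logits = model(tokens)                       # (batch, seq_len, num_tokens)

# Predict token t+1 from positions <= t: shift logits and targets by one
loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                       tokens[:, 1:].reshape(-1))
loss.backward()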
tatp22 commented 3 years ago

Hi @terencenwz!

I see that you have opened an issue about a similar problem here: https://github.com/lucidrains/linear-attention-transformer/issues/6. So this might be related to the way you are running your tests, although I am not sure; the loss should not go down to 0 right away...

But on another note, there are some caveats you should be aware of when using the Linformer for causal language modeling; the sketch below illustrates the main one. Check out #15 and #16 for more information about this.
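Roughly, the E and F projections mix along the sequence dimension before the attention scores are formed, so masking the upper right of P_bar alone is not enough. A toy illustration of that mixing (not the library's code):

import torch

n, d, k = 8, 4, 2
K = torch.randn(n, d)        # keys, one row per sequence position
E = torch.randn(k, n)        # projection over the sequence axis (n -> k)
K_proj = E @ K               # (k, d): every projected row mixes ALL n positions

# Perturb only the last ("future") position: the projected keys still change,
# so queries at earlier positions can be affected by future tokens
K2 = K.clone()
K2[-1] += 1.0
print(torch.allclose(E @ K2, K_proj))   # False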

terencenwz commented 3 years ago

Hi, thanks for the reply. I believe the linear-attention-transformer issue is a slightly different problem, as there the loss goes to infinity instead of 0. I have run quite a number of different transformer variants, including the original model, and got comparable losses.

I think the problem might be explained by https://github.com/tatp22/linformer-pytorch/issues/16#issuecomment-733525011, where there is some leakage of future information. The loss did not go down to 0 right away; it took slightly more than 1 epoch (around 30k update steps). A quick way to check for this kind of leakage is sketched below.
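A hypothetical check: change only future tokens and see whether the logits at earlier positions move; with a truly causal model they should not (the config here is a small placeholder):

import torch
from linformer_pytorch import LinformerLM

torch.manual_seed(0)
model = LinformerLM(num_tokens=1000, input_size=64, channels=32,
                    dim_k=16, nhead=4, depth=2, causal=True).eval()

x = torch.randint(0, 1000, (1, 64))
y = x.clone()
y[0, 40:] = torch.randint(0, 1000, (24,))   # alter only positions 40..63

with torch.no_grad():
    unchanged = torch.allclose(model(x)[0, :40], model(y)[0, :40], atol=1e-5)
print(unchanged)   # False would mean future tokens influence earlier positions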