tatp22 / linformer-pytorch

My take on a practical implementation of Linformer for Pytorch.
https://arxiv.org/pdf/2006.04768.pdf
MIT License

padding mask and attention mask #11

Closed zackchen-lb closed 3 years ago

zackchen-lb commented 4 years ago

Hi there,

You've done a great job and thanks for the sharing. I'm wondering how you deal with the masking stuff in Linformer since the attention shape, key and value shape have now changed to (n, k) instead of (n, n). I didn't find these in the code. Thanks for your time!

tatp22 commented 4 years ago

Hi @ZEKAICHEN,

I'm not sure I completely understand the question, so I will break it up into two parts, hopefully one of them will answer your question:

If you are looking for a way to pad your inputs, check out my Padder class, which can be found here. Unfortunately, I have not had time to test it on the LinformerLM module, but I will be doing that in the upcoming days, hopefully.
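If it helps, the idea behind the Padder is roughly the following (a generic sketch of the concept, not the actual class): pad the batch up to the model's fixed input_size along the sequence dimension, run the model, and slice the output back to the original length.

```python
import torch
import torch.nn.functional as F

def pad_run_unpad(model, x, input_size):
    # Illustration of the padding idea (not the actual Padder class):
    # pad a (batch, seq, dim) input up to input_size along the sequence
    # dimension, run the model, then slice back to the original length.
    orig_len = x.shape[1]
    x = F.pad(x, (0, 0, 0, input_size - orig_len))
    out = model(x)
    return out[:, :orig_len, :]

# Toy usage with an identity "model" standing in for a Linformer
x = torch.randn(2, 400, 64)  # sequence shorter than input_size=512
y = pad_run_unpad(torch.nn.Identity(), x, input_size=512)
print(y.shape)  # torch.Size([2, 400, 64])
```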

As for masking, that's the next thing that I want to get done off my bucket list. Look for it in a coming update.

If this is not what you wanted answered, let me know! I'll try to help as much as I can.

zackchen-lb commented 4 years ago

Hi @tatp22 ,

Thanks for the feedback. I was exactly talking about the masking. Thanks for letting me know.

tatp22 commented 4 years ago

Hi @ZEKAICHEN,

Actually, what you said is kind of a dilemma, since I cannot mask it in the traditional sense (the traditional sense being exactly what you described before). Therefore, my idea was to just mask the Q, K, and V matrices directly, similar to what is done here. I think this is the best way to do it, and I will mask it like this; let me know if you think this is good, and if not, I am open to ideas.

In theory, I think it should work the same (I can come up with a quick proof or something, maybe), and this is the only logical way I see to handle padding the inputs. I'll leave this issue open until I make the pull request.
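To make that concrete, here is a minimal sketch of the idea in plain PyTorch (a hypothetical helper, not the code in this repo): the padding mask just zeroes out the padded rows of Q, K, and V before the E/F projections mix the n positions together.

```python
import torch

def mask_qkv(q, k, v, input_mask):
    # q, k, v: (batch, n, d) tensors; input_mask: (batch, n) bool, True = keep.
    # Hypothetical helper illustrating the idea, not the library's code.
    m = input_mask.unsqueeze(-1).type_as(q)  # (batch, n, 1)
    return q * m, k * m, v * m

# Toy usage
b, n, d = 2, 8, 4
q, k, v = (torch.randn(b, n, d) for _ in range(3))
mask = torch.ones(b, n, dtype=torch.bool)
mask[:, 6:] = False  # last two positions are padding
q, k, v = mask_qkv(q, k, v, mask)
# The masked K and V are then projected from length n down to k as usual.
```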

tatp22 commented 4 years ago

I added more information in the PR. If it looks good, let me know :+1:; otherwise I will just merge it anyway, since it really doesn't affect the rest of the code too much.

zackchen-lb commented 4 years ago

I agree with masking the projection matrices directly, and thanks for the updates. But maybe there are some other possible solutions? I'm also working on this, but based on fairseq. Thanks again.

tatp22 commented 4 years ago

Not sure what else to think of :shrug: This is the best solution I could come up with; I even emailed the authors of the paper about this, but no response, unfortunately...

If no one says anything else, I will merge this soon.

tatp22 commented 4 years ago

Merging

tatp22 commented 4 years ago

Actually, sorry to keep closing and reopening this, but another possibility would be to mask in the traditional sense: instead of the entire (n,n) masking matrix, we can just make an (n,k) masking matrix, make it upper triangular as well, and have the second dimension cut off at k. This could work as well; let me know what you think.
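For illustration, a minimal sketch of that cut-off triangular mask in plain PyTorch (convention assumed here: True marks a disallowed future position; this is just the idea, not the repo's exact code):

```python
import torch

n, k = 8, 4  # toy sequence length and projected dimension

# Full (n, n) causal mask: True marks future positions that must not be attended to.
full_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)

# The (n, k) variant described above: the same triangular mask with the
# second dimension simply cut off at k columns.
cut_mask = full_mask[:, :k]

# Applying it to (batch, heads, n, k) attention scores before the softmax
scores = torch.randn(1, 2, n, k)
scores = scores.masked_fill(cut_mask, float("-inf"))
attn = torch.softmax(scores, dim=-1)
```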

tatp22 commented 3 years ago

Hey, I added another upper triangular padding mask option for causal sequences; let me know what you think. It's up in the latest commit, and if the dims are (n,k), it simply cuts off the right part of the matrix.

I'll close this issue now, since I added both options and now users can use whichever one they want.

zackchen-lb commented 3 years ago

> Hey, I added another upper triangular padding mask option for causal sequences, let me know what you think. It's up on the latest commit, and if the dims are (n,k), it simply cuts off the right part of the matrix.
>
> I'll close this issue now, since I added both options and now users can use whichever one they want.

Thanks for the updates. Wonderful solution.