lucidrains / performer-pytorch

An implementation of Performer, a linear attention-based transformer, in PyTorch
MIT License

[Feature] EncoderDecoder framework, similar to ReformerEncDec #38

Closed · gulnazaki closed this issue 3 years ago

gulnazaki commented 3 years ago

Hello Phil,

Nice job on this great architecture. I want to use it as an Encoder-Decoder within DeepSpeed, so I am thinking of writing a wrapper similar to the one you did for Reformer. Do you have any tips on what to pay attention to (no pun intended), and whether I need to use padding as in Autopadder?

Thanks

lucidrains commented 3 years ago

Normally I could build you this within the hour, but my desktop machine with my GPUs finally broke after a streak of good use, and we have stay-at-home orders due to the pandemic in the States. :( I'll get around to this!

gulnazaki commented 3 years ago

Oh, I am sorry to hear this, I wish your desktop a speedy recovery!

Don't worry, I am now doing some experiments with the Reformer and if I have time I will look at wrapping the Performer myself. I am sure you could do it way more easily since you wrote the thing though.

Thanks for your interest and let me know if you try something out :)

lucidrains commented 3 years ago

Thank you!

Do submit a PR if you end up getting it working!

lucidrains commented 3 years ago

@gulnazaki here's another template https://github.com/lucidrains/x-transformers/blob/main/x_transformers/x_transformers.py#L463

lucidrains commented 3 years ago

Oh, also, no autopadder needed! Performer doesn't use the bucketing scheme like in Reformer. And as for the local attention, I built autopadding right into that library by default :)

gulnazaki commented 3 years ago

Thanks, it took me a while but I figured out it is overkill for this architecture.

So, to be sure and to clear up some questions I had: is the correct way to pad variable-length sequences to pad to the model's max_seq_len, or to the longest sequence in the batch? Also, should padding_value=\<pad> and ignore_index=\<pad> be used together with a boolean mask? And I can see you are padding for the AutoregressiveWrapper (decoder) only; shouldn't padding be applied to the encoder as well? Thanks!

lucidrains commented 3 years ago

@gulnazaki Just the max sequence of the batch will do! (and correspondingly, the mask should be padded with Falses)
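
Something like this, as a minimal sketch in plain PyTorch (using 0 as the pad id here, which is just an assumption/convention):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # assumed pad id; match it to whatever you use as ignore_index

# three variable-length token sequences
seqs = [torch.randint(1, 100, (n,)) for n in (5, 9, 7)]

# pad to the longest sequence in the batch, not to the model's max_seq_len
x = pad_sequence(seqs, batch_first=True, padding_value=PAD_ID)           # (3, 9)

# boolean mask: True on real tokens, padded out with False
mask = pad_sequence([torch.ones(len(s), dtype=torch.bool) for s in seqs],
                    batch_first=True, padding_value=False)                # (3, 9)
```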

lucidrains commented 3 years ago

ohhh oops, yeah, I just set the ignore_index by default to 0, which is the usual padding value

yup, as long as your mask is passed into both the encoder and the decoder, attention will be masked out for those padded elements

lucidrains commented 3 years ago

and yup, it should be used together with the boolean mask, since you are masking out the attention too, in addition to ignoring the loss of the padded elements
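
To make the loss side concrete, here is a tiny sketch of the ignore_index part on its own (plain PyTorch, shapes made up):

```python
import torch
import torch.nn.functional as F

PAD_ID = 0  # assumed pad id, matching ignore_index

logits  = torch.randn(2, 9, 100)         # (batch, seq, num_tokens)
targets = torch.randint(1, 100, (2, 9))  # token ids
targets[0, 7:] = PAD_ID                  # the shorter sequence is padded out

# cross entropy wants (batch, num_tokens, seq); padded positions contribute nothing
loss = F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=PAD_ID)
```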

gulnazaki commented 3 years ago

Thank you. Btw, you could also change the default to 0 in ReformerEncDec (it currently defaults to -100).

lucidrains commented 3 years ago

@gulnazaki oh crap, you are right :| thank you!

lucidrains commented 3 years ago

done

lucidrains commented 3 years ago

hurray, my desktop has been fixed! thank you amazon prime

gulnazaki commented 3 years ago

Nice! I finished some experiments with Reformer (wow, generation is slow), so I could start working on the PR if you are not already working on it.

gulnazaki commented 3 years ago

I believe it is going to be ready in a while; it was easier than I thought, since all the pieces are there. I can see there is a difference from the Reformer repo, where you use input_mask, while here you use mask. Actually, I can see they are used quite interchangeably: Performer expects mask, while AutoregressiveWrapper expects input_mask. Is there any difference between the two? If not, would you mind if I use input_mask everywhere to avoid confusion?

gulnazaki commented 3 years ago

I am also not sure whether I should pass return_encodings = True to the encoder, or leave it as False and do the matrix multiplication myself.

lucidrains commented 3 years ago

@gulnazaki Lol yea, Reformer is like the opposite of Performer. It's super slow

Yup, I gradually shifted the naming down to just mask. I think it's probably better to stick with mask here, just to stay simple and consistent

Yup, the encoder would do its forward pass with the return_encodings set to True

I was about to start on it, but having an extra contributor would be very welcome :D I'll wait for your PR
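
Roughly, the forward pass you're describing would look like this (just a sketch; the cross_attend / context / context_mask kwarg names are assumptions about the final wrapper, not a spec):

```python
import torch
from performer_pytorch import PerformerLM

# encoder is non-causal; decoder is causal and attends over the encodings.
# cross_attend, context and context_mask are assumed names here
enc = PerformerLM(num_tokens=256, max_seq_len=1024, dim=512, depth=2, heads=8)
dec = PerformerLM(num_tokens=256, max_seq_len=1024, dim=512, depth=2, heads=8,
                  causal=True, cross_attend=True)

src, tgt = torch.randint(0, 256, (2, 128)), torch.randint(0, 256, (2, 128))
src_mask, tgt_mask = torch.ones(2, 128).bool(), torch.ones(2, 128).bool()

encodings = enc(src, mask=src_mask, return_encodings=True)                      # (2, 128, 512)
logits    = dec(tgt, mask=tgt_mask, context=encodings, context_mask=src_mask)   # (2, 128, 256)
```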

gulnazaki commented 3 years ago

Ok, it is ready. One last question! I see that there is an issue with the forward method of the decoder's AutoregressiveWrapper when we use a list of variable-length tensors: the mask has to be shorter than the longest tensor by 1.

I am a bit confused by the whole padding and masking part, and by xi and xo (I guess it is because of the \<bos> and \<eos>), but I will submit it as is; let me know if it is correct.
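
For reference, this is the slicing I mean (a stripped-down sketch of the teacher-forcing part, not the actual wrapper code):

```python
import torch

x    = torch.randint(1, 100, (2, 10))  # padded token ids, (batch, seq_len)
mask = torch.ones(2, 10).bool()

xi, xo = x[:, :-1], x[:, 1:]           # inputs predict the next token (shifted by 1)
mask   = mask[:, :-1]                  # so the mask must match xi: seq_len - 1

# logits = decoder(xi, mask=mask)      # loss against xo, with padded targets ignored
```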

gulnazaki commented 3 years ago

Also, I don't know if this is a bug, but there is no error when I pass an invalidly shaped mask as the decoder mask (when using a tensor as input).

gulnazaki commented 3 years ago

That is because the mask is popped and not pushed back onto kwargs if its shape differs from that of x.
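
Something along these lines (a paraphrase of the behaviour, not the actual library code):

```python
import torch

def prepare_inputs(x, **kwargs):
    xi, xo = x[:, :-1], x[:, 1:]
    mask = kwargs.pop('mask', None)
    if mask is not None and mask.shape[1] == x.shape[1]:
        kwargs['mask'] = mask[:, :-1]  # only pushed back when the shapes line up
    return xi, xo, kwargs

x = torch.randint(1, 100, (2, 10))
bad_mask = torch.ones(2, 7).bool()     # wrong length, yet no error is raised
xi, xo, kwargs = prepare_inputs(x, mask=bad_mask)
print('mask' in kwargs)                # False: the invalid mask is silently discarded
```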

lucidrains commented 3 years ago

@gulnazaki looks great! merged and released! I'll clean it up some time later this week and add a feature or two :) thank you!

gulnazaki commented 3 years ago

You're welcome