Closed gulnazaki closed 3 years ago
Normally I could build you this within the hour, but my desktop machine with my GPUs finally broke after a streak of good use, and we have stay-at-home orders due to the pandemic in the States. :( I'll get around to this!
Oh, I am sorry to hear this, I wish your desktop a speedy recovery!
Don't worry, I am now doing some experiments with the Reformer and if I have time I will look at wrapping the Performer myself. I am sure you could do it way more easily since you wrote the thing though.
Thanks for your interest and let me know if you try something out :)
Thank you!
Do submit a PR if you end up getting it working!
@gulnazaki here's another template https://github.com/lucidrains/x-transformers/blob/main/x_transformers/x_transformers.py#L463
Oh, also, no autopadder needed! Performer doesn't use the bucketing scheme like in Reformer. And as for the local attention, I built autopadding right into that library by default :)
Thanks, it took me a while but I figured out it is overkill for this architecture.
So, to be sure and resolve some questions I had: is the correct way to pad variable-length sequences to pad to the model's max_seq_len, or to the longest sequence in the batch? Also, should padding_value=<pad> and ignore_index=<pad> be used together with a boolean mask? And I can see you are padding for the AutoregressiveWrapper (decoder) only; shouldn't padding be applied to the encoder as well? Thanks!
@gulnazaki Just the max sequence of the batch will do! (and correspondingly, the mask should be padded with False's)

ohhh oops, yeah, I just set the ignore_index by default to 0, which is the usual padding value

yup, as long as your mask is passed into both the encoder and the decoder, attention will be masked out for those padded elements

and yup, it should be used together with the boolean mask, since you are masking out the attention too, in addition to ignoring the loss of the padded elements
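For reference, a minimal sketch of the padding/masking setup described above, assuming PyTorch, a pad id of 0, and a batch given as a list of variable-length LongTensors (the helper name pad_batch is just for illustration):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # assumption: 0 is both the padding value and the ignore_index

def pad_batch(seqs):
    """Pad a list of variable-length 1D LongTensors to the longest
    sequence in the batch and build the matching boolean mask
    (True = real token, False = padding)."""
    lengths = torch.tensor([len(s) for s in seqs])
    padded = pad_sequence(seqs, batch_first=True, padding_value=PAD_ID)
    mask = torch.arange(padded.shape[1])[None, :] < lengths[:, None]
    return padded, mask

# example: three sequences of lengths 5, 9 and 7 get padded to length 9
seqs = [torch.randint(1, 100, (n,)) for n in (5, 9, 7)]
x, mask = pad_batch(seqs)  # x: (3, 9) long, mask: (3, 9) bool
```

The same mask then goes into both the encoder and the decoder, and the loss ignores the padded target positions via ignore_index=PAD_ID.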
Thank you. Btw you could also change it to default to 0 on ReformerEncDec (it currently defaults to -100).
@gulnazaki oh crap, you are right :| thank you!
done
hurray, my desktop has been fixed! thank you amazon prime
Nice! I finished some experiments with Reformer (wow generation is slow), so I could start working on the PR if you are not working on it now.
I believe it is going to be ready in a while; it was easier than I thought, since all the pieces are there. I can see there is a difference from the reformer repo, where you use input_mask, while here you use mask. Actually, they seem to be used somewhat interchangeably: Performer expects mask, while AutoregressiveWrapper expects input_mask. Is there any difference between the two? If not, would you mind if I use input_mask everywhere to avoid confusion?

I am also not sure whether I should pass return_encodings = True to the encoder, or leave it as False and do the matrix multiplication myself.
@gulnazaki Lol yea, Reformer is like the opposite of Performer. It's super slow
Yup, I gradually shifted the naming down to just mask. I think it's probably better to stick with mask here, just to stay simple and consistent

Yup, the encoder would do its forward pass with return_encodings set to True
I was about to start on it, but having an extra contributor would be very welcome :D I'll wait for your PR
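To make the wiring concrete, here is a rough sketch of how such an encoder-decoder wrapper could look, based on the points above; the class name PerformerEncDec and the keyword names context / context_mask are assumptions for illustration, not necessarily the library's actual API:

```python
import torch.nn as nn

class PerformerEncDec(nn.Module):
    """Hypothetical enc-dec wrapper: the encoder returns hidden states
    (return_encodings=True) and the decoder cross-attends to them."""
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder  # a Performer language model used as the encoder
        self.decoder = decoder  # an AutoregressiveWrapper around a Performer decoder

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        # encoder forward pass returns encodings instead of logits
        encodings = self.encoder(src, mask=src_mask, return_encodings=True)
        # a single `mask` keyword everywhere, as agreed above; the source mask
        # is passed along so cross-attention ignores padded source tokens
        return self.decoder(tgt, mask=tgt_mask,
                            context=encodings, context_mask=src_mask)
```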
Ok, it is ready. One last question! I see that there is an issue with the forward method of the decoder's AutoregressiveWrapper when we use a list of variable-length tensors: the mask has to be smaller than the largest tensor by 1. I am a bit confused by the whole padding and masking part, and by xi, xo (I guess it is because of the <bos> and <eos>), but I will submit as is; let me know if it is correct.

Also, I don't know if this is a bug, but there is no error when I pass a mask of invalid shape as a decoder mask (when using a tensor as input), because the mask is popped and not pushed back onto kwargs if its shape differs from that of x.
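A simplified illustration (not the library's exact code) of why the decoder mask ends up one position shorter than the padded input: the autoregressive wrapper shifts the sequence into an input xi and a target xo before computing the loss, so the mask must be trimmed to match xi:

```python
import torch.nn.functional as F

def autoregressive_loss(net, x, mask=None, ignore_index=0):
    # x: (batch, seq_len) padded token ids; mask: (batch, seq_len) bool
    xi, xo = x[:, :-1], x[:, 1:]  # input drops the last token, target drops the first
    if mask is not None:
        mask = mask[:, :-1]       # the mask must match xi, hence seq_len - 1
    logits = net(xi, mask=mask)   # (batch, seq_len - 1, num_tokens)
    return F.cross_entropy(logits.transpose(1, 2), xo, ignore_index=ignore_index)
```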
@gulnazaki looks great! merged and released! I'll clean it up some time later this week and add a feature or two :) thank you!
You're welcome
Hello Phil,
Nice job on this great architecture. I want to use it as an Encoder-Decoder within DeepSpeed, so I am thinking of writing a wrapper similar to the one you did for Reformer. Do you have any tips on what to pay attention to (no pun intended) and whether I need to use padding as in Autopadder?
Thanks