Hi. I was going through your code for self-study purposes and noticed that only the query is layer normalized. This is also the case in the original TensorFlow implementation (https://github.com/kang205/SASRec/blob/master/model.py#L54). Is there a reason why you do this?
I would have assumed that the inputs to the query, key, and value should all be the same; I've implemented the Transformer myself and have done it that way.
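For reference, here's roughly the pattern I mean (a minimal PyTorch sketch, not the repo's exact code; `SASRecAttentionBlock` is just an illustrative name):

```python
import torch
import torch.nn as nn

class SASRecAttentionBlock(nn.Module):
    """Sketch of a SASRec-style attention block: only the query path is
    layer-normalized; keys and values use the raw hidden states, mirroring
    the linked TF line."""

    def __init__(self, hidden_dim, num_heads, dropout=0.2):
        super().__init__()
        self.ln = nn.LayerNorm(hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                          dropout=dropout, batch_first=True)

    def forward(self, x, attn_mask=None):
        q = self.ln(x)                    # only the query is normalized
        out, _ = self.attn(q, x, x,       # keys/values are the raw input
                           attn_mask=attn_mask)
        return x + out                    # residual connection
```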
Hi Sean, thanks for the question!
Yes, as you observed, I did this to stay consistent with the original TF implementation.
I'm not the paper author, so the authors may be able to give a more reliable answer, but here's my attempt:
For this application (a recommender system), each item is mapped to an embedding. Here the query is a sequence of items, and we want to use it to find the item the user is most likely to interact with next. An intuitive approach is to use the Transformer to map the item sequence into a single embedding, then take the dot product with all item embeddings (much like the vanilla matrix factorization method for recommendation) to produce scores. For the keys (the items being scored), we keep the raw embeddings rather than passing them through the Transformer.
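To make that concrete, here's a rough sketch of the scoring scheme I described (the names and signatures are illustrative, not this repo's actual API):

```python
import torch

def score_candidates(seq_encoder, item_emb, item_ids):
    """Score all candidate items for a batch of interaction histories.

    seq_encoder: Transformer mapping (B, L) item ids -> (B, L, D) states.
    item_emb:    nn.Embedding table of raw item embeddings, shape (V, D).
    item_ids:    (B, L) tensor of interaction histories.
    Returns a (B, V) tensor of scores over all candidate items.
    """
    hidden = seq_encoder(item_ids)         # (B, L, D)
    user_state = hidden[:, -1, :]          # state at the last position
    # Candidates keep their raw embeddings -- they are NOT passed through
    # the Transformer, per the discussion above.
    return user_state @ item_emb.weight.T  # (B, V)
```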
I tried using the same Transformer model to encode the keys as well before the final dot product. It didn't work very well (in terms of evaluation metrics) and consumed much more compute. My guess is that this kind of attempt needs another model, or at least not the same one used to encode the item sequences.
The original paper, written years ago, focused on showing that Transformers work well for recommendation. We're free to refactor the model for current needs rather than keeping the original paradigm, but simply encoding both sides (the item sequence and the candidate items for next-item prediction) with the same model did not work well in my personal experiments.
Thanks for the answer. Actually, I've noticed that my own SASRec implementation isn't achieving good performance either, and I'm trying to see what I did differently from you or the original implementation. The layer normalization was just one of many differences lol. I'll keep this in mind!