hkristof03 opened this issue 1 year ago
Yeah, I can't find any specifics in this paper @hkristof03, but I've seen scaled dot-product attention (Luong-style) used in other papers such as SASRec.
The input is the sequence of items the user has interacted with. It's self-attention, so the query, key, and value are all the same sequence (see the sketch below).
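For concreteness, here's a minimal sketch (not from the paper; the item counts, dimensions, and names are made up for illustration) of self-attention over a user's interaction history, where the same embedded sequence is passed as query, key, and value:

```python
import tensorflow as tf

# Illustrative sizes, not from the paper.
NUM_ITEMS = 10_000   # item-id vocabulary
EMBED_DIM = 64       # item embedding size
SEQ_LEN = 20         # length of the padded interaction history

# A batch of padded item-id histories, shape (batch, SEQ_LEN).
item_history = tf.keras.Input(shape=(SEQ_LEN,), dtype=tf.int32)

# Embed the interacted items; mask_zero marks id 0 as padding
# (depending on the TF version you may still want to pass an
# explicit attention_mask below).
item_embeddings = tf.keras.layers.Embedding(
    NUM_ITEMS, EMBED_DIM, mask_zero=True
)(item_history)

# Scaled dot-product self-attention: query, key, and value are the
# same sequence of item embeddings.
attended = tf.keras.layers.MultiHeadAttention(
    num_heads=2, key_dim=EMBED_DIM
)(query=item_embeddings, value=item_embeddings, key=item_embeddings)

# `attended` has shape (batch, SEQ_LEN, EMBED_DIM); pooling it into a
# single user embedding is discussed further down in this thread.
```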
There are some useful implementations of Transformer encoder blocks in the keras-nlp package here.
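If it helps, a rough sketch of wiring one of those pre-built blocks into a history encoder (the layer name is from keras-nlp; the sizes and the "id 0 = padding" convention are my own assumptions):

```python
import tensorflow as tf
import keras_nlp  # newer releases ship this functionality as keras_hub

SEQ_LEN, NUM_ITEMS, EMBED_DIM = 20, 10_000, 64  # illustrative sizes

item_history = tf.keras.Input(shape=(SEQ_LEN,), dtype=tf.int32)
x = tf.keras.layers.Embedding(NUM_ITEMS, EMBED_DIM)(item_history)

# One self-attention encoder block from keras-nlp; stack more if needed.
x = keras_nlp.layers.TransformerEncoder(
    intermediate_dim=128, num_heads=2
)(x, padding_mask=tf.not_equal(item_history, 0))  # assume id 0 = padding
```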
I haven't looked at the paper yet. In my experience, if sequential history is essential, you can use self-attention or even a Transformer encoder to learn the interactions and extract features, and then put an additive attention layer after the sequence of output embeddings to aggregate them into one global embedding. You can refer to the Microsoft team's implementation. In some places this is also called attentive pooling.
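A minimal sketch of such an attentive-pooling layer (my own naming and a generic score-then-weighted-sum formulation, not a copy of any particular library's implementation):

```python
import tensorflow as tf

class AdditiveAttentionPooling(tf.keras.layers.Layer):
    """Attentive pooling: collapses (batch, seq_len, dim) -> (batch, dim).

    Scores each position with a small feed-forward net and returns the
    softmax-weighted sum of the sequence embeddings.
    """

    def __init__(self, attention_dim=64, **kwargs):
        super().__init__(**kwargs)
        self.projection = tf.keras.layers.Dense(attention_dim, activation="tanh")
        self.score = tf.keras.layers.Dense(1, use_bias=False)

    def call(self, sequence_embeddings):
        # (batch, seq_len, 1): one unnormalised score per position.
        scores = self.score(self.projection(sequence_embeddings))
        weights = tf.nn.softmax(scores, axis=1)
        # Weighted sum over the sequence axis -> one global embedding.
        return tf.reduce_sum(weights * sequence_embeddings, axis=1)

# Usage after the encoder outputs above, e.g.:
# user_embedding = AdditiveAttentionPooling()(attended)  # (batch, EMBED_DIM)
```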
@OmarMAmin mentioned the paper "Exploring Heterogeneous Metadata for Video Recommendation with Two-tower Model" in this discussion.
I read through the paper multiple times, along with a few resources on existing attention mechanisms, but I haven't been able to figure out the following questions:
From the paper:
@patrickorlando, do you maybe have an idea related to this?