hkristof03 opened this issue 1 year ago
Yeah, I can't find any specifics in this paper @hkristof03, but I've seen scaled dot-product attention (Luong-style) used in other papers such as SASRec.
The input is the sequence of items the user has interacted with. It's self-attention, so the query, key, and value are all the same sequence (see the sketch below).
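For concreteness, here's a minimal sketch (not from the paper; the item counts, dimensions, and names are made up for illustration) of self-attention over a user's interaction history, where the same embedded sequence is passed as query, key, and value:

```python
import tensorflow as tf

# Illustrative sizes, not from the paper.
NUM_ITEMS = 10_000   # item-id vocabulary
EMBED_DIM = 64       # item embedding size
SEQ_LEN = 20         # length of the padded interaction history

# A batch of padded item-id histories, shape (batch, SEQ_LEN).
item_history = tf.keras.Input(shape=(SEQ_LEN,), dtype=tf.int32)

# Embed the interacted items; mask_zero marks id 0 as padding
# (depending on the TF version you may still want to pass an
# explicit attention_mask below).
item_embeddings = tf.keras.layers.Embedding(
    NUM_ITEMS, EMBED_DIM, mask_zero=True
)(item_history)

# Scaled dot-product self-attention: query, key, and value are the
# same sequence of item embeddings.
attended = tf.keras.layers.MultiHeadAttention(
    num_heads=2, key_dim=EMBED_DIM
)(query=item_embeddings, value=item_embeddings, key=item_embeddings)

# `attended` has shape (batch, SEQ_LEN, EMBED_DIM); pooling it into a
# single user embedding is discussed further down in this thread.
```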
There are some useful implementations of Transformer encoder blocks in the keras-nlp package here.
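If it helps, a rough sketch of wiring one of those pre-built blocks into a history encoder (the layer name is from keras-nlp; the sizes and the "id 0 = padding" convention are my own assumptions):

```python
import tensorflow as tf
import keras_nlp  # newer releases ship this functionality as keras_hub

SEQ_LEN, NUM_ITEMS, EMBED_DIM = 20, 10_000, 64  # illustrative sizes

item_history = tf.keras.Input(shape=(SEQ_LEN,), dtype=tf.int32)
x = tf.keras.layers.Embedding(NUM_ITEMS, EMBED_DIM)(item_history)

# One self-attention encoder block from keras-nlp; stack more if needed.
x = keras_nlp.layers.TransformerEncoder(
    intermediate_dim=128, num_heads=2
)(x, padding_mask=tf.not_equal(item_history, 0))  # assume id 0 = padding
```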
I haven't looked at the paper yet. In my experience, if sequential history is essential, you can use self-attention or even a Transformer encoder to learn the interactions and extract features, and then put an additive attention layer after the sequence of output embeddings to aggregate them into one global embedding. You can refer to the Microsoft team's implementation. In some places this is also called attentive pooling.
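A minimal sketch of such an attentive-pooling layer (my own naming and a generic score-then-weighted-sum formulation, not a copy of any particular library's implementation):

```python
import tensorflow as tf

class AdditiveAttentionPooling(tf.keras.layers.Layer):
    """Attentive pooling: collapses (batch, seq_len, dim) -> (batch, dim).

    Scores each position with a small feed-forward net and returns the
    softmax-weighted sum of the sequence embeddings.
    """

    def __init__(self, attention_dim=64, **kwargs):
        super().__init__(**kwargs)
        self.projection = tf.keras.layers.Dense(attention_dim, activation="tanh")
        self.score = tf.keras.layers.Dense(1, use_bias=False)

    def call(self, sequence_embeddings):
        # (batch, seq_len, 1): one unnormalised score per position.
        scores = self.score(self.projection(sequence_embeddings))
        weights = tf.nn.softmax(scores, axis=1)
        # Weighted sum over the sequence axis -> one global embedding.
        return tf.reduce_sum(weights * sequence_embeddings, axis=1)

# Usage after the encoder outputs above, e.g.:
# user_embedding = AdditiveAttentionPooling()(attended)  # (batch, EMBED_DIM)
```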
@OmarMAmin mentioned the paper "Exploring Heterogeneous Metadata for Video Recommendation with Two-tower Model" in this discussion.
I read through the paper multiple times, along with a few resources on existing attention mechanisms, but I haven't been able to figure out the following questions:
From the paper:
@patrickorlando, do you maybe have an idea related to this?