Closed HaisongDing closed 2 years ago
@HaisongDing Hi Haisong, many thanks for your attention. Please find the implementation of the attention modules here, and the retrieval function here. This Python code implements the equations in the paper exactly, organized for efficiency, but your version looks simpler and cleverer. Also, we suggest implementing training and inference (argmax + indexing) differently, since this benefits inference performance. Best, Hongfei
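To illustrate the "argmax + indexing" inference path mentioned above, here is a minimal PyTorch sketch (not the authors' code; `hard_attention_inference` and its score normalization are assumptions): at inference time, each query simply selects the single highest-scoring key and indexes the corresponding value row, so no weighted sum over all positions is needed.

```python
import torch

def hard_attention_inference(q, k, v):
    # Sketch of inference-time hard retrieval attention (hypothetical helper,
    # not the repository's implementation).
    # q: (..., Lq, D), k: (..., Lk, D), v: (..., Lk, Dv)
    scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5  # (..., Lq, Lk)
    idx = scores.argmax(dim=-1)                           # (..., Lq) hard choice
    # Index the chosen value row per query instead of a softmax-weighted sum.
    idx = idx.unsqueeze(-1).expand(*idx.shape, v.size(-1))
    return torch.gather(v, -2, idx)                       # (..., Lq, Dv)
```

Because the forward pass reduces to an argmax followed by a gather, this path avoids the softmax and the full attention-weighted average, which is where the inference speedup comes from.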
Thanks for your quick response.
Hi Hongfei, does this repo also contain the implementation of your "Learning Hard Retrieval Decoder Attention for Transformers" paper? If not, will it be released? Based on my understanding, the "hard retrieval" is achieved by replacing P with P' = MultinomialSampling(P), then P = (P' - P).detach() + P. Please kindly correct me if I am wrong.
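The trick described above, (P' - P).detach() + P, is a straight-through estimator: the forward pass uses the sampled one-hot weights, while gradients flow through the soft probabilities. A minimal PyTorch sketch of that reading (a hypothetical `hard_retrieval_weights` helper, not the paper's code):

```python
import torch

def hard_retrieval_weights(scores):
    # Sketch of training-time hard retrieval via a straight-through estimator
    # (assumed interpretation of the question above, not the authors' code).
    P = torch.softmax(scores, dim=-1)              # soft attention probabilities
    flat = P.reshape(-1, P.size(-1))
    idx = torch.multinomial(flat, 1)               # sample one key per query
    P_hard = torch.zeros_like(flat).scatter_(-1, idx, 1.0).reshape_as(P)
    # Forward value is the one-hot sample; backward gradient is that of soft P.
    return (P_hard - P).detach() + P
```

Numerically the returned weights equal the one-hot sample, so each query attends to exactly one key, while backpropagation treats the expression as if it were the softmax P.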