Implementation of Memorizing Transformers (ICLR 2022), attention net augmented with indexing and retrieval of memories using approximate nearest neighbors, in Pytorch
MIT License
Arguments to reproduce the models from the original paper? #4
Hi lucidrains,
This looks like excellent work! I have gone through the original paper and your repo, and am now trying to reproduce the model from the paper as closely as possible. Of course, the modifications you made such as hybrid attention instead of sigmoid gate are fine.
Specifically, I would like to be able to try some of the variations in Table 4:

![image](https://user-images.githubusercontent.com/9099139/173904127-be6c495f-3502-4a06-b9c3-f6c861b539fa.png)
Suppose I'm interested in the 4th-to-last row, with Context 512, Memory 8192, and XL cache 512. Can you help me with the model arguments to do that? Here is my initial attempt, with reference to Section 4.2:
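For concreteness, here is a sketch of what I imagine the arguments might look like with this repo's `MemorizingTransformer` class. The vocabulary size, depth, head dimensions, retrieval count, and which layers get kNN memory or XL recurrence are all my guesses rather than settings confirmed by the paper or the repo defaults; only the 8192 memory size and 512 XL cache come from the Table 4 row itself:

```python
from memorizing_transformers_pytorch import MemorizingTransformer

# Sketch only -- num_tokens, dim, depth, layer indices, and
# num_retrieved_memories are assumptions, not settings from the paper.
model = MemorizingTransformer(
    num_tokens = 20000,               # vocabulary size (assumption)
    dim = 512,                        # model dimension (assumption)
    depth = 8,                        # number of layers (assumption)
    memorizing_layers = (4,),         # which layer(s) use kNN memory (assumption)
    max_knn_memories = 8192,          # "Memory 8192" column of Table 4
    num_retrieved_memories = 32,      # k for kNN retrieval (assumption)
    xl_memory_layers = (2, 3, 4, 5),  # layers with XL recurrence (assumption)
    xl_max_memories = 512,            # "XL cache 512" column of Table 4
)
```

The "Context 512" column would then correspond to training on segments of length 512, which is handled by the data pipeline rather than the model constructor, if I understand correctly.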
A second question: what are the model arguments to reproduce the first row of Table 4, with neither memory nor an XL cache? Thanks in advance.
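My guess for the no-memory, no-XL-cache baseline is below. Whether the constructor accepts an empty `memorizing_layers` tuple and a zero `xl_max_memories` is an assumption on my part that would need verifying against the repo:

```python
from memorizing_transformers_pytorch import MemorizingTransformer

# Sketch: attempting to disable both mechanisms. Whether empty/zero values
# are accepted here is an assumption, not something I've confirmed.
model = MemorizingTransformer(
    num_tokens = 20000,       # vocabulary size (assumption)
    dim = 512,                # model dimension (assumption)
    depth = 8,                # number of layers (assumption)
    memorizing_layers = (),   # no kNN memory
    xl_max_memories = 0,      # no XL cache
)
```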