Hi Phil,

Want to confirm the reason behind this design choice: https://github.com/lucidrains/perceiver-pytorch/blob/c3d505a997a6e3521e83d7d2bf57cb8b62e3fbd6/perceiver_pytorch/perceiver_pytorch.py#L194-L210

In the paper, they say that they tie all the latent transformer weights. However, in this implementation, the latent transformer in the first layer is not shared with the rest.

It should probably be

```python
for block_ind in range(self_per_cross_attn):
    self_attns.append(nn.ModuleList([
        get_latent_attn(_cache = True, key = block_ind),
        get_latent_ff(_cache = True, key = block_ind)
    ]))
```

What do you think?
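For context, the snippet above assumes `cache_fn` can cache per key, so that every depth reuses the same module for a given `block_ind` while different `block_ind`s within a layer still get distinct modules. As far as I can tell, the `cache_fn` at the referenced commit only keeps a single cached instance per wrapped constructor, so something along these lines would be needed as well (just a rough sketch, not the actual implementation):

```python
from functools import wraps

# Sketch of a key-aware cache_fn: it keeps one cached module per key, so every
# depth that asks for the same key gets the same instance (tied weights), while
# different block_inds within a layer still construct separate modules.
def cache_fn(f):
    cache = dict()
    @wraps(f)
    def cached_fn(*args, _cache = True, key = None, **kwargs):
        if not _cache:
            return f(*args, **kwargs)
        if key in cache:
            return cache[key]
        result = f(*args, **kwargs)
        cache[key] = result
        return result
    return cached_fn
```

With that in place, calling `get_latent_attn(_cache = True, key = block_ind)` at every depth (first layer included) returns the same cached module for a given `block_ind`, which would match the full weight tying described in the paper.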