Hi Phil,

Want to confirm the reason behind this design choice: https://github.com/lucidrains/perceiver-pytorch/blob/c3d505a997a6e3521e83d7d2bf57cb8b62e3fbd6/perceiver_pytorch/perceiver_pytorch.py#L194-L210

In the paper, they say that they tie all the latent transformer weights. However, in this implementation, the latent transformer in the first layer is not shared with the rest.

It should probably be

```python
for block_ind in range(self_per_cross_attn):
    self_attns.append(nn.ModuleList([
        get_latent_attn(_cache = True, key = block_ind),
        get_latent_ff(_cache = True, key = block_ind)
    ]))
```

What do you think?
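For context, the snippet above assumes `cache_fn` can cache per key, so that every depth reuses the same module for a given `block_ind` while different `block_ind`s within a layer still get distinct modules. As far as I can tell, the `cache_fn` at the referenced commit only keeps a single cached instance per wrapped constructor, so something along these lines would be needed as well (just a rough sketch, not the actual implementation):

```python
from functools import wraps

# Sketch of a key-aware cache_fn: it keeps one cached module per key, so every
# depth that asks for the same key gets the same instance (tied weights), while
# different block_inds within a layer still construct separate modules.
def cache_fn(f):
    cache = dict()
    @wraps(f)
    def cached_fn(*args, _cache = True, key = None, **kwargs):
        if not _cache:
            return f(*args, **kwargs)
        if key in cache:
            return cache[key]
        result = f(*args, **kwargs)
        cache[key] = result
        return result
    return cached_fn
```

With that in place, calling `get_latent_attn(_cache = True, key = block_ind)` at every depth (first layer included) returns the same cached module for a given `block_ind`, which would match the full weight tying described in the paper.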