Implementing a small ViT-VQGAN

Hello!

I am trying to implement a ViT-VQGAN architecture (publication) for a project and from my understanding I can use this repository. However, I am having a bit of trouble understanding all the parameters available in the repo and how to match with what I have on the paper.

In page 5 of the publication the authors show the parameters for the ViT small encoder network which I am trying to recreate, which are the following:

blocks: 8
heads: 8
model dimension: 512
hidden dimension: 2048
dropout: 0
tokens: 1024

I am mainly having trouble with the hidden dimension parameter. (Probably I am doing something wrong due to my lack of experience with this repo)

Currently I was trying the following for the encoder part:

encoder = ViTransformerWrapper(
    image_size = 256,
    patch_size = 32,
    attn_layers = Encoder(
        dim = 512,
        depth=8,
        heads=8,
        ff_dim = 2048,
    )
)

which returns the following error:

TypeError: FeedForward.__init__() got multiple values for argument 'dim'

I also tried to use the TransformerWrapper or XTransformer class instead of the Encoder, however I obviously got an error.

I also took a look at the vit-vqgan defined in the party-pytorch repo of yours, but If I understand correctly there are a few differences from the original ViT-VQGAN (correct me if I am wrong) and I was not able to make it work as I intended (again I probably messed up hyperparameters).

Could you kindly guide me on how to define the ViT encoder architecture of the ViT-VQGAN with this repository? Hopefully it is trivial for you to understand what I am doing wrong and what the correct configuration would be. :) It would be of great help.

lucidrains / x-transformers

Implementing a small ViT-VQGAN #132