Hello!

I am trying to implement a ViT-VQGAN architecture (publication) for a project, and from my understanding I can use this repository to do so. However, I am having a bit of trouble understanding all the parameters available in the repo and how to match them with what I have in the paper.
On page 5 of the publication, the authors list the parameters of the ViT-Small encoder network that I am trying to recreate, which are the following:
blocks: 8
heads: 8
model dimension: 512
hidden dimension: 2048
dropout: 0
tokens: 1024
I am mainly having trouble with the hidden dimension parameter (probably I am doing something wrong due to my lack of experience with this repo).
Currently, my attempt at defining the encoder part returns the following error:

TypeError: FeedForward.__init__() got multiple values for argument 'dim'
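My guess (I may well be misreading the code) is that the problem is how I pass the hidden dimension: keywords with an ff_ prefix seem to get forwarded into FeedForward, which already receives dim positionally, so something along these lines would collide on dim:

```python
from x_transformers import Encoder

# Illustration of (what I assume is) the wrong way to pass the hidden dimension:
# the ff_ prefix gets stripped and 2048 is forwarded to FeedForward as `dim`,
# clashing with the positional `dim` and raising the TypeError above.
encoder = Encoder(
    dim = 512,      # model dimension
    depth = 8,      # blocks
    heads = 8,
    ff_dim = 2048   # hidden dimension -- hypothetical culprit
)
```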
I also tried using the TransformerWrapper and XTransformer classes instead of Encoder, but unsurprisingly I got errors there as well.
I also took a look at the ViT-VQGAN defined in your parti-pytorch repo, but if I understand correctly there are a few differences from the original ViT-VQGAN (correct me if I am wrong), and I was not able to make it work as I intended (again, I probably messed up the hyperparameters).
Could you kindly guide me on how to define the ViT encoder architecture of the ViT-VQGAN with this repository?
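For example, is something along these lines the intended way to express the paper's numbers, with the hidden dimension given indirectly as ff_mult = 2048 / 512 = 4? (The mapping in the comments is just my guess.)

```python
from x_transformers import Encoder

# My guessed mapping from the ViT-Small column of the paper to this repo:
#   blocks 8              -> depth
#   heads 8               -> heads
#   model dimension 512   -> dim
#   hidden dimension 2048 -> ff_mult = 4, since 4 * 512 = 2048
#   dropout 0             -> attn_dropout / ff_dropout
# I have left out "tokens: 1024", since I am not sure it maps onto an
# Encoder argument at all.
encoder = Encoder(
    dim = 512,
    depth = 8,
    heads = 8,
    ff_mult = 4,
    attn_dropout = 0.,
    ff_dropout = 0.
)
```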
Hopefully it is trivial for you to understand what I am doing wrong and what the correct configuration would be. :)
It would be of great help.