lucidrains / DALLE-pytorch

Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch

VQGanVAE1024 "vae must be an instance of DiscreteVAE" #87

Closed · afiaka87 closed this issue 3 years ago

afiaka87 commented 3 years ago

@lucidrains I believe the relevant line is here:

https://github.com/lucidrains/DALLE-pytorch/blob/2268864941d8eef2ba73a4488fe05673d447d493/dalle_pytorch/dalle_pytorch.py#L306

I tried adding it in myself, but it needs the taming imports and I'm not familiar with those.
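For reference, a rough sketch of the kind of change I was attempting (not a working fix; the helper name is just for illustration). It assumes `VQGanVAE1024` is importable from `dalle_pytorch`, as in the training scripts, and that taming-transformers is installed:

```python
# Hypothetical sketch only: widen the check so a VQGanVAE1024 is accepted too.
# Assumes VQGanVAE1024 is exported from dalle_pytorch and that
# taming-transformers is installed so its import works.
from dalle_pytorch import DiscreteVAE, VQGanVAE1024

def check_vae(vae):
    # stands in for the assert inside DALLE.__init__ at the line linked above
    assert isinstance(vae, (DiscreteVAE, VQGanVAE1024)), \
        'vae must be an instance of DiscreteVAE or VQGanVAE1024'
```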

afiaka87 commented 3 years ago

Yep, it's following that first bad run almost verbatim.

lucidrains commented 3 years ago

@afiaka87 so interestingly enough, OpenAI's paper said that one conv like attention layer is necessary at the very last layer of DALL-E, but for now, you can just remove it altogether
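If it helps, a hedged sketch of what removing it could look like, assuming the `attn_types` keyword documented in the README; leaving `'conv_like'` out of the tuple skips that layer (the other hyperparameters below are placeholder values):

```python
# Sketch under assumptions: attn_types without 'conv_like' skips the
# convolution-like attention layer. Other values are placeholders.
from dalle_pytorch import DiscreteVAE, DALLE

vae = DiscreteVAE(
    image_size = 256,
    num_layers = 3,
    num_tokens = 8192,
    codebook_dim = 512,
    hidden_dim = 64
)

dalle = DALLE(
    dim = 512,
    vae = vae,
    num_text_tokens = 10000,
    text_seq_len = 256,
    depth = 8,
    heads = 8,
    attn_types = ('full', 'axial_row', 'axial_col')   # 'conv_like' left out
)
```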

afiaka87 commented 3 years ago

cool. will do. keep me posted on a fix.

lucidrains commented 3 years ago

@afiaka87 yup, that pattern is a leak in the masking, where the past attends to the future by accident. it comes up again and again in autoregressive transformer training
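For anyone following along, a minimal standalone illustration (not the repo's code) of the causal mask that prevents this kind of leak, where position i may only attend to positions at or before i:

```python
# Toy example of a causal (autoregressive) attention mask.
import torch

seq_len = 6
# True above the diagonal marks the "future" positions to be blocked
future_mask = torch.ones(seq_len, seq_len, dtype = torch.bool).triu(1)

scores = torch.randn(seq_len, seq_len)                   # stand-in attention logits
scores = scores.masked_fill(future_mask, float('-inf'))  # future positions removed
attn = scores.softmax(dim = -1)                          # each row mixes only past + current tokens
```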

afiaka87 commented 3 years ago

til then, i'm gonna train the hell out of this thing.

lucidrains commented 3 years ago

will do!

TheodoreGalanos commented 3 years ago

Is there a suggested combination of attention we should use?

Also, slightly off-topic, but does anyone have a good intuition about the values? I don't totally get the dim size for DALLE in relation to performance / compute. Does it have to be the size of the feature map?

lucidrains commented 3 years ago

@TheodoreGalanos In my opinion, you can't go wrong with full attention everywhere if you can afford it, especially when the image sequence length is short. For dimensions, just keep it at 1024 if you can, 512 on a budget. Heads at 8 minimum, more if possible.

Nope, it's completely different from the feature map dimensions of the VAE, if that is what you are referring to.
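To make the distinction concrete, a rough back-of-the-envelope sketch (assuming the DiscreteVAE downsamples by 2x per layer): the image token sequence length comes from the VAE's feature map, while `dim` and `heads` are chosen independently:

```python
# Assumed relationship: sequence length is set by the VAE's downsampling,
# while the transformer width (dim) and heads are independent choices.
image_size = 256
num_layers = 3                                # DiscreteVAE halves resolution per layer
fmap_size = image_size // (2 ** num_layers)   # 32 x 32 feature map
image_seq_len = fmap_size ** 2                # 1024 image tokens fed to the transformer

dim = 1024    # 1024 if you can, 512 on a budget
heads = 8     # 8 minimum, more if possible

print(image_seq_len, dim, heads)              # 1024 1024 8
```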

TheodoreGalanos commented 3 years ago

Thank you @lucidrains !

afiaka87 commented 3 years ago

@lucidrains so full is generally going to be better, but more compute-heavy?

TheodoreGalanos commented 3 years ago

@afiaka87 One last aside. Since you've experimented with the VQGAN, is it amenable to transfer learning? I don't necessarily want to train one today (I have a VQVAE already) but would fine-tune one if it were possible.

afiaka87 commented 3 years ago

> @afiaka87 One last aside. Since you've experimented with the VQGAN, is it amenable to transfer learning? I don't necessarily want to train one today (I have a VQVAE already) but would fine-tune one if it were possible.

I just ran it for the first time like 3 hours ago ha. I've actually not successfully done transfer learning on any of these options. I just start over each time trying to make it better from scratch. No clue, unfortunately.

TheodoreGalanos commented 3 years ago

Oh okay! Thanks, I'll try to train one then. I'm not entirely sure how easy it would be; I'll take a look.

afiaka87 commented 3 years ago

If you were late to the party or (like me) need a refresher on what happened here, here is a graph from the live session we were all viewing. The relevant runs are runs 7 through 10.

[image: graphs]