krennic999 / STAR

STAR: Scale-wise Text-to-image generation via Auto-Regressive representations
https://krennic999.github.io/STAR/

Training details of the VQVAE #2

Open daiyixiang666 opened 2 weeks ago

daiyixiang666 commented 2 weeks ago

Hello, could you share more about the training details of the VQVAE, for example the training dataset, batch size, and number of epochs? These are not clearly described in the paper. Besides, for the VQVAE training part, did you directly use the code from VAR or make other modifications? Thanks for your reply.

daiyixiang666 commented 2 weeks ago

Besides, have you tried adaLN-Zero? Thanks a lot.

krennic999 commented 2 weeks ago

Hello, we did not retrain or fine-tune the VQVAE from VAR; we directly used their provided weights. The patch-nums were set to [1, 2, 3, 4, 5, 6, 8, 10, 13, 16, 24, 32] for 512 generation (its reconstruction quality was acceptable), and nothing else was modified in the VQVAE part. However, we do think certain problems exist in their original VQVAE, and we would be grateful for any new insights about it.
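
For reference, here is what that patch-num schedule implies for the token budget at 512 resolution (a minimal arithmetic sketch; only the schedule itself comes from the reply above):

```python
# Illustrative only: token counts implied by the patch-num schedule quoted above.
# Each scale k contributes patch_nums[k]**2 tokens to the multi-scale sequence.
patch_nums = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16, 24, 32)  # 512-generation setting

tokens_per_scale = [p * p for p in patch_nums]
total_tokens = sum(tokens_per_scale)

print(tokens_per_scale)  # [1, 4, 9, 16, 25, 36, 64, 100, 169, 256, 576, 1024]
print(total_tokens)      # 2280
```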

krennic999 commented 2 weeks ago

No, we removed all adaLN from VAR and inject textual conditions via a start token and cross-attention in each transformer layer.
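
For readers wondering what that looks like concretely, a rough sketch of such a block is below (module and parameter names are my own, not the actual STAR code): plain LayerNorm instead of adaLN, self-attention over the image tokens, then cross-attention into the text-encoder features.

```python
import torch
import torch.nn as nn

class TextConditionedBlock(nn.Module):
    """Hypothetical transformer block: self-attention over image tokens,
    then cross-attention into text features (no adaLN modulation)."""
    def __init__(self, dim: int, n_heads: int, text_dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, kdim=text_dim,
                                                vdim=text_dim, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, text_emb, attn_mask=None):
        # Self-attention over the multi-scale token sequence.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        # Cross-attention: image tokens query the per-token text embeddings.
        h = self.norm2(x)
        x = x + self.cross_attn(h, text_emb, text_emb, need_weights=False)[0]
        # Feed-forward.
        x = x + self.mlp(self.norm3(x))
        return x
```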

daiyixiang666 commented 2 weeks ago

Did they release the weights for the 512 VQVAE? I can only find the 256 version.

krennic999 commented 2 weeks ago

We don't know how they trained at 512 resolution; however, their provided version adapts to 512 images as well. Our future plans may include re-training or fine-tuning the VQVAE.

daiyixiang666 commented 2 weeks ago

great, thanks!

kabachuha commented 1 week ago

Also, newer methods such as lookup-free quantization (from MagViT-2) or its upgraded binary spherical quantization may be worth investigating in the future, as they promise even more efficient compression and better FID scores.
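
For context, the core of lookup-free quantization is simple enough to sketch (illustrative only; the entropy and commitment losses from MagViT-2 are omitted):

```python
import torch

def lookup_free_quantize(z: torch.Tensor):
    """Minimal sketch of lookup-free quantization (LFQ): each latent channel
    is quantized independently to {-1, +1}, so a D-dim latent maps to one of
    2**D implicit codebook entries without any learned codebook lookup.
    z: (..., D) continuous latents."""
    q = torch.where(z > 0, 1.0, -1.0)       # per-channel sign quantization
    q = z + (q - z).detach()                # straight-through estimator
    bits = (z > 0).long()                   # binary code per channel
    weights = 2 ** torch.arange(z.shape[-1], device=z.device)
    indices = (bits * weights).sum(dim=-1)  # integer token index in [0, 2**D)
    return q, indices
```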

kabachuha commented 1 week ago

Also, the newer VQGAN-LC reports 99% codebook utilization when scaling the codebook to 100,000 entries, which may be quite useful for VAR: https://github.com/zh460045050/VQGAN-LC
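
The utilization they report is just the fraction of codebook entries that get used at least once; a minimal way to measure it on your own quantizer outputs (a sketch, not their evaluation code):

```python
import torch

def codebook_utilization(indices: torch.Tensor, codebook_size: int) -> float:
    """Fraction of codebook entries hit at least once by a batch/dataset
    of quantized indices."""
    return torch.unique(indices).numel() / codebook_size
```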

daiyixiang666 commented 1 week ago

Besides, I want to ask about the training details of the VAR: what are the total batch size and learning rate, and roughly how long did you train the model?

krennic999 commented 1 week ago

Hi, due to our current limited resources, we use a batch size of 216 with learning rate 5e-5 to train on 256x256 images, then fine-tune on 512 images with batch size 32 and learning rate 1e-6; each stage requires approximately one to two weeks of training. We plan to expand the training scale in an upcoming version to achieve more stable results.
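
For concreteness, the two-stage schedule above could be wired up roughly like this (a sketch; the optimizer choice is an assumption, only the batch sizes and learning rates come from the reply):

```python
import torch

# Hypothetical two-stage config; warmup, weight decay, etc. are not specified here.
stage_cfg = {
    "stage1_256": dict(resolution=256, batch_size=216, lr=5e-5),
    "stage2_512": dict(resolution=512, batch_size=32,  lr=1e-6),
}

def make_optimizer(model: torch.nn.Module, stage: str) -> torch.optim.Optimizer:
    # AdamW is assumed here; the actual optimizer is not stated in the thread.
    return torch.optim.AdamW(model.parameters(), lr=stage_cfg[stage]["lr"])
```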

daiyixiang666 commented 1 week ago

What about the cfg, top_k, and top_p settings used to generate such high-quality results?

krennic999 commented 1 week ago

cfg=4.0, top_k=(600 if si < 9 else 300), top_p=0.8 to replicate the results in the tables; we sometimes set more_smooth=True for more stable generation.
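
Put together, those settings correspond to something like the following sampling step (helper and argument names are mine, and the exact CFG convention used in the repo may differ from this common form):

```python
import torch

def sample_tokens(cond_logits, uncond_logits, si, cfg=4.0, top_p=0.8):
    """Sketch of the quoted settings: classifier-free guidance, then
    scale-dependent top-k and top-p filtering, then categorical sampling."""
    logits = uncond_logits + cfg * (cond_logits - uncond_logits)  # one common CFG form
    top_k = 600 if si < 9 else 300                                # per-scale top-k

    # Top-k: keep only the k highest logits.
    kth = torch.topk(logits, top_k, dim=-1).values[..., -1:]
    logits = logits.masked_fill(logits < kth, float("-inf"))

    # Top-p (nucleus): drop tokens beyond cumulative probability top_p.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    probs = sorted_logits.softmax(dim=-1)
    remove = probs.cumsum(dim=-1) - probs > top_p   # keep the token that crosses the threshold
    sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
    logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)

    return torch.distributions.Categorical(logits=logits).sample()
```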

huge123 commented 6 days ago

> No, we removed all adaLN from VAR and inject textual conditions via a start token and cross-attention in each transformer layer.

Thanks for your reply. But what is the start token, and how is it constructed?

krennic999 commented 4 days ago

Hi, the start token is defined as a 1x1xC token map at the initial scale, which is then used to generate higher resolution token maps. In our setup, the start token comes from the pooled features of CLIP, providing overall information about the text. This allows for the subsequent autoregressive generation of new token maps. In the original VAR setup, the start token comes from the ImageNet class embeddings for class-conditioned generation.
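
A minimal sketch of that construction (dimensions and module names are assumptions, not the actual STAR code):

```python
import torch
import torch.nn as nn

class StartTokenFromCLIP(nn.Module):
    """Hypothetical sketch: project the pooled CLIP text feature to the model
    width and use it as the single 1x1 token map at the first scale."""
    def __init__(self, clip_dim: int = 768, model_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(clip_dim, model_dim)

    def forward(self, pooled_text: torch.Tensor) -> torch.Tensor:
        # pooled_text: (B, clip_dim) pooled CLIP text embedding
        sos = self.proj(pooled_text)   # (B, model_dim)
        return sos.unsqueeze(1)        # (B, 1, model_dim): the 1x1xC start token map
```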