lucidrains / titok-pytorch

Implementation of TiTok, proposed by Bytedance in "An Image is Worth 32 Tokens for Reconstruction and Generation"

tokens/codebook size #2

Open kanttouchthis opened 3 weeks ago

kanttouchthis commented 3 weeks ago

The paper mentions a codebook size of 4096 for all models with 128/64/32 tokens for 256x256 and 128/64 tokens for 512x512. I was wondering why the example configuration in README.md and titok.py differs from the configurations mentioned in the paper as 32 tokens likely won't be enough for 512x512.

> We primarily investigate the following TiTok variants: TiTok-S-128 (i.e., small model with 128 tokens), TiTok-B-64 (i.e., base model with 64 tokens), and TiTok-L-32 (i.e., large model with 32 tokens), where each variant is designed to halve the latent space size while scaling up the model size. For resolution 512, we double the latent size to ensure more details are kept at higher resolution, leading to TiTok-L-64 and TiTok-B-128. In the final setting for TiTok training, the codebook is configured to N = 4096.
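
For reference, the configurations named in that passage can be summarized as plain data (a sketch based only on the quote above; the grouping and names are mine, not from the repo):

```python
# TiTok variants from the passage quoted above; codebook size N = 4096 throughout.
# Each entry is (model size, number of 1D latent tokens).
TITOK_VARIANTS = {
    256: [("S", 128), ("B", 64), ("L", 32)],  # TiTok-S-128, TiTok-B-64, TiTok-L-32
    512: [("B", 128), ("L", 64)],             # latent size doubled at 512x512
}

CODEBOOK_SIZE = 4096
```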

lucidrains commented 3 weeks ago

@kanttouchthis hello! yes, i just chose some numbers at random, specifically 32 because of the title of the paper, which is a reference in itself to the vision transformers paper

lucidrains commented 3 weeks ago

ah, just came across this proxy codes section

the results in this paper depend on an already well trained vqgan vae model..

edit: they claim this is different than distillation, but how?

yangdongchao commented 3 weeks ago

> ah, just came across this proxy codes section
>
> the results in this paper depend on an already well trained vqgan vae model..

“we propose to train 1D VQ models with the discrete codes generated by an off-the-shelf MaskGIT-VQGAN model, which we refer to as proxy codes” — Can you understand this trick? I cannot understand how to implement it, because MaskGIT-VQGAN will generate many more tokens than their proposed 32 tokens. If you can understand it, please share your idea.

lucidrains commented 3 weeks ago

@yangdongchao yes, this section is so confusing. i have no idea what they are doing

lucidrains commented 3 weeks ago

[screenshot: Screen Shot 2024-06-20 at 7 07 32 AM]
yangdongchao commented 3 weeks ago

> @yangdongchao yes, this section is so confusing. i have no idea what they are doing

Yes, I have written an email to ask the authors for help, but they have not replied.

lucidrains commented 3 weeks ago

yea, i'm going to deprioritize this work, not that excited anymore

inspirit commented 3 weeks ago

yeah, when i saw the way they trained it i could not understand how it is different from distillation + fine-tuning from a bigger model... disappointment

yucornetto commented 3 weeks ago

Hi @lucidrains @inspirit @yangdongchao ,

We thank you for your interest in our work, and we really appreciate your PyTorch reproduction. It just came to my attention that some people find the two-stage training part confusing, and we will definitely refine the writing of that part for clarification.

For a short clarification regarding the discussion in this issue:

What role does MaskGIT-VQGAN play in the two-stage training?

At the "warm-up" stage, we use MaskGIT-VQGAN's code as the reconstruction targets, with cross-entropy loss to supervise the outputs of TiTok's de-tokenizer. The training objective is similar to BEIT which learns to reconstruct a set of VQGAN codes. At the "decoder-finetuning" stage, we fine-tune the TiTok de-tokenizer along with MaskGIT-VQGAN's decoder towards raw pixels, along with typical VQGAN losses, including perceptual loss and gan loss.

Does TiTok require a pre-existing well-trained VQGAN/VAE to work?

As shown in Tab. 3 (c) of our paper (screenshot attached), TiTok trained from scratch in the Taming-VQGAN setting (thus single-stage, without any pre-existing models) can outperform its 2D counterpart, as well as Taming-VQGAN itself, while using far fewer tokens.

However, the performance still lags behind MaskGIT-VQGAN, which has a much stronger training recipe than Taming-VQGAN (the two have basically similar architectures, but MaskGIT-VQGAN reaches an rFID of 2.28 against Taming-VQGAN's 7.94). MaskGIT-VQGAN's training recipe remains a mystery, with no public code or implementation details in the MaskGIT paper. To compensate for the reconstruction gap against SOTA methods on ImageNet, we adopted two-stage training so that TiTok can benefit from MaskGIT-VQGAN even though we do not have access to its training recipe. This is also discussed in Section 4.3 (Ablation Studies) of the paper.

In short, as shown in the experiments, TiTok trains perfectly fine without a pre-existing VQGAN/VAE. If anyone is aware of a stronger public VQGAN training recipe, please let us know and we will be more than happy to try TiTok with it.

[screenshot: Tab. 3 (c) from the paper]
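
For context on the single-stage "Taming-VQGAN setting" mentioned above, here is a minimal sketch of what that style of training loss could look like (standard VQGAN-style terms; the loss weights and helper modules are placeholders, not the paper's exact recipe):

```python
import torch.nn.functional as F

def single_stage_loss(titok, lpips, discriminator, images,
                      perceptual_weight = 1.0, gan_weight = 0.1):
    # titok(images) is assumed to return the pixel reconstruction and the VQ commitment loss
    recon, vq_loss = titok(images)

    recon_loss = F.l1_loss(recon, images)            # pixel reconstruction loss
    perceptual_loss = lpips(recon, images).mean()    # perceptual (LPIPS) loss
    gan_loss = -discriminator(recon).mean()          # generator loss against a patch discriminator

    return recon_loss + vq_loss + perceptual_weight * perceptual_loss + gan_weight * gan_loss
```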

Finally, thank you all for your attention and interest. We just obtained open-source approval, so feel free to check out more details at https://github.com/bytedance/1d-tokenizer

yangdongchao commented 3 weeks ago

Thank you for your response. Based on your explanation, I think the first-stage training should be: we first use the TiTok encoder to encode the image into 32 learnable embeddings, then we use a VQ quantizer to quantize the 32 embedding vectors, then we concatenate several [MASK] tokens with the 32 quantized vectors and feed them into a transformer decoder. Lastly, we add some MLP layers to predict the proxy codes. In the second stage, we only update the decoder and add a pixel-level decoder (the MaskGIT-VQGAN decoder) to get the image.
I followed the first stage, but I find that the VQ training is not stable and collapses easily, which results in the 32 learnable embeddings all being quantized to the same token. I want to ask whether you have run into similar situations.
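
For reference, a minimal sketch of the stage-1 forward pass as described above (encoder, VQ over the 32 latent tokens, [MASK] tokens concatenated and decoded, an MLP head predicting proxy codes); every module, name, and shape here is an assumption for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

# all modules below are hypothetical stand-ins for the pipeline described above
class TiTokStage1(nn.Module):
    def __init__(self, encoder, quantizer, decoder, dim = 768, num_proxy = 256, vqgan_codebook_size = 1024):
        super().__init__()
        self.encoder = encoder        # ViT encoder: image patches + 32 learnable latent tokens -> (b, 32, dim)
        self.quantizer = quantizer    # vector quantizer over the 32 latent tokens, e.g. codebook size 4096
        self.decoder = decoder        # transformer decoder over the concatenated token sequence
        self.mask_tokens = nn.Parameter(torch.randn(num_proxy, dim))   # learnable [MASK] tokens, one per proxy-code position
        self.to_logits = nn.Linear(dim, vqgan_codebook_size)           # MLP head predicting MaskGIT-VQGAN code ids

    def forward(self, images):
        latents = self.encoder(images)                  # (b, 32, dim) latent tokens
        quantized, vq_loss = self.quantizer(latents)    # quantize the 32 latent tokens
        masks = self.mask_tokens.unsqueeze(0).expand(images.shape[0], -1, -1)
        tokens = torch.cat((masks, quantized), dim = 1)       # concat [MASK] tokens with quantized latents
        decoded = self.decoder(tokens)[:, :masks.shape[1]]    # keep only the [MASK] positions
        return self.to_logits(decoded), vq_loss         # (b, num_proxy, vqgan_codebook_size) logits + VQ aux loss
```

The returned logits would then be trained with cross-entropy against the frozen MaskGIT-VQGAN proxy codes, as in the warm-up sketch earlier in the thread.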