Open kanttouchthis opened 3 weeks ago
@kanttouchthis hello! yes, i just chose some numbers at random, specifically 32 because of the title of the paper, which is a reference in itself to the vision transformers paper
ah, just came across this "proxy codes" section
the results in this paper depend on an already well-trained VQGAN VAE model..
edit: they claim this is different from distillation, but how?
“we propose to train 1D VQ models with the discrete codes generated by an off-the-shelf MaskGIT-VQGAN model, which we refer to as proxy codes” Can you understand this trick? I cannot understand how to implement it, because MaskGIT-VQGAN will generate more tokens than their proposed 32 tokens. If you can understand it, please share your idea.
@yangdongchao yes, this section is so confusing. i have no idea what they are doing
Yes, I have written an email to ask the authors for help, but they do not reply.
yea, i'm going to deprioritize this work, not that excited anymore
yeah, when i saw the way they trained it i could not understand how it is different from distillation + fine-tuning from a bigger model... disappointment
Hi @lucidrains @inspirit @yangdongchao ,
We thank you for your interest in our work, and really appreciate the PyTorch reproduction. It just came to my awareness that some people find the two-stage training part confusing, and we will definitely refine the writing of that part for clarification.
For a short clarification regarding the discussion in this issue:
What role does MaskGIT-VQGAN play in the two-stage training?
At the "warm-up" stage, we use MaskGIT-VQGAN's code as the reconstruction targets, with cross-entropy loss to supervise the outputs of TiTok's de-tokenizer. The training objective is similar to BEIT which learns to reconstruct a set of VQGAN codes. At the "decoder-finetuning" stage, we fine-tune the TiTok de-tokenizer along with MaskGIT-VQGAN's decoder towards raw pixels, along with typical VQGAN losses, including perceptual loss and gan loss.
Does TiTok require a pre-existing well-trained VQGAN/VAE to work?
As shown in our paper Tab. 3 (c) (screenshot attached), TiTok when trained using the Taming-VQGAN setting from scratch (thus single-stage w/o pre-existing models) can outperform its 2D counterpart, or Taming-VQGAN itself while using much fewer tokens.
However, the performance still lags behind MaskGIT-VQGAN, which has a much stronger training recipe than Taming-VQGAN (they have basically the same architecture, but MaskGIT-VQGAN reaches an rFID of 2.28 against Taming-VQGAN's 7.94). Unfortunately, MaskGIT-VQGAN's training recipe remains a mystery, with no public code or implementation details in the MaskGIT paper. To compensate for the reconstruction gap compared to SOTA methods on ImageNet, we adopted two-stage training so that TiTok can benefit from MaskGIT-VQGAN, given that we do not have access to its training recipe. This is also discussed in Sec. 4.3 Ablation Studies of the paper.
In short, as shown in the experiments, TiTok trains totally fine w/o a pre-existing VQGAN/VAE. If anyone is aware of a stronger public VQGAN training recipe, please let us know and we would be more than happy to try TiTok with it.
Finally, thank you all for your attention and interest. We just obtained open-source approval, and you are welcome to check out more details at https://github.com/bytedance/1d-tokenizer
Thank you for your response. Based on your explanation, I think the first-stage training should be: we first use the TiTok encoder to encode the information into 32 learnable embeddings, then we use a VQ quantizer to quantize the 32 embedding vectors, then we concatenate several [MASK] tokens with the 32 quantized vectors and feed them into a transformer decoder. Lastly, we add some MLP layers to make predictions for the proxy codes. Then the second stage: we only update the decoder and add a pixel-level decoder (the MaskGIT-VQGAN decoder) to get the image.
I followed the first stage, but I find the VQ training is not stable; it easily collapses, which results in the 32 learnable embeddings being quantized to the same token. I want to ask whether you have met similar situations.
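One simple way to diagnose the collapse described above is to track the perplexity of codebook usage per batch: a perplexity near 1 means (almost) all 32 tokens quantize to the same codebook entry, while a healthy bottleneck should stay well above that. This is a generic VQ-VAE debugging metric, not something from the TiTok paper; the function name is made up for illustration.

```python
import math
from collections import Counter

def codebook_perplexity(indices):
    """exp(entropy) of the empirical code-usage distribution.

    `indices` is a flat list of codebook indices chosen by the quantizer
    for one batch; a return value near 1.0 signals codebook collapse.
    """
    counts = Counter(indices)
    total = len(indices)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)

# hypothetical batches of 32 token indices each:
collapsed = [7] * 32           # every token quantized to entry 7 -> perplexity 1.0
healthy = list(range(32))      # every token uses a distinct entry -> perplexity 32.0
```

Logging this during stage-1 training makes it easy to see when collapse starts; common mitigations in the VQ-VAE literature include EMA codebook updates, codebook-entry resets for dead codes, and a lower commitment-loss weight.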
The paper mentions a codebook size of 4096 for all models, with 128/64/32 tokens for 256x256 and 128/64 tokens for 512x512. I was wondering why the example configuration in README.md and titok.py differs from the configurations mentioned in the paper, as 32 tokens likely won't be enough for 512x512.
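For reference, the settings reported in the paper (as quoted above) can be written out as a hypothetical config dict to compare against whatever README.md and titok.py ship with; the key names here are made up for illustration:

```python
# settings as stated in the paper: codebook size 4096 for all models,
# 128/64/32 latent tokens at 256x256, and 128/64 latent tokens at 512x512
PAPER_CONFIGS = {
    (256, 256): {"codebook_size": 4096, "num_latent_tokens": [128, 64, 32]},
    (512, 512): {"codebook_size": 4096, "num_latent_tokens": [128, 64]},
}
```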