Open Epiphqny opened 1 year ago
@Epiphqny wow Yuqing! those results do not look half bad! i'll have to think about your results a bit more. so this work builds upon the cvivit from the phenaki paper. in that paper, i believe they encode the first frame separately from the rest (to allow for single image pretraining). however, in this work, they decide to just pad on the left and use the same encoding for the first frame vs the rest. perhaps i can add the cvivit way for the sake of comparing the two
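To make the "pad on the left" idea concrete, here is a minimal sketch in plain Python. It is not the repository's code: each frame is reduced to a single number, `causal_conv_time` is a hypothetical helper name, and zero padding is an assumption (a real layer would pad a 5D tensor along the time axis only).

```python
def causal_conv_time(frames, kernel):
    """Causal 1D convolution over time.

    frames: list of floats, one value per frame (toy stand-in for a video).
    kernel: list of weights; the last weight multiplies the current frame.
    """
    k = len(kernel)
    # Pad only on the left (the past) with k - 1 zeros, so the output at
    # time t depends on frames t - k + 1 .. t and never on future frames.
    padded = [0.0] * (k - 1) + list(frames)
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(frames))
    ]


clip = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.25, 0.25]
out = causal_conv_time(clip, kernel)
print(out)
```

Note that the first output sees only the first frame plus padding, so the first frame goes through the same kernel as every other frame, just with less real context. That is the difference from encoding the first frame separately.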
@Epiphqny once i circle back to this, also want to craft out a few more specialized discriminators (fourier domain as well as temporal)
@Epiphqny did you use LFQ or FSQ btw? could you share your hyperparameters?
@Epiphqny added it here if you want to run some experiments
Hi @lucidrains, thanks for your prompt response! Actually, I didn't use LFQ or FSQ. Instead, I used the quantization from CVQ-VAE https://github.com/lyndonzheng/CVQ-VAE and extended the 2D conv to a 3D causal conv like magvit2. For the training parameters, I followed the setup used in VQGAN and initialized the weights from a CVQ-VAE model pretrained on image data. I will train with the updated first-frame code, and I'm looking forward to the updated discriminator!
@Epiphqny ohh i see! i didn't know you only used the causal conv
i'm not sure what the issue is then
@lucidrains Thanks for your response! I will try more modules from this implementation and update the results later.
Hi @Epiphqny, is there any progress on improving the results?
Hi @lucidrains, thanks for your awesome work! I used your causal conv implementation to train a video VQGAN network. The results are as follows: Original clip sequence: [image] Reconstructed clip sequence: [image] I've noticed that the reconstruction seems to rely heavily on the initial frame. As the sequence progresses, the images lose clarity, and each subsequent frame is blurrier than the one before. Could you provide any insights into this phenomenon? Thank you for your time and assistance!
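One way to put a number on the frame-by-frame degradation described above is to compute PSNR per frame and check whether it drops with the frame index. The sketch below is only illustrative: `per_frame_psnr` is a hypothetical helper, frames are assumed to be flat lists of pixel values in [0, 1], and the sample values are made up for the demo, not taken from the runs in this thread.

```python
import math


def per_frame_psnr(original, reconstructed, max_value=1.0):
    """Return one PSNR value (in dB) per frame.

    original, reconstructed: lists of frames, each frame a flat list of
    pixel values in [0, max_value]. Both clips must have the same shape.
    """
    scores = []
    for orig, recon in zip(original, reconstructed):
        # Mean squared error between the two frames.
        mse = sum((a - b) ** 2 for a, b in zip(orig, recon)) / len(orig)
        if mse == 0:
            scores.append(float("inf"))  # identical frames
        else:
            scores.append(10 * math.log10(max_value ** 2 / mse))
    return scores


# Toy clip: the reconstruction drifts further from the original each frame.
original = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
reconstructed = [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
for index, score in enumerate(per_frame_psnr(original, reconstructed)):
    print(index, score)
```

If PSNR falls steadily from the first frame onward, that supports the observation that later frames are reconstructed worse, and it gives a metric to compare against after changing the first-frame encoding or the discriminators.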