haibo-qiu opened this issue 3 months ago (status: Open)
Some questions; looking forward to your kind reply
Hi, in fact, I've also noticed that the provided VQVAE has issues with reconstruction, which I believe is partly due to its patch size of 16 and codebook size of 4096 in VAR. This especially affects the reconstruction of small objects. The facial details in the examples you provided are too intricate for the VQVAE. Similar distortions also appear in the 256 version, including deformation and color shift in reconstructed details; see the 256 reconstruction examples:
I've also noticed the set of patch_nums defined by the original VAR at https://github.com/FoundationVision/VAR/blob/main/utils/arg_util.py#L246 (see https://github.com/FoundationVision/VAR/issues/54). When patch_nums is set to [1, 2, 3, 4, 6, 9, 13, 18, 24, 32], the reconstructions of the two figures above are still not satisfying:
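As a side note, a quick back-of-envelope sketch (plain Python, using the patch_nums above) of how many codebook indices each scale contributes, under the assumption, following VAR's multi-scale residual quantization, that scale k yields a k x k grid of tokens:

```python
# Assumption: in VAR's multi-scale VQVAE, each scale k in patch_nums
# contributes a k x k grid of codebook indices.
patch_nums = [1, 2, 3, 4, 6, 9, 13, 18, 24, 32]

tokens_per_scale = [k * k for k in patch_nums]
total_tokens = sum(tokens_per_scale)

print(tokens_per_scale)  # [1, 4, 9, 16, 36, 81, 169, 324, 576, 1024]
print(total_tokens)      # 2240
```

With a patch size of 16, the finest 32x32 scale corresponds to 512x512 pixels, yet it holds only 1024 of the 2240 total tokens (about 46%), which may be one reason fine details such as small faces are hard to recover.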
The training data used is publicly available and does not include any additional datasets. We are currently working on the VAE to get better results on small objects.
Do you have a Colab demo that shows the issue?
Hi @krennic999,
Thank you for your response!
If you do not use additional datasets, then back to my original question: how do you train the model to generate high-quality, detailed human face images, given the tokenizer's limitations in accurately reconstructing faces? Is it because your training data includes images where human faces occupy the majority of the frame, and such images are easier to reconstruct?
Hi, sorry for the late reply. Yes, I tend to think so: restoring the details of large objects is easier than restoring those of small objects. The details of the VQVAE are still being explored.
Errr... we do not currently have a Colab demo; you can refer to similar VQ-based VAEs to see their reconstruction results. Both the codebook size and the patch size affect the quality of image reconstruction, and for the multi-scale VQVAE in VAR, the choice of scales also seems to affect the quality of the reconstructed images.
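To make the codebook/patch-size point concrete, here is a rough information-budget calculation (my own back-of-envelope, not from the repo): at the finest scale, each 16x16 patch is summarized by a single index into a 4096-entry codebook, i.e. log2(4096) = 12 bits, versus 16 * 16 * 3 * 8 bits for the raw 8-bit RGB pixels:

```python
import math

patch_size = 16       # pixels per side of each patch (from the discussion above)
codebook_size = 4096  # entries in the VQVAE codebook (from the discussion above)

bits_per_token = math.log2(codebook_size)             # 12 bits per codebook index
raw_bits_per_patch = patch_size * patch_size * 3 * 8  # 6144 bits of raw 8-bit RGB
compression_ratio = raw_bits_per_patch / bits_per_token

print(bits_per_token)     # 12.0
print(compression_ratio)  # 512.0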
Hi @krennic999
Thank you for your excellent work!
I am very interested in the high-quality and detailed human face images you demonstrated in your paper, like the ones below:
In this issue, you mentioned that you used the VQVAE from VAR for 512x512 reconstruction and generation. However, when I use the same model to reconstruct 512x512 images, including human faces, I cannot achieve high-quality results: the face regions in particular appear distorted. Below are some examples, with the original images on the left and the reconstructions on the right:
I also have other examples with much worse distortion, but I did not include them here as they are somewhat uncomfortable to look at.
My question is: how do you train the model to generate high-quality, detailed human face images, considering the tokenizer's limitations in accurately reconstructing faces? Did you use any additional face data or specific techniques to address this issue?
Thanks in advance~