krennic999 / STAR

STAR: Scale-wise Text-to-image generation via Auto-Regressive representations
https://krennic999.github.io/STAR/

How to generate high-quality and detailed human face images #3

Open haibo-qiu opened 3 months ago

haibo-qiu commented 3 months ago

Hi @krennic999

Thank you for your excellent work!

I am very interested in the high-quality and detailed human face images you demonstrated in your paper, like the ones below:

[example face images from the paper]

In this issue, you mentioned that you used the VQVAE from VAR for 512x512 reconstruction and generation. However, when I attempt to use the same model to reconstruct 512x512 images, including human faces, I cannot achieve high-quality results: the face areas, in particular, appear distorted. Below are some examples, with the original images on the left and the reconstructed images on the right:

[Example 1: original image (left) vs. reconstruction (right)]

[Example 2: original image (left) vs. reconstruction (right)]

I also have other examples with much worse distortion, but I did not include them here as they are somewhat uncomfortable to look at.
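For reference, this is roughly the reconstruction pipeline I used. It is only a minimal sketch: the builder and method names (`build_vae_var`, `img_to_idxBl`, `idxBl_to_img`) and the checkpoint filename follow my reading of the public VAR repo and may not match the actual API exactly.

```python
# Minimal 512x512 reconstruction sketch with the released VAR VQVAE.
# NOTE: build_vae_var / img_to_idxBl / idxBl_to_img and the checkpoint name
# are assumptions based on the public FoundationVision/VAR code; adapt as needed.
import torch
from PIL import Image
from torchvision import transforms
from torchvision.utils import save_image

from models import build_vae_var  # FoundationVision/VAR repo

device = 'cuda'
patch_nums = (1, 2, 3, 4, 6, 9, 13, 18, 24, 32)  # one candidate 512-level schedule

# codebook size 4096, latent channels 32, as in the released checkpoint
vae, _ = build_vae_var(
    V=4096, Cvae=32, ch=160, share_quant_resi=4,
    device=device, patch_nums=patch_nums,
    num_classes=1000, depth=16,
)
vae.load_state_dict(torch.load('vae_ch160v4096z32.pth', map_location='cpu'), strict=True)
vae.eval()

tf = transforms.Compose([
    transforms.Resize(512), transforms.CenterCrop(512),
    transforms.ToTensor(), transforms.Normalize([0.5] * 3, [0.5] * 3),  # map to [-1, 1]
])
img = tf(Image.open('face.jpg').convert('RGB')).unsqueeze(0).to(device)

with torch.no_grad():
    idx_Bl = vae.img_to_idxBl(img)                                      # multi-scale token indices
    recon = vae.idxBl_to_img(idx_Bl, same_shape=True, last_one=True)    # decode finest reconstruction

# side-by-side comparison like the figures above
save_image(torch.cat([img, recon], dim=-1) * 0.5 + 0.5, 'recon_compare.png')
```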

My question is, how do you train the model to generate high-quality and detailed human face images considering the tokenizer's limitations in reconstructing faces accurately? Did you use any additional face data or any specific techniques to address this issue?

Thanks in advance~

daiyixiang666 commented 3 months ago

Same question, looking forward to your kind reply.

krennic999 commented 3 months ago

Hi, in fact, I've also noticed that the provided VQVAE has issues with reconstruction, which I believe is partly due to its patch size of 16 and codebook size of 4096 in VAR. This especially affects the reconstruction of small objects. The facial details in the examples you provided are too intricate for the VQVAE. Similar distortion also appears in the 256 version, including deformation and color shift when reconstructing fine details; see the 256 recon examples:

[cropped_img2 vs. img2_recon_latent16]
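To make the capacity issue concrete, here is a rough back-of-the-envelope calculation (just arithmetic on the numbers above, not tied to any particular checkpoint):

```python
# Capacity estimate for a VQ tokenizer with patch size 16 and codebook size 4096.
import math

image_size = 512
patch_size = 16
codebook_size = 4096

tokens_per_side = image_size // patch_size     # 32
n_tokens = tokens_per_side ** 2                # 1024 tokens on the finest grid
bits_per_token = math.log2(codebook_size)      # 12 bits
bits_per_pixel = n_tokens * bits_per_token / image_size ** 2

print(f'{n_tokens} tokens, {bits_per_token:.0f} bits/token, {bits_per_pixel:.3f} bits/pixel')
# -> 1024 tokens, 12 bits/token, 0.047 bits/pixel
```

A 24-bit RGB pixel is squeezed into roughly 0.05 bits on average, so small, detail-heavy regions such as distant faces are the first to lose fidelity.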

krennic999 commented 3 months ago

And I've also noticed a set of patch_nums defined by the original VAR at https://github.com/FoundationVision/VAR/blob/main/utils/arg_util.py#L246 (see https://github.com/FoundationVision/VAR/issues/54). When patch_nums is set to [1, 2, 3, 4, 6, 9, 13, 18, 24, 32], the reconstructions of the above two figures are still not satisfying:

[cropped_img1 vs. img1_recon_latent32]

[cropped_img2 vs. img2_recon_latent32]
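For reference, the token budget of that schedule (a quick sketch; the key point is that the finest scale is still only a 32x32 grid, so the spatial bottleneck does not change):

```python
# Per-scale token counts for the 512-level patch_nums schedule discussed above.
patch_nums = [1, 2, 3, 4, 6, 9, 13, 18, 24, 32]

per_scale = [p * p for p in patch_nums]  # tokens emitted at each scale
print(per_scale)       # [1, 4, 9, 16, 36, 81, 169, 324, 576, 1024]
print(sum(per_scale))  # 2240 tokens in total, but the finest grid is still 32x32
```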

The training data we used is publicly available and does not include any additional dataset; we are currently working on the VAE for better results on small objects.

s9anus98a commented 3 months ago

Do you have a Colab demo to show the issue?

haibo-qiu commented 3 months ago

Hi @krennic999,

Thank you for your response!

If you did not use additional datasets, then back to my original question: how did you train the model to generate high-quality and detailed human face images, given the tokenizer's limitations in accurately reconstructing faces? Is it because your training data includes images where human faces occupy the majority of the area, and this kind of image is easier to reconstruct?

krennic999 commented 2 months ago

Hi, sorry for the late reply. Yes, I tend to think so: restoring the details of large objects is easier than restoring those of small objects. The details of the VQVAE are still being explored.

krennic999 commented 2 months ago

Errr... currently we do not have a Colab demo; you can refer to some similar VQ-based VAEs to see their reconstruction results. The codebook size and patch size both affect the quality of image reconstruction. And for the multi-scale VQVAE in VAR, it seems that the selection of scales also affects the quality of the reconstructed images.
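If anyone wants to compare scale settings quantitatively, a quick PSNR check between the input and its reconstruction is a reasonable start (a minimal sketch in plain PyTorch, nothing VAR-specific; a perceptual metric such as LPIPS would better reflect face distortion):

```python
import torch

def psnr(x: torch.Tensor, y: torch.Tensor) -> float:
    """PSNR between two image batches normalized to [-1, 1]."""
    mse = torch.mean((x - y) ** 2).clamp_min(1e-12)
    return float(10 * torch.log10(4.0 / mse))  # peak-to-peak range is 2, so MAX^2 = 4

# e.g. reconstruct the same face with two patch_nums settings and compare:
# psnr(img, recon_latent16) vs. psnr(img, recon_latent32)
```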