haoosz / ViCo

Official PyTorch codes for the paper: "ViCo: Detail-Preserving Visual Condition for Personalized Text-to-Image Generation"

How about the results with human images as inputs? #4

Open LaLaLailalai opened 1 year ago

LaLaLailalai commented 1 year ago

Hi @haoosz, thanks for your fantastic work! I'm curious about the results with human images as input. Could you show more results on human images?

haoosz commented 1 year ago

Thank you for your interest in our work!
I trained the model on images of Gal Gadot, and here are some results:

"A cyberpunk photo of S*" a-cyberpunk-photo-of-*

"A photo of S* sitting in a movie theater" a-photo-of-*-sitting-in-a-movie-theater

"A photo of S* sitting in the kitchen" a-photo-of-*-sitting-in-the-kitchen

I have also uploaded the training images and pretrained weights, so you can try them yourself.

haoosz commented 1 year ago

I have updated the training code. You can now train the model on your own human images. Thanks!

Landroval2 commented 1 year ago

Hello! I am trying to replicate some of the results shown here, but I am not getting good results. Is S* the special token for all the checkpoints provided? Thanks!

markrmiller commented 1 year ago

Not getting good results with a trained model either. Something seems off. The images generated during training look okay, but then those generated by vico_txt2img.py are not even close...

haoosz commented 1 year ago

That's weird. What are your results directly using the pretrained weights?

Put all the pretrained weights under logs/gal_gadot/checkpoints and the training images under images/gal_gadot.
Use the following command to test:

python scripts/vico_txt2img.py \
    --ddim_eta 0.0 \
    --n_samples 4 \
    --n_iter 2 \
    --scale 7.5 \
    --ddim_steps 50 \
    --ckpt_path models/ldm/stable-diffusion-v1/sd-v1-4.ckpt \
    --image_path images/gal_gadot/1.jpg \
    --ft_path logs/gal_gadot \
    --load_step 399 \
    --prompt "a cyberpunk photo of *" \
    --outdir outputs/gal_gadot

It should produce results similar to my run above. Please try it out and post your outputs here; I may be able to locate the problem based on them. Thank you.
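
If the script runs but the outputs look wrong, a quick sanity check is to confirm the files are where the command expects them. This is a minimal sketch; only the paths quoted from the command above are assumed:

```python
import os

# Paths taken from the test command in this thread.
ckpt_dir = "logs/gal_gadot/checkpoints"
image_path = "images/gal_gadot/1.jpg"
load_step = 399  # --load_step must match checkpoints that actually exist

assert os.path.isfile(image_path), "reference training image is missing"
print("checkpoint files matching the requested step:")
for name in sorted(os.listdir(ckpt_dir)):
    if str(load_step) in name:
        print("  ", name)
```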

haoosz commented 1 year ago

Hi @Landroval2,

The pretrained weight file embeddings_gs-STEP.pt is the trained S* embedding. It differs across steps (300, 350, 400), so make sure the image attention module weights and the S* embedding you load come from the same step.
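
For concreteness, here is a minimal sketch of loading matching-step weights. The embeddings_gs-STEP.pt name comes from this thread, while the attention-weight file name below is a guess for illustration:

```python
import torch

# Use ONE step for BOTH files; mixing steps between the S* embedding
# and the image attention weights is what breaks generation.
step = 400  # any of the released steps: 300, 350, 400
ckpt_dir = "logs/gal_gadot/checkpoints"

# embeddings_gs-<step>.pt holds the learned S* token embedding.
s_star = torch.load(f"{ckpt_dir}/embeddings_gs-{step}.pt", map_location="cpu")

# The image attention weights must come from the SAME step.
# NOTE: this file name is hypothetical; use whatever the matching-step
# attention checkpoint is called in your checkpoints directory.
image_attn = torch.load(f"{ckpt_dir}/image_attn_gs-{step}.pt", map_location="cpu")
```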

Landroval2 commented 1 year ago

Hi @haoosz, thanks for your answer! I have been testing this again and am now getting good images with the Gal Gadot model. However, the results with the Batman model are not great: there is almost no variability. Could you share some prompts/steps that you used in that case? Thanks again!

haoosz commented 1 year ago

The images of the Batman toy were casually self-collected, and the Batman results do show low variability with some prompts.

You can try the following prompts (I use the default time step = 400): [image]

Thanks!

Landroval2 commented 1 year ago

Thanks for your answer! I will be trying those prompts to see what happens.

okaris commented 1 year ago

@Landroval2 were you able to get better results? Can you share some insights please?

okaris commented 1 year ago

@haoosz I was finally able to get the same results with gal_gadot at inference. Could you share the training parameters and command for that particular run, please?

markrmiller commented 1 year ago

The issue I had was simply not using a proper identifier token. I think I was trying a 3-letter token. When I switched to the same type of token currently in the config, everything worked as expected. It almost felt like the prompt's influence is even weaker than with TI, but otherwise the results were stellar.

LiGe-In commented 11 months ago

Hi @haoosz, thanks for your great work! I got similar results using the images of Gal Gadot, but bad results on my own datasets. Is there anything I need to pay attention to when building a dataset? Are there any requirements?

haoosz commented 11 months ago

You can try adjusting the training step, the random seed, and the initial word. Besides that, the quality of the training data is also important. I have trained on my own images and got reasonable results.
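
For concreteness, a minimal sketch of sweeping those knobs; the seed values and candidate initial words are illustrative assumptions, not repo defaults:

```python
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Fix all RNGs so runs with different settings are comparable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# Sweep the knobs mentioned above and compare the results.
for seed in (23, 42, 2023):
    for init_word in ("woman", "face", "person"):  # for a human subject
        set_seed(seed)
        # ... launch a training run here with this seed and initial word,
        # then compare the checkpoints saved at steps 300 / 350 / 400.
```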