gwang-kim / DiffusionCLIP

[CVPR 2022] Official PyTorch Implementation for DiffusionCLIP: Text-guided Image Manipulation Using Diffusion Models

Text editing in non-isolated images #12

Closed tsaxena closed 2 years ago

tsaxena commented 2 years ago

Hi, thanks for your work. I am trying the pretrained models on a few test images to see what the results look like. I tried tennis_baseball_t500.pth: it works well when the tennis ball is well isolated, but not so well when the object is part of a scene. The paper says fine-tuning needs 30 or so images. Were those images well isolated? If I replace them with images where the tennis ball is a small part of the image, will the performance improve?

gwang-kim commented 2 years ago

Hi @tsaxena, thanks for your interest. Yes, I think that if we fine-tune the model with more images, including both isolated images and images where the tennis ball is only a small part of the scene, it can generalize better to the cases you mentioned. In my opinion, though, CLIP image encoders have limited localizing ability, so the performance will also be limited.
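
For anyone who wants to try the mixed fine-tuning set suggested above, here is a minimal sketch of one way to assemble it. It is not part of the DiffusionCLIP codebase: the function name `build_mixed_finetune_set`, the `scene_fraction` parameter, and the file names are all hypothetical, and the default of 30 images only mirrors the "30 or so" figure quoted in the question. The resulting list of paths would still have to be fed to whatever fine-tuning script you use.

```python
import random
from typing import List, Sequence


def build_mixed_finetune_set(
    isolated: Sequence[str],
    in_scene: Sequence[str],
    total: int = 30,              # assumption: "30 or so" images, as quoted in the question
    scene_fraction: float = 0.5,  # assumption: share of images where the object is small
    seed: int = 0,
) -> List[str]:
    """Pick a reproducible mix of isolated and in-scene image paths."""
    if not 0.0 <= scene_fraction <= 1.0:
        raise ValueError("scene_fraction must be between 0 and 1")

    rng = random.Random(seed)

    # Take as many in-scene images as requested, capped by what is available.
    n_scene = min(round(total * scene_fraction), len(in_scene))
    # Fill the remainder with isolated images, again capped by availability.
    n_isolated = min(total - n_scene, len(isolated))

    chosen = rng.sample(list(in_scene), n_scene) + rng.sample(list(isolated), n_isolated)
    rng.shuffle(chosen)  # avoid presenting all in-scene images first
    return chosen


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    isolated_paths = [f"isolated_{i}.png" for i in range(40)]
    scene_paths = [f"scene_{i}.png" for i in range(40)]
    for path in build_mixed_finetune_set(isolated_paths, scene_paths):
        print(path)
```

Varying `scene_fraction` gives a simple way to check how much the in-scene images help. Note that this only changes the training data; it does not address the CLIP localization limit mentioned above.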