keras-team / keras-cv

Industry-strength Computer Vision workflows with Keras

Text guided image generation with stable diffusion #939

Open LukeWood opened 1 year ago

LukeWood commented 1 year ago

https://twitter.com/_akhaliq/status/1582175757153230849?s=21&t=rCRiXt4-XW41JIyx-jwc-Q

bhack commented 1 year ago

What is the scope of this? Are we waiting for a model release from Google Research?

LukeWood commented 1 year ago

Just opening this to triage; added a label for tracking.

tanzhenyu commented 1 year ago

We have completed text_to_image, and I think the next step should be img2img. Closing this request for now.

innat commented 1 year ago

@tanzhenyu (cc. @LukeWood )

We have completed text_to_image,

Actually, text-guided image generation can be categorized as a variant of text-to-image, but it's not the same thing. In text-guided image generation, there is also a sample input image.

See this for more details. 👉 https://github.com/huggingface/diffusers/issues/1254
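
To make the distinction concrete, here is a minimal sketch of the usual img2img idea: instead of starting from pure noise, the input image is partially noised and only the remaining denoising steps are run. The helper name `img2img_steps` and the `strength` parameter are assumptions for illustration and are not part of the KerasCV API.

```python
def img2img_steps(num_steps, strength):
    """Return the denoising step indices an img2img run would execute.

    Hypothetical helper, not a KerasCV function. Steps are indexed from
    the noisiest (0) to the cleanest (num_steps - 1). `strength` in
    [0, 1] controls how much noise is added to the input image:
    1.0 behaves like plain text-to-image, 0.0 leaves the input unchanged.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # Number of denoising steps that actually run.
    steps_to_run = min(int(num_steps * strength), num_steps)
    # Skip the earliest (noisiest) steps and start from a partially
    # noised version of the input image.
    start = num_steps - steps_to_run
    return list(range(start, num_steps))


# With strength 0.5, only the second half of a 50-step schedule runs.
print(len(img2img_steps(50, 0.5)))
```

A lower `strength` keeps the result closer to the input image, while a higher one gives the text prompt more influence.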

tanzhenyu commented 1 year ago

Oh I see, so that's the img2img I was referring to then. I am working on it, so re-opening this

innat commented 1 year ago

@tanzhenyu (cc. @miguelCalado) Here is another interesting variant of text-guided image-to-image. Posting it here in case you're interested in taking it on.

Paper: Null-Text Inversion for Editing Real Images (by Google)
Original code: in PyTorch (o_O)
TF 2 code: https://github.com/miguelCalado/prompt-to-prompt-tensorflow (uses keras-cv)

miguelCalado commented 1 year ago

Hi!

Thank you for referring to my implementation of the Prompt-to-Prompt paper @innat. I would be happy to do a PR of the code (after some refactoring) if you guys want 😊 It is a cool method and a useful tool to have in the arsenal when dealing with cross-attention injection, which seems kinda popular these days.

But since the discussion is around text-guided image generation, why don't you start by adding negative prompting? It seems to be useful, especially when dealing with SD 2.x.
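
For context, negative prompting is commonly implemented on top of classifier-free guidance: the noise prediction for the empty (unconditional) prompt is replaced by the prediction for the negative prompt. The sketch below shows only that arithmetic on plain Python lists; the function name `guided_noise` and its arguments are hypothetical, and the KerasCV implementation may differ.

```python
def guided_noise(cond, negative, scale=7.5):
    """Combine two noise predictions with classifier-free guidance.

    Hypothetical helper for illustration. `cond` is the prediction for
    the prompt and `negative` is the prediction for the negative prompt
    (it takes the place of the unconditional prediction). `scale` is the
    guidance scale; 1.0 returns `cond` unchanged.
    """
    if len(cond) != len(negative):
        raise ValueError("predictions must have the same length")
    # Move away from the negative prediction, towards the prompt's.
    return [n + scale * (c - n) for c, n in zip(cond, negative)]


# Toy two-element "noise predictions".
print(guided_noise([1.0, 2.0], [0.0, 1.0], scale=2.0))
```

With a scale above 1, the result is pushed further from whatever the negative prompt describes, which is why the technique helps suppress unwanted content.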

tanzhenyu commented 1 year ago

That would be great! Do you want to start with negative prompting, or should I? (I have been busy with the 0.4 release so this might take me 2 weeks)

miguelCalado commented 1 year ago

Sure! It would be my pleasure!

I opened an issue #1206 for further discussion.

innat commented 1 year ago

@miguelCalado Congrats on winning 1st place in the Keras community prize. 👍

tanzhenyu commented 1 year ago

Congrats @miguelCalado !! We really appreciate your work. If you have other ideas to improve our existing offering, please go ahead!

miguelCalado commented 1 year ago

Thanks everyone for the wishes! I'm still in shock and this whole thing hasn't really settled down :sweat_smile:

Yes, I'm looking forward to contributing some more. It would be cool to see these implemented in KerasCV: multi-prompting (adding weights to parts of the prompts), different schedulers, more versions of Stable Diffusion (e.g. 1.5), other research works (e.g. image variation or Imagic), etc. There is a lot of room for contributions :grin:

But one PR at a time! The Prompt-to-Prompt PR will take me a bit, as I'm still adding the final touches (support for multiple batches and other small things :slightly_smiling_face:).

Thanks everyone!

miguelCalado commented 1 year ago

Continuing the thread of text-guided image generation, this work also looks interesting: "it refines the cross-attention units to attend to all subject tokens in the text prompt and strengthen - or excite - their activations, encouraging the model to generate all subjects described in the text prompt." The video summarizes it pretty well.

It appears simple to implement (no training/finetuning), and it could work as a parameter on the text_to_image method (e.g. apply_excite_tokens=[2,5]). Might have a go after being done with the Prompt-to-prompt PR!
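
As a rough illustration of the idea, and assuming the work referred to is Attend-and-Excite, the objective can be sketched as follows: for each subject token, take its strongest cross-attention value, and penalize the most neglected token. The function name `excite_loss` and the toy attention layout are assumptions; the sketch omits the smoothing of attention maps and the latent update that would use this loss.

```python
def excite_loss(attention, subject_tokens):
    """Loss that is high when any subject token is weakly attended.

    Hypothetical helper for illustration. `attention[t]` is a list of
    cross-attention weights for token index `t`, one value per image
    patch. `subject_tokens` holds the token indices to excite,
    e.g. [2, 5] as in the suggested apply_excite_tokens argument.
    """
    # For each subject token, 1 - (its strongest attention value);
    # the loss is driven by the most neglected token.
    return max(1.0 - max(attention[t]) for t in subject_tokens)


# Toy attention maps for six tokens over two patches.
attention = [
    [0.1, 0.2], [0.0, 0.0], [0.9, 0.3],
    [0.0, 0.0], [0.0, 0.0], [0.2, 0.4],
]
print(excite_loss(attention, [2, 5]))
```

Here token 5 is the neglected one, so it determines the loss; minimizing it during denoising would encourage the model to attend to that subject.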

Elvenson commented 1 year ago

Hi guys, I'm not sure whether this issue is still active, but I found a helpful reference for the img2img implementation here. I also integrated this logic into my branch and tested it. I hope it helps.

jbischof commented 1 year ago

Thank you @Elvenson! This had been deprioritized but we'll take a look at your code.