keras-team / keras-cv

Industry-strength Computer Vision workflows with Keras

Support Image-to-Prompt #1493

Closed innat closed 1 year ago

innat commented 1 year ago

Short Description

Similar to an image-captioning / retrieval model, with operations analogous to text-to-image and image-to-image.

Papers

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation


Existing Implementations

Motivation

Other Information

(If this ticket doesn't fit in the issues section, please move it to Discussions.)

NiharJani2002 commented 1 year ago

Is this issue solved, @innat?
If not, could this issue be assigned to me to work on?

innat commented 1 year ago

@NiharJani2002 Only the Keras team can assign issues. Please wait to hear back from them on whether it's OK to take this one.

ccing. @jbischof @ianstenbit

ianstenbit commented 1 year ago

Is the scope of this issue to add an image_to_text workflow to StableDiffusion?

If so, that sounds good to me. It's probably best to start with an example notebook; from there we can evaluate either including it in the API or publishing the example on keras.io.
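For context, the text-to-image direction already ships in KerasCV; a minimal sketch is below, with the inverse call shown only as a hypothetical placeholder for the workflow this issue asks about (it is not a real KerasCV API):

```python
import keras_cv

# Existing KerasCV workflow: text -> image.
model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)
images = model.text_to_image(
    "photograph of an astronaut riding a horse", batch_size=1
)

# Hypothetical inverse workflow discussed in this issue (does not exist yet):
# prompt = model.image_to_text(images[0])
```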

innat commented 1 year ago

@ianstenbit

Is the scope of this issue to add an image_to_text workflow to StableDiffusion?

Not entirely (AFAIK); it's closer to image captioning. The first post mentions the BLIP model; please take a look at that.

Though image captioning isn't listed in the current roadmap, this domain is closely tied to the current hot topic (Stable Diffusion). As an example use case: I have an image dataset of around 10k images, and I'm using BLIP-2 to generate approximate prompts to build the training pairs.
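For illustration, a minimal sketch of that prompt-generation step, assuming the Hugging Face `transformers` BLIP-2 checkpoints (no Keras port exists yet; the image path and generation settings are placeholders):

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load a public BLIP-2 checkpoint (Hugging Face, not Keras).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

# Hypothetical image from the 10k dataset mentioned above.
image = Image.open("dataset/image_0001.jpg")
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)

# Generate an approximate prompt (caption) to pair with the image.
generated_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(caption)
```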

ianstenbit commented 1 year ago

Hey @innat -- sorry I totally misunderstood. Was looking too fast :-)

This looks super interesting, and if someone in the community is interested in porting it to Keras, we can certainly look into making it a KerasCV offering. At this time, though, we can't add this to the KerasCV team's near-term roadmap, so it would have to be a community-driven effort.

Probably for something of this scale, the right approach would be to create a separate repo with BLIP components that depend on KerasCV where possible, and once it's up and running we can try to integrate it into our API.

innat commented 1 year ago

@ianstenbit Thanks for the response; that's a valid suggestion. I'm working on image2prompt on Kaggle in my spare time, and I look forward to translating BLIP to Keras.

I have one query (probably an old one, so apologies if it's already been discussed and decided): if a model consists of roughly equal proportions of CV and NLP components, where should it live? For example, the BLIP architecture consists of both CV and NLP components; one of its language models is a variant of T5, and the vanilla version of T5 is available in keras-nlp.


Actually, it might fall under keras-nlp (as image captioning did there)! https://www.tensorflow.org/tutorials/text/image_captioning cc @mattdangerw

ianstenbit commented 1 year ago

Good question @innat -- our plan for now is to have KerasCV depend on KerasNLP for models that require both CV and NLP components. So if there are relevant NLP components that don't exist in KerasNLP yet, we should strive to include them there, and KerasCV can depend on them as necessary.
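As a rough illustration of that dependency pattern, here is a captioning-style sketch where the text side reuses KerasNLP layers; the backbone choice, shapes, and hyperparameters are illustrative assumptions, not an actual BLIP port or KerasCV model:

```python
import keras_nlp
from tensorflow import keras

VOCAB_SIZE, SEQ_LEN, EMBED_DIM = 10_000, 32, 256

# Vision encoder: any image backbone producing a feature map we can
# flatten into a sequence of patch-like features.
backbone = keras.applications.ResNet50V2(include_top=False, pooling=None)
image = keras.Input(shape=(224, 224, 3))
features = backbone(image)                           # (batch, 7, 7, 2048)
features = keras.layers.Reshape((49, 2048))(features)
features = keras.layers.Dense(EMBED_DIM)(features)   # (batch, 49, 256)

# Text decoder: KerasNLP embedding plus a transformer decoder that
# cross-attends to the image features.
tokens = keras.Input(shape=(SEQ_LEN,), dtype="int32")
x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE, sequence_length=SEQ_LEN, embedding_dim=EMBED_DIM
)(tokens)
x = keras_nlp.layers.TransformerDecoder(
    intermediate_dim=512, num_heads=4
)(decoder_sequence=x, encoder_sequence=features)
logits = keras.layers.Dense(VOCAB_SIZE)(x)

captioner = keras.Model([image, tokens], logits)
```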