ritwikraha / Open-Generative-Fill

A repository for hacking Generative Fill with Open Source Tools
MIT License

A way forward #2

Open ariG23498 opened 7 months ago

ariG23498 commented 7 months ago

The idea is to have generative fill with open source models!

The pipeline would contain the following moving parts (a rough sketch of the glue code follows the list):

  1. Input image (eg. an image of a dog running through a field)
  2. Input edit prompt: Replace the dog with a tiger
  3. A model that extracts objects from the edit_prompt: to_replace = dog, replace_with = tiger
  4. An image captioning model: a small dog running through a grassy field
  5. Simple string manipulation to update the caption with the replace_with object
  6. Getting the segmentation mask from the to_replace object
  7. Using the original image, the mask and the updated caption inside an inpainting pipeline to get the edited image.
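As a rough illustration of how these parts could hang together, here is a minimal glue-code sketch. All the callables passed in are hypothetical placeholders, not APIs from this repo; they stand in for whichever open-source captioning, extraction, segmentation, and inpainting models we end up wiring together.

```python
# Hypothetical glue code for the pipeline above; every callable is a placeholder.
def generative_fill(image, edit_prompt, captioner, extractor, segmenter, inpainter):
    # 3. pull the objects out of the edit prompt, e.g. ("dog", "tiger")
    to_replace, replace_with = extractor(edit_prompt)

    # 4. caption the input image, e.g. "a small dog running through a grassy field"
    caption = captioner(image)

    # 5. simple string manipulation to update the caption
    new_caption = caption.replace(to_replace, replace_with)

    # 6. segmentation mask for the object that should go away
    mask = segmenter(image, to_replace)

    # 7. inpaint with the original image, the mask and the updated caption
    return inpainter(image=image, mask=mask, prompt=new_caption)
```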

Space: https://huggingface.co/spaces/open-gen-fill/open-gen-fill-v1

What is the best way to tackle step 3? The idea is to support more complex edit prompts in the future while still being able to extract the to_replace and replace_with objects from the editing prompt.
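One low-effort way to attack step 3 (just a sketch; the checkpoint name below is an assumption, any small instruction-tuned chat model should do) is to few-shot an LLM through its chat template and parse the two objects out of its reply:

```python
# Sketch: few-shot an instruction-tuned LLM to extract (to_replace, replace_with).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-1.5B-Instruct"  # placeholder checkpoint, swap for any chat model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

edit_prompt = "Replace the dog with a tiger"
messages = [
    {"role": "system", "content": "Extract the object to replace and its replacement. Reply as: to_replace=<object>, replace_with=<object>"},
    {"role": "user", "content": "Swap the bottle with a headphone"},
    {"role": "assistant", "content": "to_replace=bottle, replace_with=headphone"},
    {"role": "user", "content": edit_prompt},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32)
reply = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
to_replace, replace_with = [part.split("=")[1].strip() for part in reply.split(",")]
```

Whether this holds up on the more complex edit prompts is exactly the open question, but the few-shot format at least keeps the parsing trivial.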

sayakpaul commented 7 months ago

Try InstructPix2Pix or MGIE first.

pedrogengo commented 7 months ago

I was experimenting with the project today and found it super useful. I was talking to @ritwikraha today, though, about some tweaks I need to make in the code so it works in some "edge" cases (maybe not too edge):

  1. The user may refer to an object differently than the captioner does. This leads to a failure when replacing the old object with the new one in the prompt. Example: the user asks to change the soda can to a bottle of wine, and the LLM parses it as soda can, bottle of wine. The captioner, however, outputs a pepsi can in front of mountains. During the replacement we then won't find soda can in the captioner's response, and the pipeline falls back to the original prompt. To solve this, we may need another LLM call to rewrite the prompt (or fold it into a single call after getting the caption), for example as below (the decoding step for this call is sketched after this list):
    
    messages = [
        {"role": "system", "content": "Follow the examples and return the expected output"},
        {"role": "user", "content": "Caption: a bottle of wine in a sunny landscape\nQuery: Swap the bottle with a headphone"},  # example 1
        {"role": "assistant", "content": "a headphone in a sunny landscape"},  # example 1
        {"role": "user", "content": "Caption: a dog playing in the garden\nQuery: Change the dog with cat"},  # example 2
        {"role": "assistant", "content": "a cat playing in the garden"},  # example 2
        {"role": "user", "content": f"Caption: {caption}\nQuery: {text_prompt}"},
    ]

    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

2. In the last part, during the inpainting pipeline, you are using a strength of 0.6. As we want to REPLACE the object, I think we should use a strength of 1.0, to avoid keeping any detail/shape of the old object.

```python
output = pipeline(
    prompt=prompt,
    image=Image.fromarray(image.astype(np.uint8)),
    mask_image=Image.fromarray(mask),
    height=1024,
    width=560,
    negative_prompt=negative_prompt,
    guidance_scale=7.5,
    strength=1.0,
).images[0]
```
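To round out point 1 above: once model_inputs is built, the rewritten caption still has to be generated and decoded. A minimal sketch, assuming model is the chat LLM loaded alongside the tokenizer, with 64 new tokens as an arbitrary cap:

```python
# Generation/decoding step for the caption-rewriting call from point 1.
# Assumes `model` is the instruction-tuned LLM loaded alongside `tokenizer`.
generated_ids = model.generate(**model_inputs, max_new_tokens=64)
# Keep only the newly generated tokens, i.e. drop the prompt.
new_tokens = generated_ids[0][model_inputs["input_ids"].shape[1]:]
updated_caption = tokenizer.decode(new_tokens, skip_special_tokens=True)
```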
ritwikraha commented 7 months ago

@pedrogengo thank you for the awesome suggestions. I have incorporated your feedback here; we will work on getting this into the main pipeline and also on improving the prompt for the SD model.

sayakpaul commented 7 months ago

@ritwikraha can you share some images and edit instructions (the original, unmodified edit instructions)? I want to try them out on MGIE.

ritwikraha commented 7 months ago

> @ritwikraha can you share some images and edit instructions (the original, unmodified edit instructions)? I want to try them out on MGIE.

Sure, here you go

# an image of a dog
url_dog = "https://i.imgur.com/CiAbKbS.jpg"
edit_prompt_dog = "replace the dog with a walrus"
# image of the london bridge
url_bridge = "https://i.imgur.com/qrN1OzK.jpg" 
edit_prompt_bridge = "change the bridge to chocolates"
# image of a beer bottle on a bed
url_bottle = "https://i.imgur.com/4ujXoav.jpg"
edit_prompt_bottle = "replace the bottle with a firecracker"
# image of a Lamborghini by the side of a building
url_car = "https://i.imgur.com/7zWdoN1.jpg"
edit_prompt_car = "change the Lamborghini to a rolls royce phantom"
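If it helps to reproduce these locally, here is a tiny loading sketch (plain requests + PIL, nothing specific to this repo; the dog pair is just used as the example):

```python
# Fetch one of the example images above so it can be paired with its edit prompt.
import io
import requests
from PIL import Image

def load_image(url: str) -> Image.Image:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return Image.open(io.BytesIO(response.content)).convert("RGB")

image_dog = load_image(url_dog)  # goes with edit_prompt_dog
```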
sayakpaul commented 6 months ago

Some results with MGIE:

| Edit Instruction | Input Edit Instruction | Input Image | Edited Image |
| --- | --- | --- | --- |
| replace the dog with a walrus | If we were to replace the dog in the image with a large, powerful, and majestic walrus, the scene would take on a completely different and surreal aspect. The walrus would be seen walking through the grassy field with a serene and calm demeanor, possibly enjoying the sunlight and the outdoors. | (image) | (image) |
| change the bridge to chocolates | If the bridge in the image were replaced with chocolate, it would create a whimsical and visually appealing scene. The bridge would be replaced by a large choculate, resembling a delicious chococolate treat. | (image) | (image) |
| replace the bottle with a firecracker | If we replaced the Guinness stout bottle in the image with a sparkler or a fire cracker, the scene would be completely different and more dramatic. The sparkler would create a bright, continuous stream of sparks, illuminating the surrounding area and possibly the bed. | (image) | (image) |
| change the Lamborghini to a rolls royce phantom | The image would feature a sleek, high-end car parked in front of a building, possibly a luxury car dealership or a prestigious event venue. The car's design and color would be more sophisticated and refined, reflecting the high-quality and prestigured nature of the brand. | (image) | (image) |

I had to do aspect-ratio preserving resizing to the input images because they were very high-resolution. Source of truth: https://github.com/apple/ml-mgie/blob/main/demo.ipynb.
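For reference, one way to do that kind of resize (a small sketch with Pillow; the 1024-pixel cap is my own assumption, not the value used in the MGIE notebook):

```python
from PIL import Image

def resize_keep_aspect(img: Image.Image, max_side: int = 1024) -> Image.Image:
    # Downscale so the longer side is at most `max_side`, preserving aspect ratio.
    scale = max_side / max(img.size)
    if scale >= 1.0:
        return img  # already small enough
    new_size = (round(img.width * scale), round(img.height * scale))
    return img.resize(new_size, Image.LANCZOS)
```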

ritwikraha commented 6 months ago

Wow, this looks fantastic! The input edit instructions look very interesting; I will play around with these and include a notebook. Thanks again ❤️

hipsterusername commented 6 months ago

Very cool project all.

Invoke team would love to recreate this workflow for folks in Invoke once the path forward is laid out.

Happy to help support however we can.

ritwikraha commented 6 months ago

> Very cool project all.
>
> Invoke team would love to recreate this workflow for folks in Invoke once the path forward is laid out.

Thanks for the nice gesture @hipsterusername, how do you plan to integrate it?

> Happy to help support however we can.

Currently we are looking to:

  • Fix issues with the captioning model and entity extraction
  • Make the SDXL model more robust with the generations
  • Optimize the models to run with as few resources as possible
  • Handle complex edit_prompt from the user

As we move forward we will be updating both the README and the issues to keep everyone updated. PRs, Issues, ideas and discussions are always welcome since we will continue to build this in public ✨

sayakpaul commented 6 months ago

Why not try to push the limits with MGIE, since it already shows progress right off the bat?

hipsterusername commented 6 months ago

> Thanks for the nice gesture @hipsterusername, how do you plan to integrate it?

If we can validate it generally captures intent, we have a “unified canvas” feature (in our OSS UI) where this would make sense. If you’re interested in partnering on that, don’t hesitate to reach out.

Will keep an eye on progress!

ariG23498 commented 6 months ago

> Why not try to push the limits with MGIE, since it already shows progress right off the bat?

We have a roadmap to include InstructPix2Pix and MGIE in future iterations, but for now we want to push the boundaries of all the individual modules for generative fill. This has a direct correspondence with the way we (w/ @ritwikraha) thought about the project in the initial phase: we did not want to implement a paper here, but rather to have fun with different modules, stumble upon some problems, and learn what works and what does not.

sayakpaul commented 6 months ago

That is admirable. But I think there’s some misunderstanding here. I am not asking to implement a paper. I am simply asking to see if it’s possible to achieve what you’re envisioning with the existing checkpoints of MGIE and InstructPix2Pix because these simplify the pipeline, thereby also reducing the latency. Furthermore, these were trained particularly for image editing with natural language constructs.

Your current workflow will always carry the complexity (both conceptual and in terms of speed) introduced by the number of modules it chains together. In a way, it is still leveraging existing modules, which you are collating together rather than implementing from scratch.

I think both of the above approaches have merits and demerits. In the end, I think it's a good plan to show the different approaches, their outputs, their compute costs, etc., so that the project becomes a little more actionable and extensible for the community.

Hope that clarifies things ✌️

ariG23498 commented 6 months ago

@sayakpaul that does clarify a lot!

Thanks for the detailed reply.