Open ariG23498 opened 7 months ago
Try InstructPix2Pix or MGIE first.
I was experimenting with the project today and found it super useful. I was talking to @ritwikraha today, though, about some tweaks I need to make in the code so it works in some "edge" cases (maybe not too edge).
1. The user may call an object differently than the captioner does. This leads to a failure to replace the old object with the new one in the prompt. Example: the user asks to change the soda can to a bottle of wine, and the LLM parses it as `soda can, bottle of wine`. The captioner, however, outputs `a pepsi can in front of mountains`. Thus, during the replacement, we won't find `soda can` in the captioner response, and the original prompt will be reused. To solve this, we may need another LLM call to rewrite the prompt (or a single call after getting the prompt):
```python
messages = [
    {"role": "system", "content": "Follow the examples and return the expected output"},
    {"role": "user", "content": "Caption: a bottle of wine in a sunny landscape\nQuery: Swap the bottle with a headphone"},  # example 1
    {"role": "assistant", "content": "a headphone in a sunny landscape"},  # example 1
    {"role": "user", "content": "Caption: a dog playing in the garden\nQuery: Change the dog with cat"},  # example 2
    {"role": "assistant", "content": "a cat playing in the garden"},  # example 2
    {"role": "user", "content": f"Caption: {caption}\nQuery: {text_prompt}"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)
```
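To make the failure mode concrete, here is a minimal sketch (the `naive_swap` helper is hypothetical, standing in for the string-replacement step described above, not the project's actual code):

```python
def naive_swap(caption: str, to_replace: str, replace_with: str) -> str:
    # If the user's phrase is absent from the caption, the caption
    # comes back unchanged and the edit is silently lost.
    if to_replace in caption:
        return caption.replace(to_replace, replace_with)
    return caption  # falls back to the original prompt

# The user says "soda can", but the captioner wrote "pepsi can":
caption = "a pepsi can in front of mountains"
print(naive_swap(caption, "soda can", "a bottle of wine"))
# -> "a pepsi can in front of mountains" (unchanged)
```

The few-shot LLM rewrite above sidesteps this because the model matches the user's object semantically rather than by exact substring.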
2. In the last part, during the inpainting pipeline, you are using a strength of 0.6. Since we want to REPLACE the object, I think we should use a strength of 1.0, to avoid keeping any detail/shape from the old object.
```python
output = pipeline(
    prompt=prompt,
    image=Image.fromarray(image.astype(np.uint8)),
    mask_image=Image.fromarray(mask),
    height=1024,
    width=560,
    negative_prompt=negative_prompt,
    guidance_scale=7.5,
    strength=1.0,
).images[0]
```
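As a rough illustration of why `strength` matters: in diffusers img2img/inpaint pipelines, a strength below 1.0 skips part of the noise schedule, so some structure of the original masked content survives. This helper mimics that arithmetic (my sketch, not the library's code verbatim):

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    # Roughly how the step count is derived from `strength`:
    # only this many of the scheduled denoising steps actually run.
    return min(int(num_inference_steps * strength), num_inference_steps)

print(effective_steps(50, 0.6))  # 30 of 50 steps -> old shape partially kept
print(effective_steps(50, 1.0))  # all 50 steps -> region fully re-generated
```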
@pedrogengo thank you for the awesome suggestions! I have incorporated your feedback here; we will work on getting this into the main pipeline and also on improving the prompt for the SD model.
@ritwikraha can you share some images and edit instructions (original edit instructions and unmodified). I want to try them out on MGIE.
Sure, here you go
```python
# an image of a dog
url_dog = "https://i.imgur.com/CiAbKbS.jpg"
edit_prompt_dog = "replace the dog with a walrus"

# image of the London bridge
url_bridge = "https://i.imgur.com/qrN1OzK.jpg"
edit_prompt_bridge = "change the bridge to chocolates"

# image of a beer bottle on a bed
url_bottle = "https://i.imgur.com/4ujXoav.jpg"
edit_prompt_bottle = "replace the bottle with a firecracker"

# image of a Lamborghini by the side of a building
url_car = "https://i.imgur.com/7zWdoN1.jpg"
edit_prompt_car = "change the Lamborghini to a rolls royce phantom"
```
Some results with MGIE:

| Edit Instruction | Input Edit Instruction | Input Image | Edited Image |
|---|---|---|---|
| replace the dog with a walrus | If we were to replace the dog in the image with a large, powerful, and majestic walrus, the scene would take on a completely different and surreal aspect. The walrus would be seen walking through the grassy field with a serene and calm demeanor, possibly enjoying the sunlight and the outdoors. | | |
| change the bridge to chocolates | If the bridge in the image were replaced with chocolate, it would create a whimsical and visually appealing scene. The bridge would be replaced by a large chocolate, resembling a delicious chocolate treat. | | |
| replace the bottle with a firecracker | If we replaced the Guinness stout bottle in the image with a sparkler or a firecracker, the scene would be completely different and more dramatic. The sparkler would create a bright, continuous stream of sparks, illuminating the surrounding area and possibly the bed. | | |
| change the Lamborghini to a rolls royce phantom | The image would feature a sleek, high-end car parked in front of a building, possibly a luxury car dealership or a prestigious event venue. The car's design and color would be more sophisticated and refined, reflecting the high-quality and prestigious nature of the brand. | | |
I had to do aspect-ratio preserving resizing to the input images because they were very high-resolution. Source of truth: https://github.com/apple/ml-mgie/blob/main/demo.ipynb.
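The aspect-ratio-preserving resize can be sketched as pure size arithmetic (the `resize_keep_aspect` helper and the `max_side` cap are my illustration, not from the MGIE notebook; the actual resize would then call e.g. `PIL.Image.resize` with the computed size):

```python
def resize_keep_aspect(width: int, height: int, max_side: int = 1024) -> tuple:
    # Scale so the longer side equals max_side, preserving the aspect ratio.
    scale = max_side / max(width, height)
    return round(width * scale), round(height * scale)

# e.g. a 4032x3024 photo maps onto 1024x768
print(resize_keep_aspect(4032, 3024))
```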
Wow, this looks fantastic! The input edit instructions look very interesting; I will play around with this and include a notebook. Thanks again ❤️
Very cool project all.
Invoke team would love to recreate this workflow for folks in Invoke once the path forward is laid out.
Happy to help support however we can.
Thanks for the nice gesture @hipsterusername, how do you plan to integrate it?
Currently we are looking to:

- Fix issues with the captioning model and entity extraction
- Make the SDXL model more robust with the generations
- Optimize the models to run with as few resources as possible
- Handle complex `edit_prompt`s from the user

As we move forward we will be updating both the README and the issues to keep everyone updated. PRs, issues, ideas, and discussions are always welcome since we will continue to build this in public ✨
Why not try to push the limits with MGIE, since it already shows progress right off the bat?
If we can validate it generally captures intent, we have a “unified canvas” feature (in our OSS UI) where this would make sense. If you’re interested in partnering on that, don’t hesitate to reach out.
Will keep an eye on progress!
We have a roadmap to include InstructPix2Pix and MGIE in future iterations, but for now we want to push the boundaries of all the individual modules for generative fill. This has a direct correspondence with the way we (w/ @ritwikraha) thought about the project in the initial phase. We did not want to implement a paper here; rather, we wanted to have fun with different modules, stumble upon some problems, and learn what works and what does not.
That is admirable. But I think there’s some misunderstanding here. I am not asking to implement a paper. I am simply asking to see if it’s possible to achieve what you’re envisioning with the existing checkpoints of MGIE and InstructPix2Pix because these simplify the pipeline, thereby also reducing the latency. Furthermore, these were trained particularly for image editing with natural language constructs.
Your current workflow will always suffer from the complexity (both conceptual and in terms of speed) introduced by the number of modules you are adding to it. In a way, it still leverages existing modules, collated together rather than implemented from scratch.
I think both the above approaches have merits and demerits. At the end, I think it’s a good plan to show different approaches, their outputs, their compute costs, etc. so that the project becomes a little more actionable and extensible to the community.
Hope that clarifies things ✌️
@sayakpaul that does clarify a lot!
Thanks for the detailed reply.
The idea is to have generative fill with open source models!
The pipeline would contain the following moving parts:
- Parsing the `edit_prompt` into the objects involved, e.g. `to_replace = dog`, `replace_with = tiger`
- Generating the `replace_with` object
- Masking out the `to_replace` object

Space: https://huggingface.co/spaces/open-gen-fill/open-gen-fill-v1
What is the best way to tackle #3? The idea is to support complex `edit_prompt`s in the future while still being able to extract the `to_replace` and `replace_with` objects from the editing prompt.
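One lightweight baseline for this extraction (a heuristic sketch of mine, not the project's implementation) is to pattern-match the common "replace/swap/change X with/to Y" constructions, and hand anything else off to an LLM call:

```python
import re

# Hypothetical heuristic parser; complex phrasings would fall back to an LLM.
PATTERNS = [
    re.compile(r"(?:replace|swap)\s+(?:the\s+)?(?P<to_replace>.+?)\s+with\s+(?:a\s+|an\s+)?(?P<replace_with>.+)", re.I),
    re.compile(r"change\s+(?:the\s+)?(?P<to_replace>.+?)\s+(?:to|with)\s+(?:a\s+|an\s+)?(?P<replace_with>.+)", re.I),
]

def parse_edit_prompt(edit_prompt):
    for pat in PATTERNS:
        m = pat.match(edit_prompt.strip())
        if m:
            return m.group("to_replace"), m.group("replace_with")
    return None  # complex prompt: defer to an LLM

print(parse_edit_prompt("replace the dog with a walrus"))    # ('dog', 'walrus')
print(parse_edit_prompt("change the bridge to chocolates"))  # ('bridge', 'chocolates')
```

A regex pass like this keeps latency near zero for the easy cases while reserving the LLM for prompts the patterns cannot handle.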