sail-sg / EditAnything

Edit anything in images powered by segment-anything, ControlNet, StableDiffusion, etc. (ACM MM)
Apache License 2.0

Train ControlNet - prompt for each segmented mask region? #5

Open alelordelo opened 1 year ago

alelordelo commented 1 year ago

Hi again! : )

Is it possible to train with a prompt for each segmented mask region?

Ex: Input image: house

Segmentation mask:

Prompts:

That would open up a lot of possibilities!

gasvn commented 1 year ago

Great idea. We will try it, and a pull request from your side is also very welcome.

AmanKishore commented 1 year ago

This would be wild!

alelordelo commented 1 year ago

> Great idea. We will try it, and a pull request from your side is also very welcome.

I would totally contribute if I could, but I come from the Swift/iOS world and am super limited with Python : /

alelordelo commented 1 year ago

Hi again @gasvn!

I saw you added this: "Generate semantic labels for each SAM mask." `python sam2semantic.py`

Would it be possible to train the ControlNet with those segmentation labels as the per-mask prompts?

gasvn commented 1 year ago

That's possible. We are working on it. For now, you can try our new Gradio demo. It combines inpainting and Edit Anything, so it can achieve most of the editing ability on a region under the guidance of a text prompt: https://huggingface.co/spaces/shgao/EditAnything

alelordelo commented 1 year ago

Thanks @gasvn, just tested your demo, super cool! Looking forward to testing multi-prompt training/inference! ; )

alelordelo commented 1 year ago

Hi @gasvn, any news on the segmented training? : )

gasvn commented 1 year ago

> Hi @gasvn, any news on the segmented training? : )

There is a concern about segmented training: I am afraid that a lack of training data would make the model collapse. So getting a text prompt for each segment is an important issue. For now, I am using BLIP-2-generated text prompts, but I am not sure if these are suitable for Stable Diffusion. Any suggestions? Thanks~
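To caption a region rather than the whole image, one option is to crop each mask's bounding box and blank out pixels outside the mask before handing the crop to a captioner such as BLIP-2. A minimal NumPy sketch of that cropping step (the `blip2_caption` call at the end is hypothetical, standing in for whatever captioning model is used):

```python
import numpy as np

def masked_crop(image, mask, background=255):
    """Crop a binary mask's bounding box from an (H, W, C) image and
    fill pixels outside the mask with a background value, so a
    captioner only sees the segmented region."""
    ys, xs = np.where(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    crop = image[y0:y1, x0:x1].copy()
    region = mask[y0:y1, x0:x1]
    crop[~region] = background  # hide everything outside the mask
    return crop

# Per-mask prompt, one caption per SAM mask (hypothetical captioner):
# prompt = blip2_caption(masked_crop(image, mask))
```

Cropping before captioning keeps the captioner focused on the region, which may reduce the global-context bleed that makes whole-image captions poor per-region prompts.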

alelordelo commented 1 year ago

I think the dataset would be:

- Input image
- Segmented masks
- Prompt for each segmented mask (either a manual label or automatically generated by OpenCLIP, BLIP, etc.)

Then you have text, image, and segmentation as conditioning.

I have a dataset like this that I could test. Do you see how I could train a model with this kind of setup?
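As a sketch of what one training record in such a dataset could look like (plain Python, with `make_record` being a hypothetical helper name, not part of this repo):

```python
def make_record(image_path, masks, prompts):
    """One training record: an image, its segmentation masks, and
    exactly one text prompt per mask (manual label or auto caption)."""
    if len(masks) != len(prompts):
        raise ValueError("need exactly one prompt per mask")
    return {
        "image": image_path,
        "masks": list(masks),
        "prompts": list(prompts),
    }

record = make_record(
    "house.jpg",
    masks=["roof_mask.png", "door_mask.png"],
    prompts=["a red tiled roof", "a wooden front door"],
)
```

The one-prompt-per-mask invariant is checked up front, since a silent misalignment between masks and prompts would corrupt the conditioning during training.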

alelordelo commented 1 year ago

Hi @gasvn, I did some research on this...

About "segment with text prompt": we could do a test with the COCO dataset's JSON annotations: https://cocodataset.org/#home

I want to give this a shot, but I'm not sure if it's currently possible to train with JSON -> image pairs?
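COCO's instance annotations already pair each segmentation with a category, so the category name can serve as a crude per-region prompt. A sketch of flattening a COCO-style annotation dict into per-image (segmentation, prompt) pairs (`coco_to_region_prompts` is a hypothetical helper, and real annotations would come from `json.load` on the COCO instances file):

```python
import json  # real usage: coco = json.load(open("instances_val2017.json"))

def coco_to_region_prompts(coco):
    """Map each image file to a list of (segmentation, prompt) pairs,
    using the COCO category name as the per-region text prompt."""
    cats = {c["id"]: c["name"] for c in coco["categories"]}
    imgs = {i["id"]: i["file_name"] for i in coco["images"]}
    pairs = {}
    for ann in coco["annotations"]:
        pairs.setdefault(imgs[ann["image_id"]], []).append(
            (ann["segmentation"], cats[ann["category_id"]])
        )
    return pairs
```

Category names are terse ("dog", "chair"), so for richer prompts they could later be replaced or augmented by captions from BLIP/OpenCLIP as discussed above.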

gasvn commented 1 year ago

Training with text JSON -> image pairs is possible. I think the ControlNet needs to be changed slightly so that each segmented region has a unique text prompt instead of just a global text prompt.
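One way to picture "a unique text prompt per region" is a spatial conditioning map: every pixel carries the text embedding of the mask covering it, with a global-prompt embedding elsewhere. This is only an illustrative NumPy sketch of the idea, not ControlNet's actual conditioning interface:

```python
import numpy as np

def spatial_text_conditioning(masks, text_embs, global_emb):
    """Build an (H, W, D) map where each pixel holds the text embedding
    of the mask covering it; pixels under no mask fall back to the
    global prompt embedding. Later masks overwrite earlier overlaps."""
    h, w = masks[0].shape
    cond = np.tile(global_emb, (h, w, 1))  # start from the global prompt
    for mask, emb in zip(masks, text_embs):
        cond[mask] = emb  # stamp this region's prompt embedding
    return cond
```

Such a map could be fed through the ControlNet conditioning branch alongside the segmentation itself, though how to inject it into cross-attention (versus the input hint) is exactly the open design question in this thread.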

alelordelo commented 1 year ago

Hi @gasvn, any plans for multiple prompts per mask segment?