lllyasviel / ControlNet

Let us control diffusion models!

The type of control input of segmentation ControlNet? #187

Open ifsheldon opened 1 year ago

ifsheldon commented 1 year ago

Hi! Thanks for this awesome work!

I read through your paper on arXiv, but I have a small point of confusion about the segmentation ControlNet.

Specifically, what is the type of the control input (i.e., the segmentation)? Is it an ordinary segmentation map, in which each pixel has a class label that is either one-hot encoded or an integer? Or is it a color image, in which the class label of each pixel is color-coded?

When I read the paper, I presumed that an ordinary segmentation map is used directly as the control input (without preprocessing), but I saw in community news that some artists changed the colors of the segmentation to change the generated results. That makes me speculate that the control input of the segmentation ControlNet may be a color-coded segmentation map, and that it was also trained with color-coded segmentation images. But this does not make sense to me, since some labels such as "wall" and "road" in ADE20K have visually similar color codes (RGB #787878 and RGB #8C8C8C) while they have little semantic similarity.
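For concreteness, here is a minimal sketch of the two candidate input formats I mean (my own illustration, not code from this repo; the palette here is a random placeholder, not the real ADE20K palette):

```python
import numpy as np

num_classes = 150  # ADE20K has 150 classes

# Format 1: "ordinary" segmentation map, one integer class label per pixel
label_map = np.random.randint(0, num_classes, size=(64, 64), dtype=np.uint8)

# ... or its one-hot variant: shape (H, W, num_classes)
one_hot = np.eye(num_classes, dtype=np.float32)[label_map]

# Format 2: color-coded map, each label replaced by its palette RGB color,
# so the result is just a regular RGB image of shape (H, W, 3)
palette = np.random.randint(0, 256, size=(num_classes, 3), dtype=np.uint8)
color_map = palette[label_map]
```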

ifsheldon commented 1 year ago

I've dug into the code a little bit. It seems the control input is indeed a color-coded segmentation map.

https://github.com/lllyasviel/ControlNet/blob/d249f5bfc66c7af9b3102dccc2162c6d17270748/gradio_seg2image.py#L29

https://github.com/lllyasviel/ControlNet/blob/d249f5bfc66c7af9b3102dccc2162c6d17270748/annotator/uniformer/__init__.py#L22
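If I read the linked code correctly, the UniFormer annotator produces an integer label map and then renders it with the ADE20K palette, and only the resulting RGB image is fed to the ControlNet. Roughly (my paraphrase of the idea, not the repo's actual code):

```python
import numpy as np

def colorize(label_map: np.ndarray, palette: np.ndarray) -> np.ndarray:
    """Render an (H, W) integer label map as an (H, W, 3) RGB image.

    Conceptually this is what the annotator does before handing the
    control to the model: the ControlNet only ever sees this RGB image,
    never the class indices themselves.
    """
    return palette[label_map].astype(np.uint8)
```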

But this design can cause a bug, and we can observe it: when we draw a slightly complex scene directly with color codes, some classes get mixed together due to color similarity.

See the color-coded segmentation image below. [image: color-coded segmentation map]

Left: Hovel (#FF00FF). Right: Bus (#FF00F5).
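These two palette colors differ only in the blue channel, by 10 out of 255, so any blur or anti-aliasing at region boundaries makes them nearly indistinguishable. A quick check:

```python
import numpy as np

hovel = np.array([0xFF, 0x00, 0xFF], dtype=np.float32)  # #FF00FF
bus   = np.array([0xFF, 0x00, 0xF5], dtype=np.float32)  # #FF00F5

# Euclidean distance in RGB space: 10.0, out of a maximum of ~441.7
print(np.linalg.norm(hovel - bus))
```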

And the generated samples with the prompt "hovel and bus, masterpiece, high quality": [three generated sample images]

So we can see that the model does confuse "bus" with "hovel".

To fix this problem, I think an ad-hoc way could be: train an embedding matrix for the ADE20K classes with embedding dimension simply 3, and then map a discrete segmentation map to a feature map where each pixel holds the embedding corresponding to its label, as in the sketch below.
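A minimal PyTorch sketch of that idea (names like `seg_embedding` and `labels_to_control` are my own, not part of this repo):

```python
import torch
import torch.nn as nn

num_classes = 150  # ADE20K
seg_embedding = nn.Embedding(num_classes, 3)  # learned 3-d "color" per class

def labels_to_control(label_map: torch.LongTensor) -> torch.Tensor:
    """Map an (H, W) integer label map to a (3, H, W) control tensor.

    Unlike a fixed palette, the embeddings would be trained jointly with
    the ControlNet, so classes that must be distinguished can be pushed
    apart in the 3-d space instead of landing on near-identical colors.
    """
    feat = seg_embedding(label_map)  # (H, W, 3)
    return feat.permute(2, 0, 1)     # (3, H, W), shaped like an RGB control image

control = labels_to_control(torch.randint(0, num_classes, (512, 512)))
```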

geroldmeisinger commented 12 months ago

see here https://github.com/lllyasviel/ControlNet/issues/172