ifsheldon opened 1 year ago
Hi! Thanks for this awesome work!

I read through your paper on arXiv, but I have a small point of confusion about the segmentation ControlNet.

Specifically, what is the type of the control input (i.e., the segmentation)? Is it an ordinary segmentation map, in which each pixel carries a class label that is either one-hot encoded or an integer? Or is it a color image, in which each pixel's class label is color-coded?

When I read the paper, I presumed that an ordinary segmentation map is used as the direct control input (without preprocessing). But I saw in community news that some artists changed the colors of a segmentation to change the generated artifacts. That makes me speculate that the control input of the segmentation ControlNet is a color-coded segmentation, and that it was also trained on color-coded segmentation images. However, this does not make sense to me, since some labels in ADE20K, such as "wall" and "road", have visually similar color codes (RGB #787878 and RGB #8C8C8C) while having little semantic similarity.
I've dug into the code a little bit. It seems the control input is indeed a color-coded segmentation.

But this can lead to a bug, and we can observe it: when we draw a slightly complex scene directly with color codes, some classes get mixed together due to color similarity. See the color-coded segmentation image below.

Left: hovel #FF00FF. Right: bus #FF00F5. (These two colors differ by only 10/255 in the blue channel.)

And the samples generated with the prompt "hovel and bus, masterpiece, high quality":

So we can see that the model does confuse "bus" with "hovel".

To fix this problem, I think an ad-hoc way could be: train an embedding matrix for the ADE20K classes with an embedding dimension of just 3, then map a discrete segmentation map to a feature map in which each pixel holds the embedding of its class label.