lllyasviel / ControlNet

Let us control diffusion models!

[Double Control] What double-control model is most needed? #31

Open · lllyasviel opened this issue 1 year ago

lllyasviel commented 1 year ago

Discussed in https://github.com/lllyasviel/ControlNet/discussions/30

Originally posted by **lllyasviel** February 12, 2023

We plan to train some models with "double controls", using two concatenated control maps, and we are considering using images with holes as the second control map. This would lead to models like "depth-aware inpainting" or "canny-edge-aware inpainting". Please also let us know if you have good suggestions.
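A minimal sketch of how such a "double control" hint could be assembled, assuming a depth map plus an image-with-holes as the two controls. The `make_double_hint` helper and the channel layout are illustrative guesses, not the repository's actual training code:

```python
import torch

def make_double_hint(depth: torch.Tensor, image: torch.Tensor,
                     mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: stack both controls into one hint tensor.
    depth: (B, 1, H, W) in [0, 1]; image: (B, 3, H, W) in [-1, 1];
    mask: (B, 1, H, W), 1 where the region should be inpainted."""
    holed = image * (1.0 - mask)                   # punch holes into the image
    return torch.cat([depth, holed, mask], dim=1)  # (B, 5, H, W) hint

# A hint encoder consuming this would need 5 input channels instead of 3
# (an assumption about how the concatenation is wired in):
depth = torch.rand(1, 1, 64, 64)
image = torch.rand(1, 3, 64, 64) * 2 - 1
mask = (torch.rand(1, 1, 64, 64) > 0.5).float()
hint = make_double_hint(depth, image, mask)        # torch.Size([1, 5, 64, 64])
```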

This is a re-post. Please go to the discussion thread for discussion.

ajundo commented 1 year ago

I guess pose + single-source-image control would be useful, at least for anime. Although a custom-character DreamBooth model with https://github.com/lllyasviel/ControlNet/discussions/12 seems to work, single-image pose shifting is really attractive to me.

batrlatom commented 1 year ago

1) Depth + segmentation? For example, I would like to render a movie scene.

2) The t-1 rendered frame + the t+1 keyframe? This would help when rendering movies in anime style, where you want temporal stability in the output. When I try naive per-frame img2img, each output frame is slightly different and looks quite noisy (see the hint-stacking sketch after this list). Take a look at my video made with InstructPix2Pix: https://www.reddit.com/r/StableDiffusion/comments/10x4fkr/pip2pix_marble_terminator/?utm_source=share&utm_medium=web2x&context=3

3) Novel view synthesis? Given one, two, or more images of an object, generate a new view of the same object. For example, I have generated a sneaker image and now I want to generate new views of it to be able to manufacture it. Example: https://thissneakerdoesnotexist.com/3d-info/
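A hedged sketch of suggestion 2) above: stack the previously rendered frame and the upcoming keyframe into one hint that a retrained hint encoder could consume. No such temporal model ships with ControlNet; `temporal_hint` is a made-up helper:

```python
import torch

def temporal_hint(prev_frame: torch.Tensor, keyframe: torch.Tensor) -> torch.Tensor:
    """Both inputs: (B, 3, H, W) in [-1, 1]. Returns a (B, 6, H, W) hint,
    assuming a hint encoder retrained to accept 6 input channels."""
    return torch.cat([prev_frame, keyframe], dim=1)

prev = torch.rand(1, 3, 64, 64) * 2 - 1   # frame t-1, already rendered
key = torch.rand(1, 3, 64, 64) * 2 - 1    # keyframe t+1, the style anchor
hint = temporal_hint(prev, key)           # torch.Size([1, 6, 64, 64])
```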

sam598 commented 1 year ago

Is this simply concatenating additional input channels onto the hint image, or actually combining two separately trained control networks?

I would see both as extremely useful.
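For what it's worth, the two readings could be sketched like this. `ToyControlNet` and the per-control weights are stand-ins invented for illustration, not ControlNet's real modules (which return one residual per U-Net block rather than a single tensor):

```python
import torch
import torch.nn as nn

class ToyControlNet(nn.Module):
    """Stand-in that maps a hint image to a single residual feature map."""
    def __init__(self, hint_channels: int):
        super().__init__()
        self.encode = nn.Conv2d(hint_channels, 32, kernel_size=3, padding=1)

    def forward(self, hint: torch.Tensor) -> torch.Tensor:
        return self.encode(hint)

canny = torch.rand(1, 3, 64, 64)   # edge map rendered into 3 channels
depth = torch.rand(1, 3, 64, 64)   # depth map rendered into 3 channels

# (a) One network with extra hint channels, trained jointly on both maps:
joint = ToyControlNet(hint_channels=6)
residual_joint = joint(torch.cat([canny, depth], dim=1))

# (b) Two separately trained networks; their residuals are summed
# (optionally weighted) before being added into the frozen U-Net:
net_canny, net_depth = ToyControlNet(3), ToyControlNet(3)
w_canny, w_depth = 0.7, 0.5        # per-control strengths, freely chosen
residual_sum = w_canny * net_canny(canny) + w_depth * net_depth(depth)
```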

josephrocca commented 1 year ago

Potentially a naive question, but I'm wondering about using vector inputs such as FaceNet/CLIP embeddings as a second control, rather than spatial inputs like depth or edges?
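One way a vector control could be wired in, sketched under heavy assumptions: the `VectorControlAdapter` class, the 768/1280 dimensions, and injection via the timestep embedding are all guesses, not anything ControlNet implements:

```python
import torch
import torch.nn as nn

class VectorControlAdapter(nn.Module):
    """Hypothetical: project a CLIP image embedding and add it to the
    timestep embedding, so a non-spatial control reaches every block."""
    def __init__(self, embed_dim: int = 768, time_embed_dim: int = 1280):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(embed_dim, time_embed_dim),
            nn.SiLU(),
            nn.Linear(time_embed_dim, time_embed_dim),
        )

    def forward(self, t_emb: torch.Tensor, clip_emb: torch.Tensor) -> torch.Tensor:
        return t_emb + self.proj(clip_emb)

adapter = VectorControlAdapter()
t_emb = torch.rand(1, 1280)    # timestep embedding from the U-Net
clip_emb = torch.rand(1, 768)  # CLIP image embedding of, e.g., a reference face
conditioned = adapter(t_emb, clip_emb)   # torch.Size([1, 1280])
```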