lllyasviel / ControlNet

Let us control diffusion models!

[Question] Training a camera position ControlNet? #699

Open arthurwolf opened 1 month ago

arthurwolf commented 1 month ago

Hello!

Thanks for the amazing project.

I'm often in the situation where I've generated a scene I really like, but I'd like to rotate the camera a bit more to the right, or zoom in, or put the camera a bit higher up, etc.

Currently, the only way I found to do this would be to generate a 3D model of the scene (possibly automatically from a controlnet-generated depth map?), rotate that, generate a new depth map, and use that new depth map to regenerate the image.
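
To illustrate what I mean by that pipeline, here is a very rough sketch (just my own guess at how it could look): back-project a depth map to a point cloud, rotate it, and re-project it into a new depth map. The intrinsics fx, fy, cx, cy are made-up placeholders and there's no hole filling or occlusion handling.

```python
import numpy as np

def rotate_depth_map(depth, fx, fy, cx, cy, yaw_deg):
    """Back-project a depth map to a 3D point cloud, rotate it around the
    camera's vertical axis, and re-project it into a new depth map.
    Purely a sketch of the idea, not production code."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    # Back-project every pixel into 3D camera coordinates.
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)

    # Rotate the point cloud around the y axis by yaw_deg (a camera pan).
    a = np.deg2rad(yaw_deg)
    R = np.array([[np.cos(a), 0, np.sin(a)],
                  [0,         1, 0        ],
                  [-np.sin(a), 0, np.cos(a)]])
    pts = pts @ R.T

    # Re-project into the same camera and rasterize a new depth map,
    # keeping the nearest point per pixel (crude z-buffer, no hole filling).
    new_depth = np.full((h, w), np.inf)
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    valid = z > 1e-6
    u2 = np.round(x[valid] * fx / z[valid] + cx).astype(int)
    v2 = np.round(y[valid] * fy / z[valid] + cy).astype(int)
    z2 = z[valid]
    inside = (u2 >= 0) & (u2 < w) & (v2 >= 0) & (v2 < h)
    for ui, vi, zi in zip(u2[inside], v2[inside], z2[inside]):
        if zi < new_depth[vi, ui]:
            new_depth[vi, ui] = zi
    new_depth[np.isinf(new_depth)] = 0.0
    return new_depth
```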

But that's a pretty cumbersome, multi-step workflow.

Another option somebody suggested was training LoRAs on specific angles, and having many LoRAs for many different angles/camera positions. But again, that's pretty cumbersome (and a lot of training). Also, I'm not even sure it would work.

Or train a single LoRA with a dataset that maps many different angle "keywords" to images rendered from many different camera positions? As you can see, I'm a bit lost.

I figure what I really want to do is manipulate the part of the model's "internal conception" of the scene that defines its rotation (if there is such a thing...). There has to be some set of weights that determines whether we look at a subject from the front or the back, whether a face is seen in profile or three-quarter view, etc.

So my question is: would it be possible to create a ControlNet that does this?

The main problem I see is that ControlNet training, as described in

https://github.com/lllyasviel/ControlNet/blob/main/docs/train.md

takes images as input.

But the input in my case wouldn't be an image, it would be an angle.
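
The only workaround I can think of for that (and this is pure speculation on my part) would be to bake the few camera scalars into a synthetic hint image, e.g. one constant-valued horizontal band per parameter, so the trainer still receives an "image". Something like this (the parameter names and the encoding scheme are made up):

```python
import numpy as np

def camera_params_to_hint(params, size=512):
    """Encode a handful of camera scalars (each normalized to 0..1) as a
    synthetic ControlNet hint image: one constant horizontal band per
    parameter. Purely a sketch; the encoding scheme is invented here."""
    values = list(params.values())
    hint = np.zeros((size, size, 3), dtype=np.float32)
    band = size // len(values)
    for i, v in enumerate(values):
        hint[i * band:(i + 1) * band, :, :] = v  # constant gray level = value
    return hint

# Hypothetical camera description, normalized to 0..1.
hint = camera_params_to_hint({
    "rotation": 0.5,
    "height": 0.5,
    "distance": 0.3,
})
```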

So my best guess at how to do this (and it's likely completely wrong) would be:

  1. Take a 3D scene.
  2. Render it at a specific angle/zoom/camera position.
  3. Take that generated image, plus a text description of the camera position: angle212 height1.54 etc. Or maybe (angle 0.3) (height 0.25), i.e. play with the strength of the tokens? Something like that.
  4. Add each pair of generated image and corresponding position text to the dataset (completely ignoring the "black and white" input image).
  5. Generate thousands of pairs and train (there's a rough sketch of this loop after the example below).

Example image: isometric-scene-with-3d-grocery-store-shop-vector-25647334

grocery store, shop, vector graphic, rotation-0.5, height-0.5, distance-0.3, sunrotation-0.2, sunheight-0.5, sundistance-1.0, 
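
To make steps 3 to 5 concrete, here is roughly the dataset-generation loop I imagine, following the source/target/prompt.json layout from docs/train.md (the fill50k example) and the example prompt above. render_scene is just a placeholder for whatever renderer I'd actually use (e.g. a Blender script), and the parameter names are made up:

```python
import json
import os
import random
from PIL import Image

def render_scene(rotation, height, distance, size=512):
    """Placeholder renderer: in practice this would be a Blender/bpy script
    that positions the camera from (rotation, height, distance) and renders
    the 3D scene. Here it just returns a blank image so the loop runs."""
    return Image.new("RGB", (size, size), color=(127, 127, 127))

root = "training/camera"
os.makedirs(f"{root}/source", exist_ok=True)
os.makedirs(f"{root}/target", exist_ok=True)

# A single dummy "black and white" hint image, reused for every sample,
# since in this scheme the camera info lives entirely in the prompt text.
Image.new("RGB", (512, 512), color=(0, 0, 0)).save(f"{root}/source/black.png")

with open(f"{root}/prompt.json", "w") as f:
    for i in range(10000):
        rotation, height, distance = (random.random() for _ in range(3))
        render_scene(rotation, height, distance).save(f"{root}/target/{i}.png")
        entry = {
            "source": "source/black.png",
            "target": f"target/{i}.png",
            "prompt": (f"grocery store, shop, vector graphic, "
                       f"rotation-{rotation:.2f}, height-{height:.2f}, "
                       f"distance-{distance:.2f}"),
        }
        f.write(json.dumps(entry) + "\n")
```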

Would this work? Does it have any chance of working? If not, is there another approach that would?

Would a single 3D scene (or even a dumb cube on a plane) work, or do I need a large variety of scenes?

I would love some kind of input/feedback/advice on this.

Thanks so much to anyone who takes the time to reply.

Cheers.