pytti-tools / pytti-book

PyTTI Documentation and Tutorials
https://pytti-tools.github.io/pytti-book/intro.html

[tutorial] init images and semantic vs direct prompting #41

Open dmarx opened 2 years ago

dmarx commented 2 years ago

https://discord.com/channels/869630568818696202/899135695677968474/976590248387698748

dmarx commented 2 years ago
  1. The init image is just used to initialize the generation process instead of noise. All steering is achieved by the text prompt; the init image only directly impacts the first frame of the generation process (a sketch of this follows the list below).

    scenes: "a photograph of a cat"
    init_image: dog.png
  2. "direct" weight added to the init image to discourage the individual pixels in the image from changing from their current values. The init image "directly" impacts the generation of all frames in the scene.

    scenes: "a photograph of a cat"
    init_image: dog.png
    direct_init_weight: 1
  3. Using a semantic weight instead changes the behavior a lot. Like in (1), the actual pixel values of the init image only directly inform the first frame. The rest of the frames will still be informed by the information in the image, similarly to someone describing the image out loud without you ever getting to look at it: "this is a picture of a dog", "this is a photograph", "it is daytime", etc. But we lose all the positional information, which we kept in (2).

    scenes: "a photograph of a cat"
    init_image: dog.png
    semantic_init_weight: 1
  4. We can mix and match the effects here however we want.

    scenes: "a photograph of a cat"
    init_image: dog.png
    semantic_init_weight: 1
    direct_init_weight: 1
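
For intuition, here's a minimal Python sketch of what scenario (1) means. This is not pytti's actual code (the helper name and image size are made up for illustration): the point is just that, with no weights set, the init image only replaces the random-noise starting canvas.

```python
# Minimal sketch, not pytti internals: the init image only replaces the
# noise that would otherwise seed the very first frame.
import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor

def initial_canvas(init_image_path=None, size=(512, 512)):
    """Start from the init image if one is given, otherwise from random noise."""
    if init_image_path is not None:
        img = Image.open(init_image_path).convert("RGB").resize(size)
        return to_tensor(img).unsqueeze(0)   # shape (1, 3, H, W), values in [0, 1]
    return torch.rand(1, 3, *size)           # pure noise start

canvas = initial_canvas("dog.png")  # scenario 1: dog.png only seeds the first frame
```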
dmarx commented 2 years ago

So let's say you've got an init image and a text prompt. The generation will load up that init image the same as if it were a "previous frame", but then when it starts doing its thing, it's gonna basically ignore the content of the image and just start trying to manipulate the image towards the text prompt. pytti achieves this by converting the text into a "semantic" representation, i.e. a bunch of numbers that carries a bunch of the informational content of the text. pytti does the same thing to the image you're generating and tries to move the image's semantic representation close to the prompt's. That's how CLIP guidance works.
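
To make that concrete, here's a minimal sketch of CLIP guidance assuming OpenAI's `clip` package. pytti's real implementation adds cutouts, augmentations, and more, so treat this as a toy version:

```python
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def semantic_loss(generated_image, prompt="a photograph of a cat"):
    """Push the generated image's CLIP embedding toward the prompt's embedding.

    `generated_image` is assumed to already be resized/normalized the way CLIP
    expects (pytti handles that step with cutouts and augmentations).
    """
    with torch.no_grad():
        text_emb = model.encode_text(clip.tokenize([prompt]).to(device))
    image_emb = model.encode_image(generated_image)  # gradients flow through here
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    return (1 - (image_emb * text_emb).sum(dim=-1)).mean()  # 1 - cosine similarity
```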

So CLIP can take either text or an image and represent those things in the same semantic space, which means that you can use the semantic content of images to steer the generation process exactly the same way as you do with text. We consequently have two ways we can use images for the steering process. We can be very literal and old school with what is often called a "reconstruction loss", which means we compare the image we generated with the steering image pixel-by-pixel, and try to nudge the generation to be close to the original picture at each pixel position. This is what pytti calls "direct" prompting or stabilization. Alternatively, we can use the image as a source of information content, like we would with text. This is what pytti calls "semantic" prompting/stabilization.
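
A rough sketch of those two losses (the function names are illustrative, not pytti's internals):

```python
import torch.nn.functional as F

def direct_loss(generated, steering_image):
    """'Direct' / reconstruction loss: compare the images pixel-by-pixel."""
    return F.mse_loss(generated, steering_image)

def semantic_image_loss(generated_emb, steering_emb):
    """'Semantic' loss: compare CLIP embeddings, exactly like a text prompt would."""
    cos = F.cosine_similarity(generated_emb, steering_emb, dim=-1)
    return (1 - cos).mean()
```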

Alright, so back to init images. We have direct_init_weight and semantic_init_weight, so when you give pytti an init image, you can tell it to use a reconstruction loss, or a semantic (CLIP) loss, or even both, or neither.
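
Conceptually (and only as a sketch, not pytti's actual code), those two settings just gate and scale the two loss terms described above:

```python
import torch.nn.functional as F

def init_image_loss(generated, generated_emb, init_image, init_emb,
                    direct_init_weight=0.0, semantic_init_weight=0.0):
    total = 0.0
    if direct_init_weight:    # reconstruction loss: hold pixels near the init image
        total = total + direct_init_weight * F.mse_loss(generated, init_image)
    if semantic_init_weight:  # CLIP loss: hold the *content* near the init image
        cos = F.cosine_similarity(generated_emb, init_emb, dim=-1).mean()
        total = total + semantic_init_weight * (1 - cos)
    return total              # both weights zero -> the init image only seeds frame 1
```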