TheodoreGalanos opened this issue 2 years ago
Is this a duplicate of #27? Is that what you mean?
Given an input video, produce outputs for each frame.
This feels like the opposite although I might not be understanding it correctly. I guess what I mean is "given an input image and prompt, produce a frame for a video", or something like:
prompt + image -> first frame
prompt + image -> last frame
first frame + last frame -> video
I hope I didn't miss anything in there :)
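To make the request concrete, here is a rough sketch of that flow. The keyframe half already works with the plain diffusers img2img pipeline; the final interpolation step is the part that would need support in this repo. Note the kwarg for the input image differs across diffusers versions (image vs init_image), and the last call is purely hypothetical.

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("input.png").convert("RGB").resize((512, 512))

# prompt + image -> first frame, prompt + image -> last frame
first_frame = img2img(prompt="prompt for the start", image=init_image, strength=0.6).images[0]
last_frame = img2img(prompt="prompt for the end", image=init_image, strength=0.6).images[0]

# first frame + last frame -> video: this is the missing piece, e.g. a hypothetical
# pipeline.walk_between_images(first_frame, last_frame, num_interpolation_steps=60)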
+1 to this feature, because sometimes we want to generate a video from existing photos rather than purely from text prompts.
Hmm wait, would this entail inversion like we used to do with StyleGAN models? Is the only way to do it through something like textual inversion?
In that case, I get why this might be a bit of an over-the-top demand :D
Ah I see. If I understand correctly, you'd like to use the same image at each timestep as the initial image? So not frame by frame, but rather starting with the same frame each time?
I think @TheodoreGalanos means adding init_image as in the img2img pipeline from the stable diffusion repo, which should be easy to add! I also feel this feature would be extremely useful.
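To sketch what that could mean inside this repo (this is not existing code, just a rough helper following the diffusers img2img prepare_latents logic; 0.18215 is the SD v1.x VAE scaling constant, and the function name is made up):

import numpy as np
import torch
from PIL import Image

def image_to_latents(pipe, pil_image, strength=0.8, num_inference_steps=50, generator=None):
    """Hypothetical helper: encode a PIL image into partially noised latents, img2img style."""
    # PIL -> normalized tensor in [-1, 1] with shape (1, 3, H, W); caller should resize to multiples of 8
    image = np.array(pil_image).astype(np.float32) / 255.0
    image = torch.from_numpy(image).permute(2, 0, 1).unsqueeze(0) * 2.0 - 1.0
    image = image.to(device=pipe.device, dtype=pipe.vae.dtype)

    # image -> VAE latents, scaled the way stable diffusion v1.x expects
    init_latents = pipe.vae.encode(image).latent_dist.sample(generator)
    init_latents = 0.18215 * init_latents

    # pick the timestep corresponding to `strength`, then noise the latents up to it
    pipe.scheduler.set_timesteps(num_inference_steps)
    t_start = max(num_inference_steps - int(num_inference_steps * strength), 0)
    timestep = pipe.scheduler.timesteps[t_start:t_start + 1].to(pipe.device)
    # a generator, if provided, should live on the same device as the latents
    noise = torch.randn(init_latents.shape, device=pipe.device, dtype=init_latents.dtype, generator=generator)
    return pipe.scheduler.add_noise(init_latents, noise, timestep)

The walk pipeline could then start each denoising call from these latents instead of pure noise.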
Echoing this; essentially what I think we mean is:
either init with an image and walk to a separate prompt, and/or start with a prompt and walk to an image.
Echo that this would be super useful!!
Agree that this would be super useful!
The first element of the prompts list can correspond to the init-img and the rest of the prompts can correspond to the generated images
I played around with this idea in Colab, and it does indeed work. We'd need to add a new img2img pipeline, though, and decide exactly what it should do; there are many different options. I would prefer that the same init image be used on each call to the __call__ fn. wdyt?
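To make that concrete, here is a rough sketch of what the caller-side usage could look like, assuming walk() keeps its current prompts/seeds interface; the init_image argument is purely hypothetical at this point:

import torch
from PIL import Image
from stable_diffusion_videos import StableDiffusionWalkPipeline

pipeline = StableDiffusionWalkPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Hypothetical: the same init image is reused for every frame, i.e. every call to
# __call__ starts from the (noised) latents of this image rather than pure noise.
pipeline.walk(
    prompts=["wendy looking up at the night sky", "a starry impressionist night sky"],
    seeds=[42, 1337],
    num_interpolation_steps=30,
    init_image=Image.open("input.png").convert("RGB"),  # does not exist yet
    output_dir="dreams",
)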
+1 to this feature. Does anyone know how to customize the Colab to do this?
Being able to pass in an initial image would really help with usability -- a person wouldn't have to exhaustively try seeds or text prompts with a bunch of different styles / subjects.
One way to implement this is to follow what the img2img pipeline does, which is to encode the initial image into latents and pass those into the StableDiffusionWalkPipeline. I tried doing this in the attached code, but it didn't really work.
Code for Reproduction
"""
Generating latents to pass into the StableDiffusionWalkPipeline as per the Img2Img Pipeline
"""
import torch
from stable_diffusion_videos import StableDiffusionWalkPipeline
from typing import List, Optional, Tuple, Union
import PIL
from PIL import Image
import numpy as np
from diffusers.utils import deprecate, logging

logger = logging.get_logger(__name__)  # used by randn_tensor below

def randn_tensor(
shape: Union[Tuple, List],
generator: Optional[Union[List["torch.Generator"], "torch.Generator"]] = None,
device: Optional["torch.device"] = None,
dtype: Optional["torch.dtype"] = None,
layout: Optional["torch.layout"] = None,
):
"""This is a helper function that allows to create random tensors on the desired `device` with the desired `dtype`. When
passing a list of generators one can seed each batched size individually. If CPU generators are passed the tensor
will always be created on CPU.
"""
# device on which tensor is created defaults to device
rand_device = device
batch_size = shape[0]
layout = layout or torch.strided
device = device or torch.device("cpu")
if generator is not None:
gen_device_type = generator.device.type if not isinstance(generator, list) else generator[0].device.type
if gen_device_type != device.type and gen_device_type == "cpu":
rand_device = "cpu"
if device != "mps":
logger.info(
f"The passed generator was created on 'cpu' even though a tensor on {device} was expected."
f" Tensors will be created on 'cpu' and then moved to {device}. Note that one can probably"
f" slighly speed up this function by passing a generator that was created on the {device} device."
)
elif gen_device_type != device.type and gen_device_type == "cuda":
raise ValueError(f"Cannot generate a {device} tensor from a generator of type {gen_device_type}.")
if isinstance(generator, list):
shape = (1,) + shape[1:]
latents = [
torch.randn(shape, generator=generator[i], device=rand_device, dtype=dtype, layout=layout)
for i in range(batch_size)
]
latents = torch.cat(latents, dim=0).to(device)
else:
latents = torch.randn(shape, generator=generator, device=rand_device, dtype=dtype, layout=layout).to(device)
return latents
def prepare_latents(pipeline, image, timestep, batch_size, num_images_per_prompt, dtype, device, generator=None):
if not isinstance(image, (torch.Tensor, PIL.Image.Image, list)):
raise ValueError(
f"`image` has to be of type `torch.Tensor`, `PIL.Image.Image` or list but is {type(image)}"
)
image = image.to(device=device, dtype=dtype)
batch_size = batch_size * num_images_per_prompt
if isinstance(generator, list) and len(generator) != batch_size:
raise ValueError(
f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
f" size of {batch_size}. Make sure the batch size matches the length of the generators."
)
if isinstance(generator, list):
init_latents = [
pipeline.vae.encode(image[i : i + 1]).latent_dist.sample(generator[i]) for i in range(batch_size)
]
init_latents = torch.cat(init_latents, dim=0)
else:
init_latents = pipeline.vae.encode(image).latent_dist.sample(generator)
# init_latents = pipeline.vae.config.scaling_factor * init_latents
if batch_size > init_latents.shape[0] and batch_size % init_latents.shape[0] == 0:
# expand init_latents for batch_size
deprecation_message = (
f"You have passed {batch_size} text prompts (`prompt`), but only {init_latents.shape[0]} initial"
" images (`image`). Initial images are now duplicating to match the number of text prompts. Note"
" that this behavior is deprecated and will be removed in a version 1.0.0. Please make sure to update"
" your script to pass as many initial images as text prompts to suppress this warning."
)
deprecate("len(prompt) != len(image)", "1.0.0", deprecation_message, standard_warn=False)
additional_image_per_prompt = batch_size // init_latents.shape[0]
init_latents = torch.cat([init_latents] * additional_image_per_prompt, dim=0)
elif batch_size > init_latents.shape[0] and batch_size % init_latents.shape[0] != 0:
raise ValueError(
f"Cannot duplicate `image` of batch size {init_latents.shape[0]} to {batch_size} text prompts."
)
else:
init_latents = torch.cat([init_latents], dim=0)
shape = init_latents.shape
noise = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
# get latents
init_latents = pipeline.scheduler.add_noise(init_latents, noise, timestep)
latents = init_latents
return latents
def preprocess(image):
if isinstance(image, torch.Tensor):
return image
elif isinstance(image, PIL.Image.Image):
image = [image]
if isinstance(image[0], PIL.Image.Image):
w, h = image[0].size
w, h = map(lambda x: x - x % 8, (w, h)) # resize to integer multiple of 8
image = [np.array(i.resize((w, h)))[None, :] for i in image]
image = np.concatenate(image, axis=0)
image = np.array(image).astype(np.float32) / 255.0
image = image.transpose(0, 3, 1, 2)
image = 2.0 * image - 1.0
image = torch.from_numpy(image)
elif isinstance(image[0], torch.Tensor):
image = torch.cat(image, dim=0)
return image
with torch.no_grad():
pipeline = StableDiffusionWalkPipeline.from_pretrained(
"CompVis/stable-diffusion-v1-4",
torch_dtype=torch.float16,
revision="fp16",
safety_checker=None,
).to("cuda")
    # Open the input image, then resize and preprocess it
    init_image = Image.open("input.png").convert("RGB")
    init_image = init_image.resize((384, 384))
    image = preprocess(init_image)
# Set pipeline call parameters
num_inference_steps = 50
prompt = "wendy from peter pan looking up at night sky, impressionism"
num_images_per_prompt = 1
batch_size = 1
    device = torch.device("cuda:0")
    # Set the timesteps (set_timesteps returns None; the schedule is stored on the scheduler)
    pipeline.scheduler.set_timesteps(num_inference_steps)
    timesteps = pipeline.scheduler.timesteps
# Get datatype for latents
embeds_a = pipeline.embed_text(prompt)
latents_dtype = embeds_a.dtype
# Get latents
latent_timestep = timesteps[:1].repeat(batch_size * num_images_per_prompt)
latents = prepare_latents(pipeline, image, latent_timestep, batch_size, num_images_per_prompt, latents_dtype, device, generator=None)
# call pipeline using the latents from the initial image
    images = pipeline(prompt, height=384, width=384, strength=0.8, guidance_scale=0.8, generator=None, num_inference_steps=num_inference_steps, latents=latents)
images[0][0].save("output.png")
Example Input and Output (the input is the "input.png" needed for the above code to run). The second image is slightly smaller at 384x384, but only because I resized it.
Desired Behavior (as per the img2img pipeline)
@nateraw If this worked for you in a Colab, do you have any insights? If anyone else has any thoughts on why passing in the latents doesn't seem to work the way it does in the img2img pipeline I tried to follow, let me know. :)
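One guess, not verified: the commented-out scaling line in prepare_latents above may matter. The diffusers img2img pipeline multiplies the VAE latents by the scaling factor (0.18215 for the SD v1.x VAE) before adding noise, so skipping it leaves the latents at the wrong magnitude. Something like this inside prepare_latents, reusing the names already defined there:

    # Scale the VAE latents as the img2img pipeline does before noising them
    scaling_factor = getattr(pipeline.vae.config, "scaling_factor", 0.18215)  # 0.18215 for SD v1.x
    init_latents = scaling_factor * init_latents
    noise = randn_tensor(init_latents.shape, generator=generator, device=device, dtype=dtype)
    init_latents = pipeline.scheduler.add_noise(init_latents, noise, timestep)

Another possibility is that the walk pipeline's __call__ treats the latents argument as pure starting noise and runs the full schedule rather than starting at latent_timestep, in which case strength would have no effect.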
Would it be possible in the current implementation to also add an image alongside a prompt as the "seed package" for the frame?