TheodoreGalanos opened this issue 2 years ago
Is this a duplicate of #27? Is that what you mean?
Given an input video, produce outputs for each frame.
This feels like the opposite although I might not be understanding it correctly. I guess what I mean is "given an input image and prompt, produce a frame for a video", or something like:
prompt + image -> first frame
prompt + image -> last frame
first frame + last frame -> video
I hope I didn't miss anything in there :)
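To make the request concrete, here is a rough sketch of that flow. The keyframe half already works with the plain diffusers img2img pipeline; the final interpolation step is the part that would need support in this repo. Note the kwarg for the input image differs across diffusers versions (image vs init_image), and the last call is purely hypothetical.

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("input.png").convert("RGB").resize((512, 512))

# prompt + image -> first frame, prompt + image -> last frame
first_frame = img2img(prompt="prompt for the start", image=init_image, strength=0.6).images[0]
last_frame = img2img(prompt="prompt for the end", image=init_image, strength=0.6).images[0]

# first frame + last frame -> video: this is the missing piece, e.g. a hypothetical
# pipeline.walk_between_images(first_frame, last_frame, num_interpolation_steps=60)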
+1 to this feature, because sometimes we want to generate a video from existing photos rather than purely from text prompts.
Hmm wait, would this entail inversion like we used to do with StyleGAN models? Is the only way to do it through something like textual inversion?
In that case, I get why this might be a bit of an over-the-top demand :D
Ah I see. If I understand correctly, you'd like to use the same image at each timestep as the initial image? So not frame by frame, but rather starting with the same frame each time?
I think @TheodoreGalanos means adding init_image as in the img2img pipeline from the stable diffusion repo, which should be easy to add! I also feel this feature would be extremely useful.
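To sketch what that could mean inside this repo (this is not existing code, just a rough helper following the diffusers img2img prepare_latents logic; 0.18215 is the SD v1.x VAE scaling constant, and the function name is made up):

import numpy as np
import torch
from PIL import Image

def image_to_latents(pipe, pil_image, strength=0.8, num_inference_steps=50, generator=None):
    """Hypothetical helper: encode a PIL image into partially noised latents, img2img style."""
    # PIL -> normalized tensor in [-1, 1] with shape (1, 3, H, W); caller should resize to multiples of 8
    image = np.array(pil_image).astype(np.float32) / 255.0
    image = torch.from_numpy(image).permute(2, 0, 1).unsqueeze(0) * 2.0 - 1.0
    image = image.to(device=pipe.device, dtype=pipe.vae.dtype)

    # image -> VAE latents, scaled the way stable diffusion v1.x expects
    init_latents = pipe.vae.encode(image).latent_dist.sample(generator)
    init_latents = 0.18215 * init_latents

    # pick the timestep corresponding to `strength`, then noise the latents up to it
    pipe.scheduler.set_timesteps(num_inference_steps)
    t_start = max(num_inference_steps - int(num_inference_steps * strength), 0)
    timestep = pipe.scheduler.timesteps[t_start:t_start + 1].to(pipe.device)
    # a generator, if provided, should live on the same device as the latents
    noise = torch.randn(init_latents.shape, device=pipe.device, dtype=init_latents.dtype, generator=generator)
    return pipe.scheduler.add_noise(init_latents, noise, timestep)

The walk pipeline could then start each denoising call from these latents instead of pure noise.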
Echoing this; essentially what I think we mean is:
either init with an image and walk to a separate prompt, and/or start with a prompt and walk to an image.
Echo that this would be super useful!!
Agree that this would be super useful!
The first element of the prompts list can correspond to the init-img and the rest of the prompts can correspond to the generated images
I played around with this idea in Colab, and it does indeed work. We'd need to add a new img2img pipeline, though, and decide exactly what it should do; there are many different options. I would prefer that the same init image be used on each call to the __call__ fn. wdyt?
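To make that concrete, here is a rough sketch of what the caller-side usage could look like, assuming walk() keeps its current prompts/seeds interface; the init_image argument is purely hypothetical at this point:

import torch
from PIL import Image
from stable_diffusion_videos import StableDiffusionWalkPipeline

pipeline = StableDiffusionWalkPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Hypothetical: the same init image is reused for every frame, i.e. every call to
# __call__ starts from the (noised) latents of this image rather than pure noise.
pipeline.walk(
    prompts=["wendy looking up at the night sky", "a starry impressionist night sky"],
    seeds=[42, 1337],
    num_interpolation_steps=30,
    init_image=Image.open("input.png").convert("RGB"),  # does not exist yet
    output_dir="dreams",
)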
+1 to this feature. Does anyone know how to customize the Colab to do this?
Being able to pass in an initial image would really help with usability -- a person wouldn't have to exhaustively try seeds or text prompts with a bunch of different styles / subjects.
One way to implement this is to follow what the img2img pipeline does, which is to encode the initial image into latents and pass those into the StableDiffusionWalkPipeline. I tried doing this in the attached code, but it didn't really work.
Code for Reproduction
"""
Generating latents to pass into the StableDiffusionWalkPipeline as per the Img2Img Pipeline
"""
import torch
from stable_diffusion_videos import StableDiffusionWalkPipeline
from typing import List, Optional, Tuple, Union
import PIL
from PIL import Image
import numpy as np
from diffusers.utils import deprecate, logging

logger = logging.get_logger(__name__)  # used by randn_tensor below

def randn_tensor(
shape: Union[Tuple, List],
generator: Optional[Union[List["torch.Generator"], "torch.Generator"]] = None,
device: Optional["torch.device"] = None,
dtype: Optional["torch.dtype"] = None,
layout: Optional["torch.layout"] = None,
):
"""This is a helper function that allows to create random tensors on the desired `device` with the desired `dtype`. When
passing a list of generators one can seed each batched size individually. If CPU generators are passed the tensor
will always be created on CPU.
"""
# device on which tensor is created defaults to device
rand_device = device
batch_size = shape[0]
layout = layout or torch.strided
device = device or torch.device("cpu")
if generator is not None:
gen_device_type = generator.device.type if not isinstance(generator, list) else generator[0].device.type
if gen_device_type != device.type and gen_device_type == "cpu":
rand_device = "cpu"
if device != "mps":
logger.info(
f"The passed generator was created on 'cpu' even though a tensor on {device} was expected."
f" Tensors will be created on 'cpu' and then moved to {device}. Note that one can probably"
f" slighly speed up this function by passing a generator that was created on the {device} device."
)
elif gen_device_type != device.type and gen_device_type == "cuda":
raise ValueError(f"Cannot generate a {device} tensor from a generator of type {gen_device_type}.")
if isinstance(generator, list):
shape = (1,) + shape[1:]
latents = [
torch.randn(shape, generator=generator[i], device=rand_device, dtype=dtype, layout=layout)
for i in range(batch_size)
]
latents = torch.cat(latents, dim=0).to(device)
else:
latents = torch.randn(shape, generator=generator, device=rand_device, dtype=dtype, layout=layout).to(device)
return latents
def prepare_latents(pipeline, image, timestep, batch_size, num_images_per_prompt, dtype, device, generator=None):
if not isinstance(image, (torch.Tensor, PIL.Image.Image, list)):
raise ValueError(
f"`image` has to be of type `torch.Tensor`, `PIL.Image.Image` or list but is {type(image)}"
)
image = image.to(device=device, dtype=dtype)
batch_size = batch_size * num_images_per_prompt
if isinstance(generator, list) and len(generator) != batch_size:
raise ValueError(
f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
f" size of {batch_size}. Make sure the batch size matches the length of the generators."
)
if isinstance(generator, list):
init_latents = [
pipeline.vae.encode(image[i : i + 1]).latent_dist.sample(generator[i]) for i in range(batch_size)
]
init_latents = torch.cat(init_latents, dim=0)
else:
init_latents = pipeline.vae.encode(image).latent_dist.sample(generator)
# init_latents = pipeline.vae.config.scaling_factor * init_latents
if batch_size > init_latents.shape[0] and batch_size % init_latents.shape[0] == 0:
# expand init_latents for batch_size
deprecation_message = (
f"You have passed {batch_size} text prompts (`prompt`), but only {init_latents.shape[0]} initial"
" images (`image`). Initial images are now duplicating to match the number of text prompts. Note"
" that this behavior is deprecated and will be removed in a version 1.0.0. Please make sure to update"
" your script to pass as many initial images as text prompts to suppress this warning."
)
deprecate("len(prompt) != len(image)", "1.0.0", deprecation_message, standard_warn=False)
additional_image_per_prompt = batch_size // init_latents.shape[0]
init_latents = torch.cat([init_latents] * additional_image_per_prompt, dim=0)
elif batch_size > init_latents.shape[0] and batch_size % init_latents.shape[0] != 0:
raise ValueError(
f"Cannot duplicate `image` of batch size {init_latents.shape[0]} to {batch_size} text prompts."
)
else:
init_latents = torch.cat([init_latents], dim=0)
shape = init_latents.shape
noise = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
# get latents
init_latents = pipeline.scheduler.add_noise(init_latents, noise, timestep)
latents = init_latents
return latents
def preprocess(image):
if isinstance(image, torch.Tensor):
return image
elif isinstance(image, PIL.Image.Image):
image = [image]
if isinstance(image[0], PIL.Image.Image):
w, h = image[0].size
w, h = map(lambda x: x - x % 8, (w, h)) # resize to integer multiple of 8
image = [np.array(i.resize((w, h)))[None, :] for i in image]
image = np.concatenate(image, axis=0)
image = np.array(image).astype(np.float32) / 255.0
image = image.transpose(0, 3, 1, 2)
image = 2.0 * image - 1.0
image = torch.from_numpy(image)
elif isinstance(image[0], torch.Tensor):
image = torch.cat(image, dim=0)
return image
with torch.no_grad():
pipeline = StableDiffusionWalkPipeline.from_pretrained(
"CompVis/stable-diffusion-v1-4",
torch_dtype=torch.float16,
revision="fp16",
safety_checker=None,
).to("cuda")
    # Open the input image, then resize and preprocess it
    init_image = Image.open("input.png").convert("RGB")
    init_image = init_image.resize((384, 384))
    image = preprocess(init_image)
# Set pipeline call parameters
num_inference_steps = 50
prompt = "wendy from peter pan looking up at night sky, impressionism"
num_images_per_prompt = 1
batch_size = 1
    device = torch.device("cuda:0")
    # Set the timesteps (set_timesteps returns None; the schedule is stored on the scheduler)
    pipeline.scheduler.set_timesteps(num_inference_steps)
    timesteps = pipeline.scheduler.timesteps
# Get datatype for latents
embeds_a = pipeline.embed_text(prompt)
latents_dtype = embeds_a.dtype
# Get latents
latent_timestep = timesteps[:1].repeat(batch_size * num_images_per_prompt)
latents = prepare_latents(pipeline, image, latent_timestep, batch_size, num_images_per_prompt, latents_dtype, device, generator=None)
# call pipeline using the latents from the initial image
    images = pipeline(prompt, height=384, width=384, strength=0.8, guidance_scale=0.8, generator=None, num_inference_steps=num_inference_steps, latents=latents)
images[0][0].save("output.png")
Example Input and Output (the input is the "input.png" needed for the above code to run). The second image is slightly smaller at 384x384, but only because I resized it.
Desired Behavior (as per the img2img pipeline)
@nateraw If this worked for you in a Colab, do you have any insights? If anyone else has any thoughts on why passing in the latents doesn't seem to work the way it does in the img2img pipeline I tried to follow, let me know. :)
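One guess, not verified: the commented-out scaling line in prepare_latents above may matter. The diffusers img2img pipeline multiplies the VAE latents by the scaling factor (0.18215 for the SD v1.x VAE) before adding noise, so skipping it leaves the latents at the wrong magnitude. Something like this inside prepare_latents, reusing the names already defined there:

    # Scale the VAE latents as the img2img pipeline does before noising them
    scaling_factor = getattr(pipeline.vae.config, "scaling_factor", 0.18215)  # 0.18215 for SD v1.x
    init_latents = scaling_factor * init_latents
    noise = randn_tensor(init_latents.shape, generator=generator, device=device, dtype=dtype)
    init_latents = pipeline.scheduler.add_noise(init_latents, noise, timestep)

Another possibility is that the walk pipeline's __call__ treats the latents argument as pure starting noise and runs the full schedule rather than starting at latent_timestep, in which case strength would have no effect.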
Would it be possible in the current implementation to also add an image alongside a prompt as the "seed package" for the frame?