
Denoising SDXL iteration images for coherent image previews that a user could understand #7001

Open MonkeeMan1 opened 9 months ago

MonkeeMan1 commented 9 months ago

Hello,

I'm currently trying to create image previews with SDXL. This works! However, the image outputs are very noisy. A long time ago I found a solution to this for SD 1.5, but unfortunately it has been lost to time.

How would I go about denoising these images so they are a little more coherent to a human viewer? I know the first couple of iterations are always going to be very noisy, but eventually it should be possible to convert this noise into a blurry image that a human could understand.

import os
import time

import torch
from diffusers import StableDiffusionXLPipeline

# Make sure the preview output directory exists before saving per-step images.
os.makedirs("imgs", exist_ok=True)

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

def callback(pipe, step_index, timestep, callback_kwargs):
    # Runs at the end of every denoising step: decode the current latents
    # into a preview image and save it to disk.
    latents = callback_kwargs.get("latents")

    start_time = time.time()
    with torch.no_grad():
        # The SDXL VAE is numerically unstable in float16, so upcast it before decoding.
        pipe.upcast_vae()
        latents = latents.to(
            next(iter(pipe.vae.post_quant_conv.parameters())).dtype)
        images = pipe.vae.decode(
            latents / pipe.vae.config.scaling_factor, return_dict=False)[0]
        images = pipe.image_processor.postprocess(images, output_type='pil')

        images[0].save(f"./imgs/{step_index}.png")

    end_time = time.time()

    print(f"Time taken to generate image: {end_time - start_time} seconds")

    return callback_kwargs

pipe(prompt=prompt, callback_on_step_end=callback)

yiyixuxu commented 9 months ago

Is this what you're looking for? https://github.com/huggingface/diffusers/discussions/6991#discussioncomment-8491149

For questions like this, can we use Discussions in the future? https://github.com/huggingface/diffusers/discussions

MonkeeMan1 commented 9 months ago

Hey, thank you very much for the reply. I apologise for posting this question in the wrong place; in the future I will definitely use the Discussions section.

Unfortunately this isn't quite what I'm looking for. The image previews with this solution are still very noisy. A previous solution I had with sd 1.5 looked like the image attached below:

(image attachment: a blurry but noise-free mid-generation preview)

This would be ideal as it is really amazing to see how the images improve with no noise in them.

CoffeeVampir3 commented 9 months ago

Heya @MonkeeMan1, I think I have an example of what you're asking about here: https://gist.github.com/CoffeeVampir3/610e4627042ac8f36b45da6ec3af776f This notebook is a bit old, so it may not run, but it should serve as an example of how to do it.

Basically there's one extra step where you decode the latents at each step. This can be kind of slow, so this example uses the TAESDXL VAE decoder: https://github.com/madebyollin/taesd

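For reference, here is a minimal sketch of that per-step decode using the callback_on_step_end API. It assumes the SDXL-specific madebyollin/taesdxl checkpoint (the SD 1.5 taesd weights target a different latent space); the preview_vae / preview_callback names and the imgs/ output directory are only illustrative:

import os

import torch
from diffusers import StableDiffusionXLPipeline, AutoencoderTiny

os.makedirs("imgs", exist_ok=True)  # illustrative output directory

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16,
    variant="fp16", use_safetensors=True,
).to("cuda")

# Tiny VAE trained for SDXL latents (assumption: the SDXL checkpoint is the
# right one here, not the SD 1.5 "taesd" weights).
preview_vae = AutoencoderTiny.from_pretrained(
    "madebyollin/taesdxl", torch_dtype=torch.float16
).to("cuda")

def preview_callback(pipe, step_index, timestep, callback_kwargs):
    latents = callback_kwargs["latents"]
    with torch.no_grad():
        # AutoencoderTiny uses scaling_factor=1.0, so the loop latents can be
        # decoded directly; the decoded tensor is in [-1, 1].
        decoded = preview_vae.decode(latents).sample
    # Reuse the pipeline's image processor to turn the tensor into a PIL image.
    preview = pipe.image_processor.postprocess(decoded, output_type="pil")[0]
    preview.save(f"./imgs/preview_{step_index:03d}.png")
    return callback_kwargs

image = pipe(
    prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
    num_inference_steps=20,
    callback_on_step_end=preview_callback,
    callback_on_step_end_tensor_inputs=["latents"],
).images[0]
image.save("./imgs/final.png")

Since AutoencoderTiny is configured with scaling_factor=1.0, the loop latents are decoded directly, and the decode is far cheaper than upcasting and running the full SDXL VAE every step.
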
MonkeeMan1 commented 9 months ago

Hey, thank you very much for the response. TAESDXL definitely looks like what I'm looking for. However, the implementation you sent doesn't quite seem to work for me. I may just be making a mistake, so apologies if that's the case. I've taken out the relevant parts (I think), and this should work as far as I understand it.

The output from this code can be seen below:

import io
import os

import numpy as np
import torch
from diffusers import DiffusionPipeline, LMSDiscreteScheduler, AutoencoderTiny
from PIL import Image

# Make sure the preview output directory exists before saving per-step images.
os.makedirs("imgs", exist_ok=True)

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16,
).to("cuda")

scheduler = LMSDiscreteScheduler(
    beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)

# Tiny autoencoder for fast per-step decoding (this "taesd" checkpoint is the SD 1.5 one).
TINY_AUTOENCODER = AutoencoderTiny.from_pretrained(
    "madebyollin/taesd", torch_dtype=torch.float16)
TINY_AUTOENCODER.to("cuda")

prompt = "A capybara holding a sword whilst wearing a knight's costume"

def to_png_image(img_np):
    """Convert a numpy array to PNG format image."""
    img = Image.fromarray((img_np * 255).astype(np.uint8))
    buf = io.BytesIO()
    img.save(buf, format='png', compress_level=0)
    return buf.getvalue()

def decode_tensors(pipe, step, timestep, callback_kwargs):
    # Decode the current latents with the tiny VAE and save a preview image.
    latents = callback_kwargs["latents"]
    img = TINY_AUTOENCODER.decode(latents)
    # The decoded tensor is in [-1, 1]; convert to HWC float32 in [0, 1].
    img_np = img[0].squeeze(0).permute(
        1, 2, 0).cpu().detach().numpy().astype('float32')
    img_np = np.clip((img_np + 1) / 2.0, 0, 1)
    buf = to_png_image(img_np)
    with open(f"./imgs/{step}.png", 'wb') as f:
        f.write(buf)

    return callback_kwargs

image = pipe(
    height=1024,
    width=1024,
    prompt=prompt,
    negative_prompt="",
    guidance_scale=7.5,
    num_inference_steps=20,
    callback_on_step_end=decode_tensors,
    callback_on_step_end_tensor_inputs=["latents"],
).images[0]

image.save("./imgs/final.png")

Output: (image attachment: still a very noisy preview)

The desired output should resemble a blurry image, something like this: (image attachment)

MonkeeMan1 commented 9 months ago

Hi, I am still looking for a solution to this problem if anybody could help :)

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.