madebyollin / taesd

Tiny AutoEncoder for Stable Diffusion
MIT License

What is the `scaling_factor`? #3

Closed · sayakpaul closed this issue 1 year ago

sayakpaul commented 1 year ago

We have latent_shift and latent_magnitude values here:

https://github.com/madebyollin/taesd/blob/main/taesd.py#L44C1-L45C23

But is there a scaling_factor as well, or is it just 1?

By scaling_factor I mean the one defined in https://github.com/huggingface/diffusers/blob/ea5b0575f8f91b76f32fb6f6930c0bc30e42865e/src/diffusers/models/autoencoder_kl.py#L61.
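For context, this is roughly how that scaling_factor is used around the standard AutoencoderKL in the pipelines (a simplified sketch, not the exact pipeline code; the random input is just a placeholder):

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/stable-diffusion-2-1-base", subfolder="vae")

# placeholder input image tensor in [-1, 1]
image = torch.randn(1, 3, 512, 512)

# encode: the sampled latents are multiplied by the config's scaling_factor
latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor

# decode: the scaling is undone before running the decoder
decoded = vae.decode(latents / vae.config.scaling_factor).sample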

madebyollin commented 1 year ago

There is no scaling_factor for TAESD - TAESD directly converts SD(XL) latents into RGB images in [0, 1] (see the usage in the example notebook). So if you need to specify a value, you can probably set it to 1.0.

(The latent_shift and latent_magnitude values in taesd.py are only relevant if you want to store latents into RGBA PNG files - sorry for the confusion.)
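(Roughly, those values are used along these lines when packing latents into and out of the [0, 1] range for PNG storage; a simplified sketch of the helpers in taesd.py, with the values as currently in that file:)

import torch

latent_magnitude = 3   # value from taesd.py
latent_shift = 0.5     # value from taesd.py

def scale_latents(x: torch.Tensor) -> torch.Tensor:
    # raw latents -> [0, 1], so they can be stored as an RGBA PNG
    return x.div(2 * latent_magnitude).add(latent_shift).clamp(0, 1)

def unscale_latents(x: torch.Tensor) -> torch.Tensor:
    # [0, 1] -> raw latents
    return x.sub(latent_shift).mul(2 * latent_magnitude)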

sayakpaul commented 1 year ago

Thanks for your reply!

I am trying to integrate your work into diffusers so that users can use it very easily (crediting this repository, of course).

With the following code (diffusers was installed using pip install git+https://github.com/huggingface/diffusers@feat/tiny-autoenc):

import torch
from diffusers import DiffusionPipeline, TinyAutoencoder

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
)
pipe.vae = TinyAutoencoder.from_pretrained("sayakpaul/taesd-diffusers", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "slice of delicious New York-style berry cheesecake"
image = pipe(prompt, num_inference_steps=25, height=512, width=512, guidance_scale=3.0).images[0]
image

I am getting:

[generated image]

Is the quality somewhat expected?

To give you some more context, here's what we do in the standard pipelines (roughly sketched in code right after the list below).

After we get the latents from the UNet,

  1. We first decode them: https://github.com/huggingface/diffusers/blob/ea5b0575f8f91b76f32fb6f6930c0bc30e42865e/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L697.
  2. And then run it through the postprocessor: https://github.com/huggingface/diffusers/blob/ea5b0575f8f91b76f32fb6f6930c0bc30e42865e/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L708.
  3. Then we denormalize the image: https://github.com/huggingface/diffusers/blob/ea5b0575f8f91b76f32fb6f6930c0bc30e42865e/src/diffusers/image_processor.py#L240.
  4. And then generate the final PIL image.
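In code, those steps look roughly like this (an illustrative sketch, not the exact diffusers code; it assumes `pipe` from the snippet above with a standard AutoencoderKL as pipe.vae, and `latents` being the output of the denoising loop):

from PIL import Image

# 1. decode the latents (standard VAE: divide by the scaling_factor first)
image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor, return_dict=False)[0]

# 2./3. postprocess: denormalize from [-1, 1] to [0, 1]
image = (image / 2 + 0.5).clamp(0, 1)

# 4. convert to a PIL image
image = image.cpu().permute(0, 2, 3, 1).float().numpy()
pil_image = Image.fromarray((image[0] * 255).round().astype("uint8"))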

From your example notebook, comparing this line:

res_taesd = taesd_dec(latents).cpu().permute(0, 2, 3, 1).float().clamp(0, 1).numpy()

to this one in diffusers, it feels like the additional (images / 2 + 0.5) denormalization is not required for the tiny autoencoder?

Would be amazing to get your thoughts here.

sayakpaul commented 1 year ago

to this one in diffusers, it feels like the additional (images / 2 + 0.5) denormalization is not required for the tiny autoencoder?

Seems like it's indeed the case.

When I do:

import PIL.Image

pipe.vae = TinyAutoencoder.from_pretrained(
    "sayakpaul/taesd-diffusers", torch_dtype=torch.float16
).to("cuda")
latents = pipe(
    prompt, num_inference_steps=25, height=512, width=512, guidance_scale=3.0,
    generator=torch.manual_seed(0), output_type="latent"
).images

decoded_image = pipe.vae.decode(
    latents / pipe.vae.config.scaling_factor, return_dict=False
)[0]
decoded_image = decoded_image.permute(0, 2, 3, 1).float().clamp(0, 1).cpu().detach().numpy().squeeze(0)

PIL.Image.fromarray((decoded_image * 255).round().astype("uint8"))

With this, I am getting:

[generated image after the change]

sayakpaul commented 1 year ago

When I use the original VAE, I get:

from diffusers import AutoencoderKL

original_vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", subfolder="vae", torch_dtype=torch.float16
).to("cuda")
pipe.vae = original_vae

prompt = "slice of delicious New York-style berry cheesecake"
image = pipe(
    prompt, num_inference_steps=25, height=512, width=512, guidance_scale=3.0,
    generator=torch.manual_seed(0)
).images[0]
image

[generated image with the original VAE]

sayakpaul commented 1 year ago

Closing the issue.

madebyollin commented 1 year ago

Yup, TAESD directly predicts values in [0, 1] so you don't need the additional denormalization step (though clamping is still recommended). The image here looks correct to me 👍
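(A rough sketch of the difference, with placeholder tensors standing in for the two decoder outputs:)

import torch

x_kl = torch.randn(1, 3, 512, 512)    # placeholder: AutoencoderKL decoder output, roughly in [-1, 1]
x_taesd = torch.rand(1, 3, 512, 512)  # placeholder: TAESD decoder output, roughly in [0, 1]

# standard SD VAE: denormalize from [-1, 1] to [0, 1], then clamp
image_kl = (x_kl / 2 + 0.5).clamp(0, 1)

# TAESD: already in [0, 1], so clamping is the only step needed
image_taesd = x_taesd.clamp(0, 1)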