There is no scaling_factor for TAESD - TAESD directly converts SD(XL) latents into RGB images in [0, 1] (see the usage in the example notebook). So if you need to specify a value, you can probably set it to 1.0.
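For illustration, a minimal sketch using the taesd_dec decoder from the example notebook (the variable names are taken from that notebook, not from a fixed API):

# TAESD maps latents straight to RGB in [0, 1]; no scaling_factor is involved
rgb = taesd_dec(latents).clamp(0, 1)
# if an integration insists on a scaling_factor, 1.0 leaves the latents untouched:
# rgb = taesd_dec(latents / 1.0).clamp(0, 1)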
(The latent_shift and latent_magnitude values in taesd.py are only relevant if you want to store latents into RGBA PNG files - sorry for the confusion.)
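For completeness, here is a rough sketch of that latent <-> PNG mapping (assuming latent_magnitude = 3 and latent_shift = 0.5, as in taesd.py; purely illustrative):

import torch

latent_magnitude = 3   # assumed value from taesd.py
latent_shift = 0.5     # assumed value from taesd.py

def scale_latents(x: torch.Tensor) -> torch.Tensor:
    # raw latents -> [0, 1], so they can be written out as PNG pixel values
    return x.div(2 * latent_magnitude).add(latent_shift).clamp(0, 1)

def unscale_latents(x: torch.Tensor) -> torch.Tensor:
    # [0, 1] PNG pixel values -> (approximately) raw latents
    return x.sub(latent_shift).mul(2 * latent_magnitude)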
Thanks for your reply! I am trying to integrate your work into diffusers so that users can use it very easily (of course, crediting this repository). With the following code (diffusers was installed using pip install git+https://github.com/huggingface/diffusers@feat/tiny-autoenc):
import torch
from diffusers import DiffusionPipeline, TinyAutoencoder
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
)
pipe.vae = TinyAutoencoder.from_pretrained("sayakpaul/taesd-diffusers", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "slice of delicious New York-style berry cheesecake"
image = pipe(prompt, num_inference_steps=25, height=512, width=512, guidance_scale=3.0).images[0]
image
I am getting:
Is the quality somewhat expected?
To give you some more context, here's what we do in the standard pipeline settings: after we get the latents from the UNet, we scale them by 1 / scaling_factor, decode them with the VAE, and then denormalize the decoded image with (image / 2 + 0.5).
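Roughly (a sketch of the standard decode path, not the exact pipeline source):

latents = latents / pipe.vae.config.scaling_factor        # undo the latent scaling
image = pipe.vae.decode(latents, return_dict=False)[0]    # KL VAE output lives roughly in [-1, 1]
image = (image / 2 + 0.5).clamp(0, 1)                     # denormalize to [0, 1]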
From your example notebook, comparing this line:
res_taesd = taesd_dec(latents).cpu().permute(0, 2, 3, 1).float().clamp(0, 1).numpy()
to this one in diffusers, it feels like the additional (images / 2 + 0.5) is not required for the tiny autoencoder?
Would be amazing to get your thoughts here.
Seems like it's indeed the case that the additional (images / 2 + 0.5) is not required for the tiny autoencoder.
When I do:
import PIL.Image

pipe.vae = TinyAutoencoder.from_pretrained(
    "sayakpaul/taesd-diffusers", torch_dtype=torch.float16
).to("cuda")

latents = pipe(
    prompt, num_inference_steps=25, height=512, width=512, guidance_scale=3.0,
    generator=torch.manual_seed(0), output_type="latent"
).images

decoded_image = pipe.vae.decode(
    latents / pipe.vae.config.scaling_factor, return_dict=False
)[0]
decoded_image = decoded_image.permute(0, 2, 3, 1).float().clamp(0, 1).cpu().detach().numpy().squeeze(0)
PIL.Image.fromarray((decoded_image * 255).round().astype("uint8"))
With this, I am getting:
When I use the original VAE, I get:
from diffusers import AutoencoderKL
original_vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", subfolder="vae", torch_dtype=torch.float16
).to("cuda")
pipe.vae = original_vae

prompt = "slice of delicious New York-style berry cheesecake"
image = pipe(
    prompt, num_inference_steps=25, height=512, width=512, guidance_scale=3.0,
    generator=torch.manual_seed(0)
).images[0]
image
Closing the issue.
Yup, TAESD directly predicts values in [0, 1] so you don't need the additional denormalization step (though clamping is still recommended). The image here looks correct to me 👍
We have latent_shift and latent_magnitude values here: https://github.com/madebyollin/taesd/blob/main/taesd.py#L44C1-L45C23. But is there a scaling_factor as well, or is it just one? (scaling_factor as observed in https://github.com/huggingface/diffusers/blob/ea5b0575f8f91b76f32fb6f6930c0bc30e42865e/src/diffusers/models/autoencoder_kl.py#L61.)
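For context, a rough sketch of how that scaling_factor is used around AutoencoderKL in diffusers (vae and images are placeholder names; illustrative only):

# vae.config.scaling_factor is 0.18215 for the SD 1.x/2.x KL VAE
# encode: scale latents so the diffusion model sees roughly unit-variance inputs
latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
# decode: undo the scaling before running the VAE decoder
decoded = vae.decode(latents / vae.config.scaling_factor).sample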