madebyollin / taesd

Tiny AutoEncoder for Stable Diffusion

Data range when training taesd #18

Closed stardusts-hj closed 1 month ago

stardusts-hj commented 1 month ago

Thanks for providing taesd. I'm trying to finetune taesd, and I'm wondering what data range you used when training it. I see there is a data transformation in the diffusers AutoencoderTiny; is it correct that you trained taesd with the following data ranges?

'''training'''
x # [0,1]
SD_latent = SD_vae.encoder(x) * vae_factor
taesd_latent = taesd_encoder(x) 
enc_loss = L2(SD_latent, taesd_latent)

taesd_output = taesd_decoder(SD_latent)
dec_loss = L2(x, taesd_output)

However, in the diffusers code, they convert the output of taesd decoder with scaling

def forward(self, x):
    x = self.layers(x)
    # scale image from [0, 1] to [-1, 1] to match diffusers convention
    return x.mul(2).sub(1)
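
If I understand correctly, getting a viewable image back just inverts that map; a minimal sketch (the torch.rand tensor is only a stand-in for a real decoder output):

'''undoing the diffusers output scaling'''
import torch
decoded = torch.rand(1, 3, 512, 512).mul(2).sub(1)  # stand-in decoder output in [-1, 1]
img01 = decoded.add(1).div(2).clamp(0, 1)           # back to [0, 1] for viewing/saving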

Does that mean I have to convert x when calculating the dec_loss if I want to use AutoencoderTiny in diffusers, like this?

'''training'''
x # [0,1]
SD_latent = SD_vae.encoder(x) * vae_factor
taesd_latent = taesd_encoder(x) 
enc_loss = L2(SD_latent, taesd_latent)

taesd_output = taesd_decoder(SD_latent) # auto convert to [-1,1]
# convert x to [-1,1]
dec_loss = L2(x.mul(2).sub(1), taesd_output)

madebyollin commented 1 month ago

The data range conventions are:

  • taesd.py: images are in [0, 1], latents are gaussian-distributed
  • diffusers.AutoencoderTiny: images are in [-1, 1], latents are unit-normalized (you could apply the scale factor, but it's just 1.0)
  • diffusers.AutoencoderKL: images are in [-1, 1], latents are not unit-normalized until you apply the scale factor

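For concreteness, here's a rough sketch of those conventions in diffusers (the model IDs and the torch.rand stand-in image are illustrative assumptions, not part of any training recipe):

'''data range conventions, illustrated'''
import torch
from diffusers import AutoencoderKL, AutoencoderTiny

x01 = torch.rand(1, 3, 512, 512)  # image in [0, 1] (taesd.py convention)
x11 = x01.mul(2).sub(1)           # image in [-1, 1] (diffusers convention)

with torch.no_grad():
    # AutoencoderKL: latents only become unit-normalized after the scale factor
    vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")
    kl_latent = vae.encode(x11).latent_dist.sample() * vae.config.scaling_factor

    # AutoencoderTiny: latents come out unit-normalized (scaling_factor is just 1.0)
    tiny = AutoencoderTiny.from_pretrained("madebyollin/taesd")
    tiny_latent = tiny.encode(x11).latents
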
These ranges apply to both inputs and outputs. So your examples need to scale images before sending them to the SD VAE encoder. I think the correct pseudocode would be:

'''training with taesd.py'''
x # [0,1]
SD_latent = SD_vae.encoder(x.mul(2).sub(1)) * vae_factor
taesd_latent = taesd.encoder(x) 
enc_loss = L2(SD_latent, taesd_latent)

taesd_output = taesd.decoder(SD_latent)
dec_loss = L2(x, taesd_output)

'''training with diffusers.AutoencoderTiny'''
x # [0,1]
SD_latent = SD_vae.encoder(x.mul(2).sub(1)) * vae_factor
taesd_latent = autoencodertiny_encoder(x.mul(2).sub(1)) 
enc_loss = L2(SD_latent, taesd_latent)

taesd_output = autoencodertiny_decoder(SD_latent) # auto convert to [-1,1]
# convert x to [-1,1]
dec_loss = L2(x.mul(2).sub(1), taesd_output)

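If it helps, here's one rough, runnable version of the diffusers.AutoencoderTiny pseudocode above (the optimizer, learning rate, random image batch, and plain L2 losses are placeholder assumptions, not the actual training recipe):

'''concrete sketch of the diffusers.AutoencoderTiny training step'''
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, AutoencoderTiny

# frozen SD VAE as the latent "teacher"; model IDs are illustrative
teacher = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"
).eval().requires_grad_(False)
student = AutoencoderTiny.from_pretrained("madebyollin/taesd")
opt = torch.optim.Adam(student.parameters(), lr=1e-4)  # placeholder optimizer

x = torch.rand(4, 3, 256, 256)  # stand-in batch of images in [0, 1]
x11 = x.mul(2).sub(1)           # both diffusers models expect [-1, 1]

with torch.no_grad():
    # teacher latents, unit-normalized via the scale factor
    SD_latent = teacher.encode(x11).latent_dist.sample() * teacher.config.scaling_factor

taesd_latent = student.encode(x11).latents
enc_loss = F.mse_loss(taesd_latent, SD_latent)

taesd_output = student.decode(SD_latent).sample  # decoder output lands in [-1, 1]
dec_loss = F.mse_loss(taesd_output, x11)         # so compare against x in [-1, 1]

(enc_loss + dec_loss).backward()
opt.step()
opt.zero_grad()
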
I posted example TAESDXL training code here, BTW; it should be a useful reference (specifically the DiffusersVAEWrapper portion).

stardusts-hj commented 1 month ago

Thank you so much for your reply! I'll follow your example.