google / prompt-to-prompt


wrong DDIM inversion step #79

Open Tim3s opened 8 months ago

Tim3s commented 8 months ago
```python
from typing import Union

import numpy as np
import torch

def next_step(self, model_output: Union[torch.FloatTensor, np.ndarray], timestep: int, sample: Union[torch.FloatTensor, np.ndarray]):
    # `timestep` is reassigned to the level the latent is currently at (one stride earlier);
    # `next_timestep` is the level the latent is advanced to.
    timestep, next_timestep = min(timestep - self.scheduler.config.num_train_timesteps // self.scheduler.num_inference_steps, 999), timestep
    alpha_prod_t = self.scheduler.alphas_cumprod[timestep] if timestep >= 0 else self.scheduler.final_alpha_cumprod
    alpha_prod_t_next = self.scheduler.alphas_cumprod[next_timestep]
    beta_prod_t = 1 - alpha_prod_t
    # Predict z_0 from the current sample, then take the deterministic DDIM step to the next (noisier) timestep.
    next_original_sample = (sample - beta_prod_t ** 0.5 * model_output) / alpha_prod_t ** 0.5
    next_sample_direction = (1 - alpha_prod_t_next) ** 0.5 * model_output
    next_sample = alpha_prod_t_next ** 0.5 * next_original_sample + next_sample_direction
    return next_sample
```

This is the code that calculates the next step in null_text_w_ptp.ipynb, but I think it differs from the equation in your paper.

According to your paper, the next step can be calculated as:

$$z_{t+1}=\sqrt{\frac{\alpha_{t+1}}{\alpha_t}}\,z_t + \left(\sqrt{\frac{1}{\alpha_{t+1}} - 1} - \sqrt{\frac{1}{\alpha_t} - 1}\right)\cdot \epsilon_\theta(z_t, t, C)$$

The noise prediction is evaluated at timestep $t$, while this code advances the latent from timestep $t-20$ to timestep $t$, so it behaves as if the paper's equation used $\epsilon_\theta(z_t, t+20, C)$ instead of $\epsilon_\theta(z_t, t, C)$.
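
For concreteness, here is a minimal sketch of the indexing described above (the values 1000 and 50 mirror the notebook's defaults; the example timestep 501 is arbitrary and hypothetical):

```python
# Illustrative arithmetic only; mirrors the index bookkeeping inside next_step above.
num_train_timesteps = 1000  # scheduler.config.num_train_timesteps
num_inference_steps = 50    # scheduler.num_inference_steps
stride = num_train_timesteps // num_inference_steps  # 1000 // 50 = 20

t = 501  # example `timestep` argument (hypothetical)
prev_t, next_t = min(t - stride, 999), t  # (481, 501), as in next_step
# The latent is advanced from noise level prev_t (481) to next_t (501),
# while model_output was predicted with timestep t (501), i.e. the level
# the latent is moving to rather than the level it is currently at.
```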

Hprairie commented 6 months ago

DDIM allows for faster sampling. It looks like they are using 1000 training timesteps with 50 sampling steps, hence a stride of 20.
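
To see the stride directly, here is a minimal sketch using diffusers' DDIMScheduler (for illustration only, not taken from the notebook; the exact timestep values depend on the diffusers version and its timestep-spacing setting):

```python
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(num_inference_steps=50)
# With the default "leading" spacing this prints tensor([980, 960, 940]),
# i.e. consecutive timesteps 20 apart.
print(scheduler.timesteps[:3])
```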