ouusan / some-papers


Golden Noise for Diffusion Models: A Learning Framework #33

Open ouusan opened 4 days ago

ouusan commented 4 days ago

paper: https://arxiv.org/pdf/2411.09502v1

1. Motivations
(1) It is well known that text prompts matter significantly to the quality and fidelity of the synthesized images. However, image synthesis is driven by both the text prompt and the initial noise. The noise affects both the overall aesthetics and the semantic faithfulness between the synthesized image and the provided text prompt, yet not all initial noise is equally effective ---> turn a random noise into a golden noise.
(2) Recent studies [3, 5, 7, 22, 27] observe that some selected or optimized noises are golden noises that help the diffusion models produce images with better semantic faithfulness to the text prompt, and can also improve the overall quality of the synthesized images. Still, these methods are often not widely adopted in practice, e.g. because they introduce significant time delays in order to optimize the noises, or require in-depth modifications when applied to diffusion models with varying architectures. ---> formulate a noise prompt learning framework: collect a training dataset for noise prompt learning, so that the trained NPNet can directly transform a random Gaussian noise into a golden noise to boost performance.

  2. Overview (figure: overall NPNet pipeline)

    3. Key ideas
    (1) Noise prompt dataset (NPD) collection:
    original random Gaussian noise xT (with text prompt) ---> one denoising step gives xT−1 ---> DDIM-Inversion(·) with the text prompt ---> x′T. Then xT and x′T ---> standard denoising process ---> synthesized images x0, x′0 ---> score with HPSv2 (human preference score) and keep only noise pairs meeting the criterion s0 + m < s′0 ---> pairs of noise and golden noise (xT, x′T) for training.
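The HPSv2-based selection step above can be sketched as a simple filter. This is a minimal sketch: `hpsv2_score` is a hypothetical stub standing in for the real HPSv2 preference model, and the data layout is assumed for illustration.

```python
import numpy as np

def hpsv2_score(image):
    """Hypothetical stand-in for the HPSv2 human-preference scorer;
    here the mean pixel value plays the role of a 'preference' score."""
    return float(np.asarray(image).mean())

def collect_noise_pairs(noise_pairs, image_pairs, margin=0.05):
    """Keep only (xT, x'T) pairs whose re-synthesized image x'0 scores
    higher than the original x0 by at least the margin m: s0 + m < s'0."""
    kept = []
    for (x_t, x_t_prime), (img, img_prime) in zip(noise_pairs, image_pairs):
        s0, s0_prime = hpsv2_score(img), hpsv2_score(img_prime)
        if s0 + margin < s0_prime:  # the paper's selection criterion
            kept.append((x_t, x_t_prime))
    return kept
```

Only the filtering criterion is real here; the one-step denoise and DDIM inversion that produce each candidate pair are upstream of this function.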

    (2) Noise prompt network (NPNet):

    <1> **Singular value predictor: predicts the singular values of the target noise** (why SVD: the singular vectors of xT and x′T exhibit remarkable similarity in latent space). (figure) **ϕ(·, ·, ·) is a ternary function that sums its three inputs, f(·) is a linear layer, and g(·) is a multi-head self-attention layer.** <2> **Residual predictor: predicts the residual between the source noise and the target noise.** (figures) **φ(·) is the UpSample-DownConv operation, φ′(·) is the DownSample-UpConv operation, and ψ(·) is a ViT model; e is the normalized text embedding.**
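How the two branches combine can be sketched with plain NumPy: since xT and x′T share very similar singular vectors, the SVD branch reuses the singular vectors of xT and only swaps in predicted singular values, and the residual branch adds a learned correction. The function below is an illustrative reconstruction step, not the NPNet forward pass; the predicted inputs are assumed to come from the two (not shown) predictor networks.

```python
import numpy as np

def reconstruct_golden_noise(x_t, predicted_singular_values, predicted_residual):
    """Combine the two NPNet branch outputs (shapes/names are assumptions):
    keep the singular vectors of the source noise x_t, replace its singular
    values with the predicted ones, then add the predicted residual."""
    u, _, vt = np.linalg.svd(x_t, full_matrices=False)
    svd_branch = u @ np.diag(predicted_singular_values) @ vt
    return svd_branch + predicted_residual
```

A sanity check on the design: if the predictors returned xT's own singular values and a zero residual, the output would reproduce xT exactly, so the network only has to learn the deviation toward x′T.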

Once trained, NPNet can be applied directly to a T2I diffusion model by feeding it the initial noise xT and the prompt embedding c produced by the diffusion model's frozen text encoder.

ouusan commented 4 days ago

Recent studies [3, 5, 7, 22, 27] observe that some selected or optimized noises are golden noises that help T2I diffusion models produce images with better semantic faithfulness to the text prompt, and can also improve the overall quality of the synthesized images. [3, 7] incorporate attention to reduce the truncation errors: [3] #36 Attend-and-Excite, [7] #34 INITNO.

ouusan commented 3 days ago

Meng et al. [23] reported that adding random noise at each timestep during the sampling process and then re-denoising leads to a substantial improvement in the semantic faithfulness of the synthesized images. [23] #35
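The re-noise-then-re-denoise idea can be illustrated with a toy loop. This is a schematic sketch only: `denoise_step` is a hypothetical stand-in for one reverse-diffusion step, and the noise schedule is a fixed scale rather than the paper's actual schedule.

```python
import numpy as np

def sample_with_renoising(x, denoise_step, num_steps=50, noise_scale=0.1, seed=0):
    """Toy illustration of the resampling idea from Meng et al. [23]:
    inject fresh Gaussian noise at each timestep, then denoise again.
    `denoise_step(x, t)` is an assumed callback for one reverse step."""
    rng = np.random.default_rng(seed)
    for t in range(num_steps):
        x = x + noise_scale * rng.normal(size=x.shape)  # re-noise
        x = denoise_step(x, t)                          # re-denoise
    return x
```

With a toy contraction as the denoiser (e.g. `lambda x, t: 0.5 * x`), the iterate settles near the denoiser's fixed point while the injected noise keeps perturbing intermediate states, which is the mechanism the paper credits for improved faithfulness.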