ml-explore / mlx-examples

Examples in the MLX framework
MIT License

Saving Intermediate Noising Images #330

Open pudepiedj opened 8 months ago

pudepiedj commented 8 months ago

This is a request for information/documentation rather than an 'issue'. I've been trying to track and document the diffusion process in image2image.py in mlx-examples from start to finish, and I can easily save intermediate denoising images to show the emergence of the new image, but where can I insert intermediate save-image code to do the same for the noising trajectory? I've tried for several days without success. Sorry to be dense.

The denoising of a single image with --strength = [0.1, 0.2, ..., 1.0] is illuminating. Here's a (greatly reduced-resolution) triangular array of images that illustrates the whole thing as we emerge from the latent space. What I'd like to be able to do is trace the 'descent' into the latent space from the original image as well. Of course nothing here is new or surprising, but I find it interesting to see how the variation from the original evolves as the strength increases. (These intermediate steps are saved every 20 iterations. The image at the bottom is, of course, more or less exactly the original, which was also generated by mlx-examples.)

[image: triangular_pattern_1]

angeloskath commented 8 months ago

That image is pretty cool :-) !

So, in stable_diffusion/sampler.py you can use the function add_noise to add the noise for a specific timestep. By the way, this is equivalent to the first column in your image, because strength corresponds to exactly that: how far back to move in the diffusion process.
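
For anyone who wants to try it, here is a minimal sketch of the idea: encode the image once, call add_noise for a few different timesteps, and decode and save each result. It assumes a setup roughly like image2image.py, i.e. sd is the loaded StableDiffusion model, image is the preprocessed input array, and sd.decode returns images with values in [0, 1]; adjust the names to your local code.

import mlx.core as mx
import numpy as np
from PIL import Image

# Encode the starting image once (the same call that __init__.py makes).
x_0, _ = sd.autoencoder.encode(image[None])

# Noise it to a few different depths and save each one for inspection.
for step in [100, 250, 500, 900]:        # arbitrary timesteps; larger = noisier
    x_t = sd.sampler.add_noise(x_0, mx.array(step))
    decoded = sd.decode(x_t)             # back to pixel space, values in [0, 1]
    mx.eval(decoded)
    arr = (np.clip(np.array(decoded[0]), 0.0, 1.0) * 255).astype(np.uint8)
    Image.fromarray(arr).save(f"noised_step_{step}.png")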

pudepiedj commented 8 months ago

That image is pretty cool :-) !

So, in stable_diffusion/sampler.py you can use the function add_noise to add the noise for a specific timestep. By the way, this is equivalent to the first column in your image, because strength corresponds to exactly that: how far back to move in the diffusion process.

Thank you! Yes, I think I understand the denoising from the left-hand side, but I assume that the left-most image is itself the result of adding noise to the original image, and what I've been trying to do is capture the stages of that process by saving intermediate levels, just like in the picture, but starting from the original image: effectively the reverse of the process illustrated. If I go to add_noise it only seems to be called once, but I thought the noise was added stage by stage to the original image, gradually degrading it into the latent space. Please correct me if I am wrong; it's an intercept for that process that I can't find. I'd be most grateful to be able to clear this up one way or the other; it's been driving me nuts! :)

angeloskath commented 8 months ago

The beauty of adding iid Gaussian noise is that adding a little noise N times or a lot of noise once is exactly the same, so there is no need to do the noising process iteratively. For details see equations 2-4 in Denoising Diffusion Probabilistic Models.
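
In the paper's notation, the forward process adds noise one step at a time,

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right),$$

but with $\alpha_t = 1-\beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t}\alpha_s$ the composition of those steps has the closed form

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\right),$$

so $x_t$ can be drawn in a single shot as $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.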

Now, in order to get a forward path instead of points on independent paths, the simplest way would be to use the betas directly, as follows:

import mlx.core as mx

# _linspace builds the beta schedule, as in stable_diffusion/sampler.py.
betas = _linspace(config.beta_start, config.beta_end, config.num_train_steps)
x0 = ...  # the starting image (or its encoded latent)
xt = [x0]
for b in betas:
    # One forward step: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise
    noise = mx.random.normal(shape=x0.shape)
    xt.append(noise * b.sqrt() + (1 - b).sqrt() * xt[-1])
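
To turn that path into pictures, each stored latent can then be decoded and written out, say every 20 steps. This is just a sketch, assuming x0 was the encoded latent and that sd.decode and PIL are available as in the existing scripts:

import numpy as np
from PIL import Image

for i, x in enumerate(xt):
    if i % 20:                      # keep every 20th step to limit the output
        continue
    img = sd.decode(x)              # decode the (increasingly noisy) latent
    mx.eval(img)
    arr = (np.clip(np.array(img[0]), 0.0, 1.0) * 255).astype(np.uint8)
    Image.fromarray(arr).save(f"noising_{i:04d}.png")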

pudepiedj commented 8 months ago

The beauty of adding iid Gaussian noise is that adding a little noise N times or a lot of noise once is exactly the same, so there is no need to do the noising process iteratively. For details see equations 2-4 in Denoising Diffusion Probabilistic Models.

Now, in order to get a forward path instead of points on independent paths, the simplest way would be to use the betas directly, as follows:

import mlx.core as mx

# _linspace builds the beta schedule, as in stable_diffusion/sampler.py.
betas = _linspace(config.beta_start, config.beta_end, config.num_train_steps)
x0 = ...  # the starting image (or its encoded latent)
xt = [x0]
for b in betas:
    # One forward step: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise
    noise = mx.random.normal(shape=x0.shape)
    xt.append(noise * b.sqrt() + (1 - b).sqrt() * xt[-1])

Thank you. I appreciate the explanation and that you took the trouble. I was just about to say that, after scrutinising the code much more carefully, I came across this in the __init__.py code:

        # Get the latents from the input image and add noise according to the
        # start time.
        x_0, _ = self.autoencoder.encode(image[None])
        x_0 = mx.broadcast_to(x_0, [n_images] + x_0.shape[1:])
        x_T = self.sampler.add_noise(x_0, mx.array(start_step))

        # Perform the denoising loop
        yield from self._denoising_loop(
            x_T, start_step, conditioning, num_steps, cfg_weight
        )

which seems to add all the noise at once. OK, now I understand. What I don't understand is why it wouldn't make good sense to use the text conditioning during the noising process as well as during the denoising, so that the latent space generated was more amenable to the eventual target image while also retaining elements of the characteristics of the seed image. But I'll read the paper you kindly recommended first, and then it may become clear!

pudepiedj commented 8 months ago

It's a very interesting paper from which I learnt a lot, and it was especially gratifying to see and understand, in the chained conditional probabilities, the mathematical background to the Python zip(channel, channel[1:], channel[2:]) snippet.

It doesn't really deal with the question of why we wouldn't 'noise selectively' using the text-prompt embedding as a guide, but I regenerated the pattern above with a different starting image and prompt and reorganised the grid, and I suddenly saw something I hadn't appreciated: if you follow down the columns you can see the increased noising taking place, getting deeper at each level, so the larger-strength decodings actually start deeper in the latent space. (You can also see it in my original image above if you go up the columns, so I won't post the alternative.)

I think this must arise from the indirect inclusion of the strength parameter in the single add_noise call, via the start_step variable in the snippet I quoted above: the larger strength is, the greater the noising and the deeper into the latent space we travel before we start denoising, if I've understood it correctly.
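
If that's right, the dependence is monotone and would look something like the following (purely illustrative; the exact expression in image2image.py may differ, and the names are guesses based on the snippet quoted above):

# Illustration only: a larger strength picks a later start_step, so the single
# add_noise call injects more noise and the denoising starts deeper in latent space.
start_step = int(round(strength * config.num_train_steps))
x_T = sd.sampler.add_noise(x_0, mx.array(start_step))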

So in effect you can do what I originally wanted to do (tracing the whole process from starting image to final image across different depths of the latent space) by following an L-shaped path from the bottom-left image, which is almost the original: up n images and then right n images. That gives the noising and the denoising, and even though the noising is done all at once, the effect is the same.

pudepiedj commented 8 months ago

I think the cfg configuration parameter is closely related to the beta_t variance in the paper. Anyway, I think this can be regarded as the end of this particular journey, but running across the rows in this final image is instructive, as is the moment when the changes in the final image become more pronounced. It is less obvious here because the seed (left) and target (right) are similar, but it is very striking in other cases.

[image: tpyramid_pattern6]

Here is the command (with a few extra parameters I've added to image2image.py to generate the necessary images for the triangular array):

% python3 stable_diffusion/image2image.py stable_diffusion/images2images/Triangular/testimageSTRENGTHNew5_200_200.png "Idyllic country landscape. Impressionism. Style of Cezanne." -sd 20 -o stable_diffusion/images2images/Triangular/testimageSTRENGTHNew6.png --n_images 1 --n_rows 1 --steps 200 --cfg 7.5 -pp -gt -sp

Deciphered:

  -h, --help            show this help message and exit
  --strength STRENGTH   value in (0,1); larger means more variation in the output image
  --n_images N_IMAGES   total number of images arranged in n_rows
  --n_rows N_ROWS       the number of rows in the grid of final images
  --steps STEPS         maximum number of steps N
  --cfg CFG             configuration number N
  --negative_prompt NEGATIVE_PROMPT
                        things to avoid in the final images
  --decoding_batch_size DECODING_BATCH_SIZE
  -o OUTPUT, --output OUTPUT
                        base.ext filename for outputs
  -sd SHOW_DENOISING, --show_denoising SHOW_DENOISING
                        show denoising images every N iterations
  -sp SAVE_PROMPT, --save_prompt SAVE_PROMPT
                        save the main text-prompt as metadata
  --save_last_N SAVE_LAST_N
                        save all the last N consecutive sets of images
  -pp PRINT_PARSER, --print_parser PRINT_PARSER
                        print the argument Namespace at inception
  -gt GENERATE_TRIANGLE, --generate_triangle GENERATE_TRIANGLE
                        create a triangular display of progressive strength

The reason for --save_last_N is that I've discovered that the final few steps often over-smooth the output in a way that is less attractive, though of course it depends on the prompt. Saving the last five steps allows the best one to be selected.
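
For anyone curious, here is a minimal sketch of how --save_last_N can work around the generator that image2image.py iterates over. The flag itself is my local addition, and the names latents, sd, and args, as well as the sd.decode call and the PIL conversion, follow the pattern of the existing scripts but are assumptions about the surrounding code:

from collections import deque

import numpy as np
from PIL import Image

# Keep only the final N latents produced by the denoising loop.
last_n = deque(maxlen=args.save_last_N)
for x_t in latents:                 # latents is the generator the script already iterates
    mx.eval(x_t)
    last_n.append(x_t)

# Decode and save the retained steps so the most attractive one can be picked by eye.
for i, x in enumerate(last_n):
    decoded = sd.decode(x)
    mx.eval(decoded)
    arr = (np.clip(np.array(decoded[0]), 0.0, 1.0) * 255).astype(np.uint8)
    Image.fromarray(arr).save(f"last_step_{i}.png")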