Open pudepiedj opened 10 months ago
That image is pretty cool :-) !
So, in stable_diffusion/sampler.py you can use the function add_noise to add the noise for a specific timestep. By the way, this is equivalent to the first column in your image, because strength corresponds to exactly that: how far back to move in the diffusion process.
Thank you! Yes, I think I understand the denoising from the left-hand side, but I assume that the left-most image is itself the result of adding noise to the original image, and what I've been trying to do is capture the stages of that process by saving intermediate levels, just like in the picture, but starting from the original image: effectively the reverse of the process illustrated. If I go to add_noise, it only seems to be called once, but I thought the noise was added stage-by-stage to the original image, gradually degrading it into the latent space. Please correct me if I am wrong. It's an intercept for that process that I can't find. I'd be most grateful to be able to clear this up one way or the other; it's been driving me nuts! :)
The beauty of adding iid Gaussian noise is that adding a little noise N times or a lot of noise once is exactly the same, so there is no need to do the noising process iteratively. For details see equations 2-4 in Denoising Diffusion Probabilistic Models.
Now, in order to get a forward path instead of points on independent paths, the simplest way would be to use the betas directly, as follows:
import mlx.core as mx

# beta schedule (the _linspace helper from stable_diffusion/sampler.py;
# mx.linspace would do equally well)
betas = _linspace(config.beta_start, config.beta_end, config.num_train_steps)

x0 = ...  # the encoded starting-image latents
xt = [x0]
for b in betas:
    # q(x_t | x_{t-1}): scale the previous latent by sqrt(1 - beta_t)
    # and add fresh Gaussian noise with variance beta_t
    noise = mx.random.normal(shape=x0.shape)
    xt.append(noise * b.sqrt() + (1 - b).sqrt() * xt[-1])
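For reference, equations 2-4 in that paper collapse the whole loop into a single jump. A minimal self-contained sketch of that closed form (the beta_start/beta_end values here are the standard DDPM defaults, assumed rather than taken from the config):

import mlx.core as mx

num_train_steps = 1000
betas = mx.linspace(1e-4, 0.02, num_train_steps)   # assumed schedule endpoints
alphas_bar = mx.cumprod(1 - betas)                  # alpha_bar_t = prod_{s<=t} (1 - beta_s)

x0 = mx.random.normal(shape=(1, 64, 64, 4))         # stand-in for the encoded image latents
t = 500                                              # any timestep

# x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
# One sample from this has the same distribution as running the loop above t times.
eps = mx.random.normal(shape=x0.shape)
xt = mx.sqrt(alphas_bar[t]) * x0 + mx.sqrt(1 - alphas_bar[t]) * eps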
Thank you. I appreciate the explanation and that you took the trouble. I was just about to say that after scrutinising the code much more carefully I came across this in the __init__.py code:
# Get the latents from the input image and add noise according to the
# start time.
x_0, _ = self.autoencoder.encode(image[None])
x_0 = mx.broadcast_to(x_0, [n_images] + x_0.shape[1:])
x_T = self.sampler.add_noise(x_0, mx.array(start_step))
# Perform the denoising loop
yield from self._denoising_loop(
    x_T, start_step, conditioning, num_steps, cfg_weight
)
which seems just to add noise all at once. OK, now I understand. What I don't understand is why it wouldn't make good sense to use the text conditioning during the noising process as well as during the denoising, so that the latent space generated was more amenable to the eventual target image while also retaining elements of the characteristics of the seed image. But I'll read the paper you kindly recommended first, and then it may become clear!
It's a very interesting paper from which I learnt a lot, and it was especially gratifying to see, in the chained conditional probabilities, the mathematical background to the Python zip(channel, channel[1:], channel[2:]) snippet.
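For what it's worth, here is the toy version of that connection as I understand it (my own sketch, not code from the repo): zipping a sequence with shifted copies of itself walks exactly the consecutive transitions that the chained conditionals q(x_1|x_0) q(x_2|x_1) ... string together.

# a short chain standing in for x_0 ... x_4
chain = ["x0", "x1", "x2", "x3", "x4"]

# consecutive pairs (x_{t-1}, x_t): one pair per factor q(x_t | x_{t-1})
for prev, curr in zip(chain, chain[1:]):
    print(f"q({curr} | {prev})")

# consecutive triples, the zip(channel, channel[1:], channel[2:]) pattern
for a, b, c in zip(chain, chain[1:], chain[2:]):
    print(a, b, c)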
It doesn't really deal with the question of why we wouldn't 'noise selectively' using the text-prompt embedding as a guide, but I regenerated the pattern above with a different starting image and prompt and reorganised the grid, and I suddenly saw something that I hadn't appreciated: if you follow down the columns you can see the increased noising taking place, and it gets deeper at each level, so the longer-strength decodings actually start deeper into the latent space. (You can also see it in my original image above if you go up the columns, so I won't post the alternative.)
I think this must arise from the indirect inclusion of the strength parameter in the single add_noise call, via the start_step variable in the snippet I quoted above: the larger strength is, the greater the noising and the deeper into the latent space we travel before we start denoising, if I've understood it correctly.
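Roughly, as I read it (the exact scaling from strength to start_step is my guess for illustration, not the actual image2image.py code):

# hypothetical illustration of how strength might pick the starting depth
num_steps = 200                               # --steps from the command line
for strength in (0.1, 0.5, 0.9):
    start_step = int(num_steps * strength)    # assumed mapping
    print(f"strength={strength}: add_noise jumps to step {start_step}, "
          f"then denoising runs from there back to 0")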
So in effect you can do what I originally wanted to do - tracing the whole process from starting image to final image across different levels/depths of latent space - by following an L-shaped path from the bottom-left image, which is almost the original: up n images, then right n images. That gives the noising and the denoising process; even though the noising is done all at once, the effect is the same.
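In index terms (a hypothetical grid[row][col] of the saved images, row 0 at the top, just to pin the path down):

n = 5
grid = [[f"img_row{r}_col{c}" for c in range(n)] for r in range(n)]

# up the left-hand column from the bottom (the noising leg)...
l_path = [grid[r][0] for r in range(n - 1, -1, -1)]
# ...then right along the top row (the denoising leg), skipping the shared corner
l_path += [grid[0][c] for c in range(1, n)]
print(l_path)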
I think the cfg configuration parameter is closely related to the beta_t variance in the paper. Anyway, I think this can be regarded as the end of this particular journey, but running across the rows in this final image is instructive, as is the moment when the changes in the final image become more pronounced. It is less obvious here because the seed (left) and target (right) are similar, but it is very striking in other cases.
Here is the command-prompt (with a few extra parameters I've added to image2image.py to generate the necessary images for the triangular array):
% python3 stable_diffusion/image2image.py stable_diffusion/images2images/Triangular/testimageSTRENGTHNew5_200_200.png "Idyllic country landscape. Impressionism. Style of Cezanne." -sd 20 -o stable_diffusion/images2images/Triangular/testimageSTRENGTHNew6.png --n_images 1 --n_rows 1 --steps 200 --cfg 7.5 -pp -gt -sp
Deciphered:
-h, --help show this help message and exit
--strength STRENGTH value in (0,1); larger means more variation in the output image
--n_images N_IMAGES total number of images arranged in n_rows
--n_rows N_ROWS the number of rows in the grid of final images
--steps STEPS number of denoising steps N
--cfg CFG classifier-free guidance scale (how strongly the prompt steers the output)
--negative_prompt NEGATIVE_PROMPT
things to avoid in the final images
--decoding_batch_size DECODING_BATCH_SIZE
-o OUTPUT, --output OUTPUT
base.ext filename for outputs
-sd SHOW_DENOISING, --show_denoising SHOW_DENOISING
show denoising images every N iterations
-sp SAVE_PROMPT, --save_prompt SAVE_PROMPT
save the main text-prompt as metadata
--save_last_N SAVE_LAST_N
save all the last N consecutive sets of images
-pp PRINT_PARSER, --print_parser PRINT_PARSER
print the argument Namespace at inception
-gt GENERATE_TRIANGLE, --generate_triangle GENERATE_TRIANGLE
create a triangular display of progressive strength
The reason for --save_last_N is that I've discovered that the final few steps often over-smooth the output in a way that is less attractive, though of course it depends on the prompt. Saving, say, the last five steps allows the best one to be selected.
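For reference, the extra flags could be registered in image2image.py's parser along these lines (a sketch only: the types, defaults, and whether each flag takes a value are my guesses from the help text and the command above, not the actual patch):

import argparse

parser = argparse.ArgumentParser(description="image2image with extra tracing options")
# the extra flags described above (hypothetical registrations)
parser.add_argument("-sd", "--show_denoising", type=int, default=0,
                    help="show denoising images every N iterations")
parser.add_argument("-sp", "--save_prompt", action="store_true",
                    help="save the main text-prompt as metadata")
parser.add_argument("--save_last_N", type=int, default=0,
                    help="save all the last N consecutive sets of images")
parser.add_argument("-pp", "--print_parser", action="store_true",
                    help="print the argument Namespace at inception")
parser.add_argument("-gt", "--generate_triangle", action="store_true",
                    help="create a triangular display of progressive strength")
args = parser.parse_args()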
A request for information/documentation rather than an 'issue', but I've been trying to track and document the diffusion process in image2image.py in mlx-examples from start to finish. I can easily save intermediate denoising images to show the emergence of new models, but where can I insert intermediate save-image code to do the same for the noising trajectory? I've tried for several days without success. Sorry to be dense.
The denoising of a single image with --strength = [0.1, 0.2, ..., 1.0] is illuminating. Here's a (greatly-reduced-resolution) triangulated image that illustrates the whole thing as we emerge from the latent space. What I'd like to be able to do is trace the 'descent' into the latent space as well, starting from the original image. Of course nothing here is new or surprising, but I find it interesting to see how the variation from the original evolves as the strength increases. (These intermediate steps are saved every 20 iterations. What is more or less exactly the original image - also generated by mlx-examples - is, of course, the one at the bottom.)
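A sketch of the kind of intercept in question: roll the noising out step by step, as in the iterative loop shown earlier in the thread, and decode and save every k-th latent. This is only a sketch, not the repo's code: x_0 and betas come from the surrounding pipeline code quoted above, sd stands for the StableDiffusion pipeline, autoencoder.decode is assumed to be the counterpart of the autoencoder.encode call quoted earlier, and the output scaling is a guess that may need adjusting to match the example's own save path.

import mlx.core as mx
import numpy as np
from PIL import Image

k = 20                                    # save every 20 noising steps
x_t = x_0                                 # the encoded starting-image latents
for i, b in enumerate(betas):             # betas as in the loop above
    noise = mx.random.normal(shape=x_t.shape)
    x_t = noise * mx.sqrt(b) + mx.sqrt(1 - b) * x_t
    if i % k == 0:
        img = sd.autoencoder.decode(x_t)              # assumed decode API
        img = mx.clip(img / 2 + 0.5, 0.0, 1.0)        # assumed [-1, 1] -> [0, 1] scaling
        arr = np.array(img[0] * 255).astype(np.uint8)
        Image.fromarray(arr).save(f"noising_{i:04d}.png")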