ttulttul / ComfyUI-Iterative-Mixer

Nodes that implement iterative mixing of samples to help with upscaling quality
GNU General Public License v3.0

ComfyUI Iterative Mixing Nodes

This repo contains nodes for ComfyUI that combine to implement a strategy I'm calling Iterative Mixing of Latents. The technique is somewhat borrowed from the DemoFusion paper, with gratitude. I also acknowledge BlenderNeko for the inspiration that led to the Batch Unsampler node included in this pack.

Note: This pack contains some deprecated nodes

In my first attempt at iterative mixing, I developed a few KSampler-type nodes that call the underlying sample() function inside ComfyUI to denoise and then mix noised latents step by step. This method produces grainy output for reasons I don't yet fully understand. On the advice of @comfyanonymous, I have now moved on to a new approach that uses the SamplerCustom node.

Updates

July 9th, 2024

June 13th, 2024

March 9th, 2024

February 9th, 2024

January 12th, 2024

January 4th, 2024

December 30th, 2023

December 29th, 2023

Nodes

IterativeMixingSampler:

Important: Do not enable add_noise in the SamplerCustom node. The iterative mixing sampler does not need noise to be injected by the SamplerCustom node and will generate garbage if this option is set to true.

This node feeds into a SamplerCustom node to implement iterative mixing sampling, with various options to control the process.

IterativeMixingScheduler:

Use this node in combination with the IterativeMixingSampler described above. This node generates a set of sigmas to feed into the mixing sampler. It requires a model input to fetch the sigmas-related parameters from the model. The options on this node should be self-explanatory.

IterativeMixingSchedulerAdvanced:

This is the same as the IterativeMixingScheduler but adds a start and end step so that you can control the steps for denoising. Try setting the end step to 50% of the total step count, which will cause the iterative mixing sampler to generate grainy output containing rich noise that can be passed into another iterative mixing sampler for refinement.

Note that you must adjust the start_blending_at and stop_blending_at parameters on the sampler node to match the same proportion of total step count specified in the IterativeMixingSchedulerAdvanced, otherwise the blending schedule will configure itself to match the total steps in the range you specify in this node, rather than fitting the curve to the steps parameter. For instance, if you set steps to 40 and start_at_step to 0 and end_at_step to 20 (i.e. 50% of the way through the total steps), then you must adjust stop_blending_at to 80 so that the blending schedule will be stretched horizontally by a factor of two to account for the smaller length of the sigmas tensor being passed by the IterativeMixingSchedulerAdvanced. I know this is confusing and if a better way emerges, I will support it.
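The arithmetic in the example above can be sketched as follows. This is only an illustration: `stretched_stop_blending_at` is a hypothetical helper that does not exist in this node pack, and it assumes that the blending bound is expressed in steps and would otherwise equal `steps`.

```python
def stretched_stop_blending_at(steps, start_at_step, end_at_step):
    """Hypothetical helper: stretch the blending end point to cover a partial step range."""
    span = end_at_step - start_at_step
    if span <= 0:
        raise ValueError("end_at_step must be greater than start_at_step")
    # The scheduler only hands over `span` sigmas, so the blending curve is
    # stretched by steps / span to keep the shape it would have over `steps`.
    stretch = steps / span
    return round(steps * stretch)


print(stretched_stop_blending_at(steps=40, start_at_step=0, end_at_step=20))  # 80
```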

MixingMaskGeneratorNode:

This node generates a batch of perlin noise masks. In future, you will be able to feed these masks into the IterativeMixingSampler to precisely control latent mixing by applying a mask to the process at each step. For now, it offers a way to see for yourself what the perlin masks look like at various scale levels.

Deprecated Nodes

The nodes described below are deprecated. Use them at your own risk to obtain interesting and potentially buggy results.

Iterative Mixing KSampler:

This node de-noises a latent image while mixing a bit of a noised sequence of latents at each step.

Batch Unsampler:

This node takes a latent image as input, adding noise to it in the manner described in the original Latent Diffusion Paper.

Note: The normalize_fraction amount is highly experimental and unscientific. It serves to remove any bias in the final noised latent in the batch so that sampling from that noise will have the maximum possible dynamic range.

Iterative Mixing KSampler Advanced:

This node de-noises a latent image while mixing a bit of the noised latents in from the Batch Unsampler at each step. Note that the number of steps is inferred from the size of the input latent batch from the Batch Unsampler, which is why this parameter is missing.

How the hell does this work?

Whenever we upscale an image, we are taking a relatively small amount of information and making it cover a larger spatial area. Given a 64x64 latent image (this is the size of latent that Stable Diffusion 1.5 VAEs generate), upscaling to 128x128 implies that each of the original latent pixels has to do 4x the work.

Here is a walk-through of how upscaling happens using this node pack.

Example: We start with a 768x512px image generated from an SD1.5 model. Since this image was generated at a resolution that is close to the resolution of images the model was trained on (SD1.5 was trained at 512x512), the model can generate a coherent scene that properly attends to all the pixels no matter how far apart they are, ensuring we don't get extra limbs and weirdness:

Original 768x512px SD1.5 Image (from latent)

Then, we upscale it by 2x using the wonderfully fast NNLatentUpscale model, which uses a small neural network to upscale the latents as they would be upscaled if they had been converted to pixel space and back. This results in a pretty clean but somewhat fuzzy 2x image:

Roughly 2x upscaled 1536x1024px SD1.5 Image (from latent)

Notice how the upscale is larger, but it's fuzzy and lacking in detail. The fuzziness occurs because upscaling the latents cannot make up for the lost information between the pixels. We need to fix this problem somehow. We could use an upscaling model, which has been trained on thousands of high and low resolution images of all kinds to "fill in the gaps," but that would be so boring...

So instead, we use the Batch Unsampler node from this node pack to generate a sequence of progressively noisier latents. The noise is added to the latents using the same noise schedule as the underlying SD model. This is the noise schedule that was used during training of the model and it does not rely on conditioning, which is why the Batch Unsampler node does not ask for positive or negative conditioning inputs:

Result of applying the Batch Unsampler node to generate 21 progressively noised latents
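To make the noising step concrete, here is a minimal pure-Python sketch of the standard forward-noising formula, in which each latent value becomes `sqrt(alpha_bar_t) * x + sqrt(1 - alpha_bar_t) * noise`. It is not the Batch Unsampler's actual code. The function names are hypothetical, the latent is a flat list of floats rather than a tensor, the beta range (0.00085 to 0.012, scaled-linear, 1000 training steps) is the commonly published SD1.5 configuration and is assumed here, and drawing fresh noise for each level is also an assumption.

```python
import math
import random


def sd_alphas_cumprod(num_train_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Cumulative product of (1 - beta_t) for a scaled-linear beta schedule."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    alphas_cumprod = []
    running = 1.0
    for i in range(num_train_steps):
        frac = i / (num_train_steps - 1)
        # "Scaled linear": interpolate in sqrt(beta) space, then square.
        beta = (lo + frac * (hi - lo)) ** 2
        running *= 1.0 - beta
        alphas_cumprod.append(running)
    return alphas_cumprod


def batch_unsample(latent, count=21, seed=0):
    """Return `count` copies of `latent`, from least noisy to most noisy."""
    if count < 2:
        raise ValueError("count must be at least 2")
    rng = random.Random(seed)
    acp = sd_alphas_cumprod()
    batch = []
    for k in range(count):
        # Spread the requested levels evenly over the training timesteps.
        t = round(k * (len(acp) - 1) / (count - 1))
        signal = math.sqrt(acp[t])
        noise = math.sqrt(1.0 - acp[t])
        batch.append([signal * x + noise * rng.gauss(0.0, 1.0) for x in latent])
    return batch


batch = batch_unsample([1.0] * 8, count=21)
print(len(batch))  # 21
```

Note that nothing in this sketch needs a prompt: the schedule depends only on the timestep, which is consistent with the node not asking for conditioning.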

We pass the progressively-noised sequence along with the fuzzy 2x-upscaled latent into the Iterative Mixing KSampler Advanced node. The Iterative Mixing KSampler then runs the diffusion sampler step by step, mixing in a bit of the noised sequence latents at each time step. The fraction of the noised latents that is mixed in declines as the steps progress, in accordance with a blending schedule. By default, the blending schedule is a cosine-exponential curve that starts off giving lots of input from the noised sequence, decaying to almost nothing by the end. You can also choose a linear schedule or a logistic curve schedule. Play around with the choice of blending schedule to get different results. There is no hard and fast rule as to which schedule is the best.

After iterative mixing sampling, we get a new 1536x1024px image:

Result of de-noising the rough 2x upscale using the Iterative Mixing KSampler Advanced node

This image is richer in detail than the original fuzzy image we passed in (as a latent) before iterative mixing, but it also has some residual noise that shows up as graininess. The graininess results because the stable diffusion model was never trained to operate in this manner and is unable to eliminate all of the noise that was mixed in step by step by the iterative mixing sampler:

Final result from iterative mixing sampler

Fortunately, getting rid of the noise is really easy. To clean up the noise, we simply pass it through another KSampler at a very low denoising strength. This final "clean up" should be run at the lowest possible denoising strength to avoid removing details and generating artifacts.

Tips for getting good results

Here are some things to try to get good results. By default, as mentioned above, the iterative mixing sampler will generate grainy output. Cleaning up the grainy output can be accomplished by a second sampler at a low denoise strength of 0.05 - 0.25. General tips:

  1. Play with the de-noising strength in the Iterative Mixing KSampler node. Using a de-noising strength of 1.0 will generate the grainiest output, but the output will have a very high level of detail when refined. Using a lower de-noising strength will generate less noise at the output, but won't require as much refinement. Generally speaking, if you use a low de-noise in the iterative mixer, then you will need a lower de-noise in the refinement sampler.

  2. Try starting with 20 to 40 steps of iterative mixing. You can use fewer steps to get "interesting" output, but it's possible that the amount of noise will be so high as to skew your generation quite far from the coherent output you desire.

  3. Scale by no more than 2x at a time. If you are seeking a 4x output, use two phases of upscaling and iterative mixing. I have found that scaling up by more than 2x can result in coherence issues like extra limbs and whatnot.

  4. Try using ControlNet nodes. Apply a depth or open pose ControlNet based on the initial low resolution image and send that conditioning into the iterative mixing sampler. This will guide the sampling process more carefully along the lines of the structure that you care about. If you are having problems with too many fingers or extra arms, a pose ControlNet will help. The depth ControlNet can help to ensure the structure of a room remains consistent.

  5. Try using IPAdapter nodes. Similar to ControlNet, IPAdapter allows you to condition the model based on the semantic information in a source image. Feed your 512px image into an IPAdapter and use this to modify your model before inputting into the iterative mixing sampler. You may find this improves the iterative mixing output by better aligning it to the precise composition of the low resolution image.

  6. Try using PatchModelAndDownscale to adjust your model for the upscale sampling passes. This relatively new node (as of December 2023) implements the Kohya "DeepShrink" concept, downscaling one of the layers of the SD model U-Net for a few steps to increase the "receptive field" of the model, which is appropriate when you are generating images at a resolution that is higher than the resolution the model was trained at. For each upscale level (2x, 4x, etc.), I suggest setting the downscale_factor to the same amount. In other words, downscale by 2x for the first pass and 4x for the second pass if you are doing a 4x upscale through two iterative mixing samplers.

What does "Iterative Mixing" mean?

I made up the term "Iterative Mixing." Sorry. In the DemoFusion paper, they use the term "skip residual" (see section 3.3), but I just don't like that term (emphasis below is mine):

For each generation phase $s$, we have already obtained a series of noise-inversed versions of $z_0^{'s}$ as $z_t^{'s}$ with $t$ in $[1, T]$. During the denoising process, we introduce the corresponding noise-inversed versions as skip residuals. In other words, we modify $p_{\theta}(z_{t-1}|z_t)$ to $p_{\theta}(z_{t-1}|\hat{z_t})$ with

$$ \hat{z_t}^{s} = c_1 \times z_t^{'s} + (1 - c_1) \times z_t^{s}, $$

where $c_1 = \left(\frac{1 + \cos(\frac{\pi t}{T})}{2}\right)^{\alpha_1}$ is a scaled cosine decay factor with a scaling factor $\alpha_1$. This essentially utilizes the results from the previous phase to guide the generated image's global structure during the initial steps of the denoising process. Meanwhile, we gradually reduce the impact of the noise residual, allowing the local denoising paths to optimize the finer details more effectively in the later steps.

Note: I correct the author by removing the 2 from the $2{\pi}$ that was originally in the numerator of the cosine expression in the paper's equation 4. The author acknowledged the mistake and will correct it in a future revision.
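As a quick check of the corrected expression, here are its values at the start, middle, and end of the range, assuming $\alpha_1 > 0$:

$$ c_1\big|_{t=0} = \left(\frac{1 + \cos 0}{2}\right)^{\alpha_1} = 1, \qquad c_1\big|_{t=T/2} = \left(\frac{1 + \cos\frac{\pi}{2}}{2}\right)^{\alpha_1} = 2^{-\alpha_1}, \qquad c_1\big|_{t=T} = \left(\frac{1 + \cos\pi}{2}\right)^{\alpha_1} = 0. $$

With $2\pi$ in the numerator, the last value would instead be $\left(\frac{1 + \cos 2\pi}{2}\right)^{\alpha_1} = 1$, so the factor would climb back to full strength at the end of the range rather than decaying.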

How are you supposed to use this node?

Latent Diffusion Models (LDMs) are trained on images having a particular resolution. The original Stable Diffusion was trained on images with a maximum resolution of 512x512px and having 3 channels (RGB). When converted into the latent space by a Variational Auto Encoder (VAE), the resolution of the "latent" is 64x64 and the images have 4 channels which have no meaningful analog in color space.
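The size relationship can be written down as a tiny helper. This is just a sketch: `latent_shape` is a hypothetical name, and the factor of 8 is inferred from the 512px to 64 ratio quoted above.

```python
def latent_shape(width_px, height_px, channels=4, downscale=8):
    """Return (channels, height, width) of the latent for an image size in pixels."""
    if width_px % downscale or height_px % downscale:
        raise ValueError("image dimensions should be multiples of the downscale factor")
    return (channels, height_px // downscale, width_px // downscale)


print(latent_shape(512, 512))    # (4, 64, 64)
print(latent_shape(1024, 1024))  # (4, 128, 128)
```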

If you try to use a model that was trained on 512x512px images to generate an image larger than 512x512, the model will happily generate pixels for you, but the image will not have internal consistency. For example, you may find the model generating extra people or extra limbs. The receptive field of the model, in other words, is only 512x512.

The idea behind this node is to help the model along by giving it some scaffolding from the lower resolution image while denoising takes place in a sampler (i.e. a KSampler in ComfyUI parlance). We start by generating an image at a resolution supported by the model - for example, 512x512, or 64x64 in the latent space. We then upscale the latent image by 2x, giving us a 128x128 latent (1024x1024 in pixel space). And then, within the Batch Unsampler, we add noise to the 2x latent progressively according to the noise schedule that was used in training the original stable diffusion model to obtain a sequence of progressively noisier latents.

These progressively noisier latents are then passed to the Iterative Mixing KSampler, which performs the usual latent diffusion sampling. However, rather than starting our next phase of sampling using the 2x latent, we instead start with the noisiest latent from the Batch Unsampler output. Then, each timestep, we mix in a portion of the next-least-noisy latent from that batch. The portion that we mix in declines over time until in the final step, we are mixing in almost none of it. The image below illustrates the process outlined in the DemoFusion paper:

The skip residual algorithm

By mixing in a portion of the noised 2x latent, we are helping the LDM model to generate a new image that is more consistent with the structure of the original image, but since the de-noising is indeed happening at the new resolution, the details should be filled in at full strength.
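The mixing loop described above can be sketched in a few lines of pure Python. This is a toy illustration under stated assumptions, not the sampler's real code: `iterative_mix` and `cosine_blend` are hypothetical names, latents are flat lists of floats, `denoise_step` is a stand-in callable for one step of the diffusion model, and the blending factor is indexed by the number of de-noising steps already taken so that it starts at 1 and decays.

```python
import math


def cosine_blend(step, steps, alpha_1=1.0):
    """Scaled cosine decay: 1.0 at step 0, falling to 0.0 at step == steps."""
    return ((1.0 + math.cos(math.pi * step / steps)) / 2.0) ** alpha_1


def iterative_mix(noised_batch, denoise_step, alpha_1=1.0):
    """noised_batch[k] is the guide latent at noise level k (0 = cleanest)."""
    steps = len(noised_batch) - 1
    z = list(noised_batch[-1])  # start from the noisiest latent
    for i in range(steps):
        level = steps - i
        c = cosine_blend(i, steps, alpha_1)
        guide = noised_batch[level]
        # Skip residual: blend the guide at this noise level into the sample.
        z_hat = [c * g + (1.0 - c) * x for g, x in zip(guide, z)]
        # One de-noising step takes the blended latent down one noise level.
        z = denoise_step(z_hat, level)
    return z


# Toy data: the "latent" at noise level k is just the number k, and the
# stand-in denoiser removes exactly one level of "noise" per step.
toy_batch = [[float(k)] * 2 for k in range(5)]
result = iterative_mix(toy_batch, lambda z, level: [x - 1.0 for x in z])
print(result)
```

In this sketch the blend happens before each de-noising step, which follows the paper's description of replacing $z_t$ with $\hat{z_t}$; whether the nodes in this pack blend before or after the step is not something the text above specifies.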

The authors of the DemoFusion paper say that the output of the skip-residual process is grainy. I agree. But we can easily fix this in ComfyUI by bolting a KSampler onto the output with a low denoising level to clean things up. In the DemoFusion paper, they invent an elaborate sliding window algorithm that they call "dilated sampling" to incorporate local and global context during de-noising. However, testing of their algorithm suggests that we may be better off just doing a bit more sampling at a light de-noising level and introducing other concepts like ControlNet to constrain the generation of extra limbs and whatnot. Your mileage may vary.

What the hell is $\alpha_1$?

The $\alpha_1$ parameter controls the shape of the cosine blending schedule mentioned in the math above from the DemoFusion paper. The blending factor $c_1$ is supposed to be high during the early part of the de-noising process to encourage the model to take lots of hints from the noised 2x latent batch, and $\alpha_1$ sets how quickly it falls away from that starting value. Later, we want the model to be guided less by the noised latent batch because otherwise we'd end up just copying it, including artifacts of the rough 2x upscaling that we did using bicubic interpolation or whatever other simple interpolation algorithm. In other words, we want the model to have more creative freedom as more of the image comes into focus.

Here is what various values of $\alpha_1$ yield in the cosine blending schedule:

The alpha_1 blending schedule
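For readers without the image, the same curves can be tabulated with a short script. This is a sketch of the formula quoted from the paper (with the corrected $\pi$), using hypothetical function names and counting `step` from the start of de-noising.

```python
import math


def cosine_blend(step, steps, alpha_1):
    """Blending factor c_1 after `step` of `steps` de-noising steps."""
    return ((1.0 + math.cos(math.pi * step / steps)) / 2.0) ** alpha_1


def blend_schedule(steps, alpha_1):
    """All blending factors from the first step (1.0) to the last (0.0)."""
    return [cosine_blend(i, steps, alpha_1) for i in range(steps + 1)]


for alpha_1 in (0.5, 1.0, 3.0):
    row = " ".join(f"{c:.2f}" for c in blend_schedule(4, alpha_1))
    print(f"alpha_1={alpha_1}: {row}")
```

Smaller values of `alpha_1` keep the factor close to 1 for longer, so the noised latents keep steering the result later into sampling; larger values make the guidance drop off early.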

Note: I provide a linear blending option you can try to achieve a different result. There is no inherent reason why the blending schedule should be a cosine exponential. I think the paper's authors just thought it would be a good idea and perhaps their various PhDs in mathematics give them that liberty.

What do the different blending functions do?

The blending_function lets you select between addition, slerp, and norm_only to vary how the latents are blended during sampling. To provide a sense of what these different blending options actually do, here are some illustrations:

addition

The addition blending function

slerp

The slerp blending function

norm_only

The norm_only blending function
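As a rough guide to the difference between the first two options, here is a pure-Python sketch of a weighted sum and of spherical linear interpolation (slerp) on flat lists of floats. These are the textbook definitions, assumed rather than taken from this pack's source, and the function names are hypothetical. norm_only is left out because its exact definition is not given here.

```python
import math


def blend_addition(z, guide, c):
    """Weighted sum: c parts guide, (1 - c) parts current sample."""
    return [c * g + (1.0 - c) * x for x, g in zip(z, guide)]


def blend_slerp(z, guide, c, eps=1e-7):
    """Spherical interpolation from z (c = 0) to guide (c = 1)."""
    norm_z = math.sqrt(sum(x * x for x in z))
    norm_g = math.sqrt(sum(g * g for g in guide))
    if norm_z < eps or norm_g < eps:
        return blend_addition(z, guide, c)
    cos_omega = sum(x * g for x, g in zip(z, guide)) / (norm_z * norm_g)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Vectors are (anti)parallel; fall back to the weighted sum.
        return blend_addition(z, guide, c)
    w_z = math.sin((1.0 - c) * omega) / sin_omega
    w_g = math.sin(c * omega) / sin_omega
    return [w_z * x + w_g * g for x, g in zip(z, guide)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


z, guide = [1.0, 0.0], [0.0, 1.0]
print(norm(blend_addition(z, guide, 0.5)))  # about 0.707
print(norm(blend_slerp(z, guide, 0.5)))     # about 1.0
```

The weighted sum of two unit-length vectors shrinks toward the origin, while slerp keeps the length, which is one reason slerp is often preferred when blending noise-like latents.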

What does rewind do?

Iterative mixing often produces results that are a bit blurry or lacking in detail. By rewinding back a bit and doing it again, we can produce more details. The rewind setting tells the sampler to follow this algorithm:

  1. Generate noised latents based on the input latent for the full step range (0 to steps).

  2. De-noise from 0 to steps, mixing in the noised latents at each step as described above.

  3. When finished, go back to steps * rewind_min, generating a fresh set of noised latents for that interval of steps.

  4. De-noise from steps * rewind_min to steps, mixing in the noised latents as usual.

  5. Repeat steps 3 and 4, each time rewinding half as far as the last time, until the next rewind would start beyond steps * rewind_max.

For example, if steps = 100, rewind_min = 0.5, and rewind_max = 0.8, then the rewind process would de-noise the following intervals:

  1. From 0 to 100
  2. From 50 to 100
  3. From 75 to 100

A fourth pass would not be reached: rewinding half as far again would restart at step 87.5, which is greater than 80 (steps * rewind_max) and so would exceed rewind_max.
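The interval arithmetic above can be reproduced with a short function. This is a sketch of the procedure as described in this section, not the sampler's code: `rewind_intervals` is a hypothetical name, and treating a restart exactly at steps * rewind_max as allowed is an assumption.

```python
def rewind_intervals(steps, rewind_min, rewind_max):
    """List the (start, end) step ranges that get de-noised, in order."""
    intervals = [(0, steps)]  # the initial full pass
    start = steps * rewind_min
    limit = steps * rewind_max
    while start < steps and start <= limit:
        intervals.append((start, steps))
        # Rewind half as far next time: move the start halfway to the end.
        start += (steps - start) / 2
    return intervals


print(rewind_intervals(100, 0.5, 0.8))
# [(0, 100), (50.0, 100), (75.0, 100)]
```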